Abhinay1997 opened 2 months ago
same question
My best guess is that the multi-frame model was finetuned on relatively few samples, and then the output of the newly trained model was used to train/finetune the single-frame model. I'm not sure how you would do this otherwise.
\<Hypothetical> If that's what they did, I think this sort of training scheme could lead to a whole other level of AI models. For example, you could train a multi-frame model to pan and rotate around characters, and then fine-tune a single-frame model to generate alternative views of a character. \</Hypothetical>
So yeah, same question here.
same question
I understand why this was trained, but I don't fully understand how. For example, was it trained on end-to-end painting procedures, or on separate stages like line art, colors, and details? How much synthetic data was used, as in some previous works on line art extraction from the same author? I'll ask the author to share more details.
I speculate that the dataset is similar to the ControlNet dataset, consisting of pairs of images where one is a line art sketch with progressively added lines and the other is the corresponding full image, like this: [image_line(5), image_line(10), image_line(15), image_line(20), image_line(25) : product_image], with one such tuple per training sample. The number passed to the function 'image_line(num)' represents the count of lines within the given image. Just a guess; I am not a professional.
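To make the guessed layout concrete, here is a minimal sketch of how such training tuples might be assembled. Everything here is hypothetical: `make_sample`, `toy_renderer`, and the line counts are illustrative placeholders, not functions from the Paints-UNDO repo.

```python
# Hypothetical sketch of the speculated dataset layout: each training
# sample pairs a sequence of progressively denser line-art renderings
# with the finished product image.

def make_sample(product_image, render_line_art, line_counts=(5, 10, 15, 20, 25)):
    """Build one (line-art sequence, finished image) training pair."""
    sequence = [render_line_art(product_image, n) for n in line_counts]
    return sequence, product_image

def toy_renderer(image, num_lines):
    # Stand-in renderer: just tags the image id with the line count.
    # A real pipeline would rasterize a partial line-art extraction.
    return f"{image}@{num_lines}lines"

seq, target = make_sample("img001", toy_renderer)
```

The sequence-plus-target shape is what a multi-frame conditioning setup would consume; how the partial line arts are actually produced is the open question.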
I thought they came from live drawing videos on YouTube or something, because some examples frequently switch the visibility of layers. ( https://lllyasviel.github.io/pages/paints_undo/01showcase/61e47efc-b964-460d-8167-749685c51aeb_x264.mp4 ) However, these examples never zoom in, which is a frequent action in live drawing. Does a .psd in the wild contain its action history?
I would assume it was trained with live drawing videos too. Essentially, you download the raw video and subsample the frames. Outliers (such as zoom-ins) can be removed automatically by measuring the structural similarity between each frame and the final output.
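The filtering step suggested above can be sketched as follows. This is a toy illustration, not the repo's actual pipeline: it uses a Pearson correlation of pixel values as a crude stand-in for SSIM, and the 0.5 threshold and frame shapes are assumptions.

```python
import numpy as np

def frame_similarity(a, b):
    # Pearson correlation of flattened pixel values: a crude stand-in
    # for SSIM. High for near-identical frames, near zero for unrelated
    # content such as a zoomed-in crop.
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

def filter_frames(frames, final_frame, threshold=0.5):
    """Keep only subsampled frames structurally similar to the finished image."""
    return [f for f in frames if frame_similarity(f, final_frame) >= threshold]

# Toy example: one frame close to the final image, one unrelated "zoomed" frame.
rng = np.random.default_rng(0)
final = rng.random((64, 64))
similar = np.clip(final + 0.05 * rng.standard_normal((64, 64)), 0.0, 1.0)
unrelated = rng.random((64, 64))
kept = filter_frames([similar, unrelated], final)  # keeps only the similar frame
```

A real pipeline would likely use a proper perceptual metric (SSIM or LPIPS) and tune the threshold per video, since canvas rotation and layer toggling also lower raw pixel correlation.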
Guess it won’t be long before artists will need to add digital watermarking to their works to prevent them from being used as training data.
Hi all, in accordance with the GitHub ToS we must delete some comments.
For comments that are unrelated to the technical code, please follow the instructions here.
@lllyasviel
I can do 30 to 120 frames per second on an RTX 3090 if you are using the SD 1.5 backbone/UNet and the output frames from successive generations are similar (which is the case here).
I want to merge in these changes, or at least test them for this use case, to lower generation time. Five minutes per image is too long; I think it can be brought down to 5 to 10 seconds.
However, this repo does not have enough technical details.
More detail in the README would help the community contribute to this model. Right now it does not explain how it works or what it is doing at all.
The same as latent diffusion training, but diffusing based on full-resolution images instead of downscaling to 64x64 latent tiles. Take a model that understands rough-sketch, mid-workflow, and finished representations of the same subject. Start with a "finished" image and run image-to-image on it over and over with low steps and incrementally higher denoise and CFG, with a prompt like "basic, unfinished, low_quality, rough_sketch line_art of [insert auto tagger tags]". Create a new classifier model that can grade an image's "completeness" based on the percentage of pure-white/pure-black pixels. Train a convolutional network adapter to understand the "completeness" embedding as a function of timestep, so it can next-token predict feature changes between percentages. Now you have a model that can look at a "finished/100%" input, fully denoise it into a 0%-complete sketch, use the 0% frame as the first keyframe and the "finished" input as the last keyframe, and then for each sampling step diffuse the white and black pixels away.
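The "completeness" grading idea above can be sketched very simply: score an image by the fraction of pixels that are pure white or pure black, since a rough sketch is mostly blank paper and ink lines while a finished painting has few extreme-valued pixels. This is a hypothetical illustration of the heuristic, not anything from the repo; the function name and the binary definition of "pure" are assumptions.

```python
import numpy as np

def completeness_score(image):
    """Return a 0..1 score: 0.0 means every pixel is pure white (255) or
    pure black (0), i.e. a bare sketch; 1.0 means no such pixels remain,
    i.e. fully painted. Accepts uint8 grayscale (H, W) or color (H, W, C)."""
    flat = image.reshape(-1, image.shape[-1]) if image.ndim == 3 else image.reshape(-1, 1)
    pure_white = np.all(flat == 255, axis=1)
    pure_black = np.all(flat == 0, axis=1)
    return 1.0 - float(np.mean(pure_white | pure_black))

# Toy inputs: a line sketch on white paper vs. a mid-gray "painted" image.
sketch = np.full((8, 8), 255, dtype=np.uint8)   # blank white paper
sketch[2, :] = 0                                # one black line
painted = np.full((8, 8), 128, dtype=np.uint8)  # every pixel painted
```

In practice a learned classifier would be more robust than this pixel heuristic (off-white paper, anti-aliased lines, and scanned textures all break the exact 0/255 test), but it shows the signal the adapter would condition on.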
(ChatGPT, after reading all the docs, researching the technical aspects, and analyzing my assessment of the training process)
Input: "Would this require ever seeing a real artist's actual transparency layers, or frame-scraping YouTube videos to emulate "layer switching", or are features "popping in and popping out" between sampling steps an integral function of denoise sampling during diffusion generation?"
Output: "The approach to training Paints-UNDO does not necessarily require access to real artists' actual transparency layers or frame scraping from YouTube videos. Instead, the model's ability to emulate "layer switching" and the "popping in and out" of features is inherently managed by the denoise sampling process within the diffusion generation technique. Here's how this works:
In summary, the iterative refinement process and the "popping in and out" of features during diffusion generation are integral functions of the denoise sampling method. This approach allows the model to simulate the artist's workflow and generate intermediate stages of artwork without the need for real artists' transparency layers or scraping video frames. The key lies in the diffusion model's training process, which captures the essence of layer switching and feature evolution through noise and denoising steps."
@noxiouscardiumdimidium Would you give me the sources cited in ChatGPT's response? I have heard of a method to synthesize HDR images from images generated by various prompts (https://arxiv.org/abs/2312.09168, Section 3.3), but I don't know of a method for more semantic interpolation, such as of an artistic process.
Would love to know how the training data was created for this model. This looks and works like a T2V model, but I'm curious how the layers/timesteps were curated for training.