lllyasviel / Paints-UNDO


Dataset and Training details? #4

Open Abhinay1997 opened 2 months ago

Abhinay1997 commented 2 months ago

Would love to know how the training data was created for this model. It looks and works like a T2V model, but I'm curious how the layers/timesteps were curated for training.

betterze commented 2 months ago

same question

PladsElsker commented 2 months ago

My best guess is that the multi-frame model was finetuned on relatively few samples, and then the output of the newly trained model was used to train/finetune the single-frame model. I'm not sure how you would do this otherwise.

<Hypothetical> If that's what they did, I think this sort of training scheme could lead to a whole other level of AI models. For example, you could train a multi-frame model to pan and rotate around characters, and then fine-tune a single-frame model to generate alternative views of a character. </Hypothetical>

So yeah, same question here.

MarkMoHR commented 2 months ago

same question

woct0rdho commented 2 months ago

I understand why this was trained, but I don't fully understand how. For example, was it trained on end-to-end painting procedures, or on separate stages like line art, colors, and details? How much synthetic data was used, as in some previous works on line-art extraction from the same author? I'll ask the author to share more details.

Erwin11 commented 2 months ago

I speculate that the dataset is similar to the ControlNet dataset: pairs of samples where one side is a line-art sketch with progressively more lines and the other is the corresponding finished image, like this: [image_line(5), image_line(10), image_line(15), image_line(20), image_line(25) : product_image], repeated for many different artworks. The number passed to the function 'image_line(num)' represents the count of lines within the given image. Just a guess; I am not a professional.
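A minimal sketch of what one such training tuple could look like in code, assuming a hypothetical `image_line(final, num_lines)` helper that re-renders an artwork using only its first `num_lines` strokes (that helper is purely illustrative and not something provided by this repo):

```python
from dataclasses import dataclass
from typing import List, Sequence

from PIL import Image


@dataclass
class UndoSample:
    """One hypothetical training tuple: progressively denser line-art frames plus the finished piece."""
    line_stages: List[Image.Image]   # e.g. renders with 5, 10, 15, 20, 25 lines
    final_image: Image.Image


def image_line(final: Image.Image, num_lines: int) -> Image.Image:
    """Illustrative placeholder: re-render `final` using only its first `num_lines` strokes.
    A real pipeline would need stroke-level recordings or a sketch-simplification model here."""
    raise NotImplementedError


def build_sample(final: Image.Image, stages: Sequence[int] = (5, 10, 15, 20, 25)) -> UndoSample:
    # Pair each partially drawn line-art render with the same finished product image.
    return UndoSample(line_stages=[image_line(final, n) for n in stages], final_image=final)
```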

Yosshi999 commented 2 months ago

I thought they came from live drawing videos on YouTube or something, because some examples frequently switch the visibility of layers. ( https://lllyasviel.github.io/pages/paints_undo/01showcase/61e47efc-b964-460d-8167-749685c51aeb_x264.mp4 ) However, these examples never zoom in, which is a frequent action in live drawing. Do .psd files in the wild contain their action history?

tijiang13 commented 2 months ago

> I thought they came from live drawing videos on YouTube or something, because some examples frequently switch the visibility of layers. ( https://lllyasviel.github.io/pages/paints_undo/01showcase/61e47efc-b964-460d-8167-749685c51aeb_x264.mp4 ) However, these examples never zoom in, which is a frequent action in live drawing. Do .psd files in the wild contain their action history?

I would assume it was trained with live drawing videos too. Essentially, you download the raw videos and subsample the frames. Outliers (such as zoom-ins) can be automatically removed by measuring the structural similarity between each frame and the final output.
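That outlier filter could be prototyped with plain SSIM, for example (a rough sketch, assuming scikit-image, uint8 grayscale frames of identical size, and an arbitrary threshold):

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim


def filter_outlier_frames(frames: list[np.ndarray], final: np.ndarray,
                          min_ssim: float = 0.2) -> list[np.ndarray]:
    """Keep only subsampled frames that share structure with the finished painting.

    Zoomed-in or panned frames have little structural overlap with the final
    canvas, so their SSIM score against it drops below the threshold.
    """
    return [f for f in frames if ssim(f, final, data_range=255) >= min_ssim]
```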

I guess it won't be long before artists need to add digital watermarks to their work to prevent it from being used as training data.

lllyasviel commented 2 months ago

Hi all, in accordance with the GitHub ToS, we must delete some comments.

For comments that are unrelated to the technical code, please follow the instructions here.

haltingstate commented 2 months ago

@lllyasviel

I can do 30 to 120 frames per second on an RTX 3090 if you are using the SD 1.5 backbone/UNet and the output frames from successive generations are similar (which is the case here).

I want to merge in these changes, or test them for this use case, to lower generation time. Five minutes per image is too long; I think it can be brought down to 5 to 10 seconds.

However, this repo does not have enough technical details.

It would help the community contribute to this model if there were more details in the README. It does not explain at all how the model works or what it is doing.

noxiouscardiumdimidium commented 2 months ago

> I thought they came from live drawing videos on YouTube or something, because some examples frequently switch the visibility of layers. ( https://lllyasviel.github.io/pages/paints_undo/01showcase/61e47efc-b964-460d-8167-749685c51aeb_x264.mp4 ) However, these examples never zoom in, which is a frequent action in live drawing. Do .psd files in the wild contain their action history?

The same as latent diffusion training, but diffusing on full-resolution images instead of downscaling to 64x64 latent tiles. Take a model that understands rough-sketch, mid-workflow, and finished representations of the same subject. Start with a "finished" image and run it through img2img over and over with low steps and incrementally higher denoise and CFG, using a prompt like "basic, unfinished, low_quality, rough_sketch line_art of [insert auto-tagger tags]". Create a new classifier model that can grade an image's "completeness" based on the percentage of pure-white/pure-black pixels. Train a convolutional network adapter to understand the "completeness" embedding as a function of time step, so it can next-token-predict feature changes between percentages. Now you have a model that can look at a "finished/100%" input, denoise it 100% into a 0%-complete sketch, use the 0% frame as the first keyframe and the "finished" input as the last keyframe, and then, at each sampling step, diffuse the white and black pixels away.
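A rough sketch of that iterative "un-finishing" loop, assuming the stock diffusers img2img pipeline with an SD 1.5 checkpoint, a placeholder prompt, and a local `finished_artwork.png` (this is only an illustration of the idea, not the repo's actual training code):

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# In the scheme described above the tags would come from an auto-tagger; this prompt is a placeholder.
prompt = "basic, unfinished, low_quality, rough_sketch line_art of a character portrait"
image = Image.open("finished_artwork.png").convert("RGB").resize((512, 512))

frames = [image]  # frame 0 = the finished piece
# Low step counts with progressively stronger denoise/CFG push the image back toward a rough sketch.
for strength, cfg in [(0.25, 4.0), (0.4, 5.0), (0.55, 6.0), (0.7, 7.0), (0.85, 8.0)]:
    image = pipe(prompt=prompt, image=image, strength=strength,
                 guidance_scale=cfg, num_inference_steps=12).images[0]
    frames.append(image)

frames.reverse()  # now ordered sketch -> finished, i.e. a synthetic "drawing process"
```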

(ChatGPT, after reading all the docs, researching the technical aspects, and analyzing my assessment of the training process:)

Input: "would this require ever seeing a real artists actual transparency layers, or frame scraping youtube videos to emulate "layer switching", or is features "popping in, and popping out" between sampling steps an integral function of denoise sampling during diffusion generation? "

Output: "The approach to training Paints-UNDO does not necessarily require access to real artists' actual transparency layers or frame scraping from YouTube videos. Instead, the model's ability to emulate "layer switching" and the "popping in and out" of features is inherently managed by the denoise sampling process within the diffusion generation technique. Here's how this works:

Diffusion-Based Sampling

  1. Denoise Sampling:
    • The process of denoise sampling in diffusion models involves starting with a noisy version of the image and gradually refining it to produce a cleaner, more detailed output. During each denoising step, features can appear and disappear based on the noise level and the conditioning inputs.
    • As the model denoises the image step by step, it learns to add details progressively. This inherently creates an effect similar to layers "popping in and out," which simulates the iterative process of adding and refining details in a digital artwork【15†source】【16†source】.

Training Without Explicit Layer Data

  1. Iterative Refinement:
    • By training on full-resolution images and using a prompt-based approach to generate different stages of completeness, the model can learn the intermediate steps of the drawing process without needing explicit data on individual transparency layers.
    • The use of a classifier to determine the "completeness" of an image based on pixel analysis allows the model to understand and generate intermediate states that mimic the artist's workflow from rough sketch to finished piece【17†source】【18†source】.

Model Training and Conditioning

  1. Simulating Artistic Process:
    • During training, the model is exposed to various stages of artwork, from rough sketches to fully detailed images. This exposure helps the model learn the transition and refinement process, capturing how features typically evolve in a drawing.
    • The iterative application of noise and denoising, combined with specific prompts, enables the model to create realistic transitions that resemble layer switching, without having to scrape frame data from videos【15†source】【17†source】.

Practical Implementation

  1. Keyframes and Diffusion:
    • The method involves creating keyframes at different stages of the artwork (e.g., rough sketch, mid-workflow, finished image) and using these keyframes to guide the diffusion process. This allows the model to interpolate between different stages, effectively simulating the addition and removal of details as seen in layer-based digital art creation.
    • The progressive denoising ensures that each step gradually adds more details, making it look like features are being added layer by layer, even though the model does not explicitly use transparency layers【18†source】【19†source】.

In summary, the iterative refinement process and the "popping in and out" of features during diffusion generation are integral functions of the denoise sampling method. This approach allows the model to simulate the artist's workflow and generate intermediate stages of artwork without the need for real artists' transparency layers or scraping video frames. The key lies in the diffusion model's training process, which captures the essence of layer switching and feature evolution through noise and denoising steps."
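For what it's worth, the crude "completeness" grading described above (percentage of pure-white/pure-black pixels) can be written in a few lines; this is only a toy illustration of that heuristic:

```python
import numpy as np


def completeness_score(gray: np.ndarray, white_thresh: int = 250, black_thresh: int = 5) -> float:
    """Toy 'completeness' heuristic for a uint8 grayscale image: the fewer
    near-white (blank canvas) or near-black (bare line-art) pixels it has,
    the more finished it is assumed to be."""
    unfinished = np.logical_or(gray >= white_thresh, gray <= black_thresh).mean()
    return 1.0 - float(unfinished)
```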

Yosshi999 commented 2 months ago

@noxiouscardiumdimidium Could you share the sources cited in ChatGPT's response? I have heard of a method for synthesizing HDR images from images generated with various prompts (https://arxiv.org/abs/2312.09168, Section 3.3), but I don't know of a method for more semantic interpolation, such as over an artistic process.