huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

[Feature Request] Dreambooth - Save intermediate checkpoints #732

Closed DominikDoom closed 1 year ago

DominikDoom commented 2 years ago

Is your feature request related to a problem? Please describe.
Dreambooth can drastically change its output quality between step counts, including for the worse if the chosen learning rate is too high for the step count or the amount of training / regularization images. This implementation only saves the model after training is finished, which requires full reruns to compare different step counts and also makes it impossible to salvage an overfitted model.

Describe the solution you'd like
A configurable way to save the model at certain step counts and continue training afterwards. Ideally, the script would accept two new parameters: one to specify the step interval to save at, and one to specify how many checkpoints to keep before overwriting. In some of the popular non-diffusers implementations, such as https://github.com/XavierXiao/Dreambooth-Stable-Diffusion and its forks, these arguments are called every_n_train_steps and save_top_k. However, since this implementation doesn't generate intermediate checkpoints by default, it would probably be better to find a more descriptive name.

Describe alternatives you've considered
Technically, it would also be possible to manually resume training from a previous checkpoint and use a low step count for each run, but this requires additional effort and is hard to do in some Colabs based on this implementation, so an integrated solution would be preferred.

Additional context
I tried a naive implementation by simply calling pipeline.save_pretrained every X steps (roughly as sketched below), but this led to an error after successfully saving a few files:

File "/diffusers/pipeline_utils.py", line 158, in save_pretrained save_method = getattr(sub_model, save_method_name)
TypeError: getattr(): attribute name must be string

I called the method the same way as the final save, including a call to accelerator.wait_for_everyone() beforehand, as suggested in the Accelerate documentation. Since I am not familiar with the Accelerate and Stable Diffusion architectures, I haven't been able to figure out why, but from the error message it seems that StableDiffusionPipeline could not resolve a valid save method name because it was missing some information about the model at this point.
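
A minimal sketch of the kind of per-step save described above, assuming the objects the training script already has in scope (accelerator, unet, text_encoder, args); the interval argument and output layout are hypothetical, and the body mirrors the script's final-save logic rather than the author's exact code:

import os
from diffusers import StableDiffusionPipeline

def save_intermediate(accelerator, unet, text_encoder, args, global_step, save_every=500):
    # Only save every `save_every` optimizer steps, on the main process, after syncing.
    if global_step % save_every != 0:
        return
    accelerator.wait_for_everyone()
    if accelerator.is_main_process:
        # Rebuild an inference pipeline around the current training weights,
        # the same way the final save at the end of training does.
        pipeline = StableDiffusionPipeline.from_pretrained(
            args.pretrained_model_name_or_path,
            unet=accelerator.unwrap_model(unet),
            text_encoder=accelerator.unwrap_model(text_encoder),
        )
        pipeline.save_pretrained(os.path.join(args.output_dir, f"step-{global_step}"))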

briansemrau commented 2 years ago

The error getattr(): attribute name must be string may be caused by replacing the safety checker with a dummy function. save_pretrained is unable to identify the model type of the function, so it fails.
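
For context, a common workaround at the time was to monkey-patch the safety checker with a plain function; save_pretrained cannot map such a function to a known model class, so the looked-up save method name ends up as None and getattr raises the TypeError above. A sketch of the failure mode and one way to avoid it (the model id, output path, and dummy function are illustrative assumptions, not code from this thread):

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")  # example base model

# Common but problematic workaround: replace the safety checker module with a plain function.
original_safety_checker = pipe.safety_checker
pipe.safety_checker = lambda images, **kwargs: (images, False)

# With the lambda in place, pipe.save_pretrained(...) fails with
# "getattr(): attribute name must be string". Restoring the real module first avoids the crash.
pipe.safety_checker = original_safety_checker
pipe.save_pretrained("intermediate-checkpoint")  # hypothetical output directory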

Cyberes commented 2 years ago

manually resume training from a previous checkpoint

Are you doing this with train_dreambooth.py by setting --pretrained_model_name_or_path or model.from_pretrained() to the last DreamBooth checkpoint?

DominikDoom commented 2 years ago

@Cyberes

manually resume training from a previous checkpoint

Are you doing this with train_dreambooth.py?

Yes, the script takes --pretrained_model_name_or_path as an argument. Most Colabs just point it at the Stable Diffusion repo on the Hugging Face Hub, but you can point it at any other pretrained diffusers model in the same way. So as long as you keep the diffusers model created at the end of training with save_pretrained, you can use it as the input for the next training run. I don't know how it affects the final quality for Dreambooth specifically, though.
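
To spell out the mechanism (illustrative paths only): the directory written by save_pretrained at the end of a run is itself a complete diffusers model, which is why it can be fed back into the script.

from diffusers import StableDiffusionPipeline

# The output directory of a previous run loads like any Hub repo id ...
pipe = StableDiffusionPipeline.from_pretrained("path/to/previous-output-dir")  # hypothetical local path

# ... and, for the same reason, it can be passed back to train_dreambooth.py via
# --pretrained_model_name_or_path="path/to/previous-output-dir" to continue training from it.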

Cyberes commented 2 years ago

Ok cool. I wasn't sure if I could do that with --pretrained_model_name_or_path.

I don't know how it affects the final quality though for dreambooth specifically.

I'll train a model on 100 steps and another on 800 + 200 continue and see what happens (same seeds).

patrickvonplaten commented 2 years ago

Gently pinging @patil-suraj here

Cyberes commented 2 years ago

I'm getting an error when continuing training from a Dreambooth model. It happens immediately after training finishes, during what I assume is the saving process.

Traceback (most recent call last):
  File "train_dreambooth.py", line 585, in <module>
    main()
  File "train_dreambooth.py", line 573, in main
    pipeline = StableDiffusionPipeline.from_pretrained(
  File "/usr/local/lib/python3.9/dist-packages/diffusers/pipeline_utils.py", line 482, in from_pretrained
    raise ValueError(
ValueError: Pipeline <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> expected {'vae', 'scheduler', 'text_encoder', 'feature_extractor', 'safety_checker', 'unet', 'tokenizer'}, but only {'vae', 'scheduler', 'text_encoder', 'unet', 'tokenizer'} were passed.

I can create a new issue if necessary.

patil-suraj commented 2 years ago

@DominikDoom Thanks a lot for the issue, working on adding intermediate checkpoint saving.

@Cyberes It seems that the safety checker is not saved in the model you are passing; that's what the error indicates. Make sure the safety checker is also saved there. Feel free to open an issue if the error persists even after that.
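
If the fine-tuned model really is missing those components, one workaround (a sketch; the base repo id and local path are placeholders, and from_pretrained accepts component overrides as keyword arguments) is to borrow them from the base pipeline:

from diffusers import StableDiffusionPipeline

# Load the base pipeline once to reuse the components the fine-tuned model lacks.
base = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")  # example base model

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/dreambooth-output",              # hypothetical path to the fine-tuned model
    safety_checker=base.safety_checker,
    feature_extractor=base.feature_extractor,
)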

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

patrickvonplaten commented 2 years ago

Gently ping here @patil-suraj

patil-suraj commented 2 years ago

The script is now updated to save intermediate checkpoints cf https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py#L675 just pass the --save_steps argument to the command.

Will close this issue now.

patrickvonplaten commented 1 year ago

Keeping this open, as intermediate checkpoint saving only really makes sense if we also have a way of "resuming" from a saved checkpoint and additionally save the accelerator state.

We should instead make use of the functionality provided by accelerate: https://huggingface.co/docs/accelerate/v0.15.0/en/usage_guides/checkpoint#checkpointing

Let's try to implement this for both dreambooth and textual inversion in a clean way.

We can just copy-paste the logic from transformers here: https://github.com/huggingface/transformers/blob/799cea64ac1029d66e9e58f18bc6f47892270723/examples/pytorch/language-modeling/run_clm_no_trainer.py#L598
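
For reference, the Accelerate checkpointing mentioned above boils down to save_state and load_state; a minimal sketch (the path and step counter are placeholders):

import os
from accelerate import Accelerator

accelerator = Accelerator()
output_dir = "dreambooth-output"  # placeholder output path
global_step = 500                 # placeholder step counter

# Save the full training state (models, optimizers, LR schedulers, RNG states)
# so training can later resume exactly where it left off.
accelerator.save_state(os.path.join(output_dir, f"checkpoint-{global_step}"))

# Resuming: restore the saved state, then skip already-seen batches before re-entering the loop.
accelerator.load_state(os.path.join(output_dir, f"checkpoint-{global_step}"))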

patrickvonplaten commented 1 year ago

Also, I think it makes a lot of sense to only save the last n checkpoints. This, however, should also be handled by accelerate.

patrickvonplaten commented 1 year ago

Regarding the feature request of saving only the last "n" checkpoints I opened a feature request in accelerate as this is probably the better place to implement such a feature: https://github.com/huggingface/accelerate/issues/914

cc @pcuenca

pcuenca commented 1 year ago

Fixed by #1668 (except keeping the last n checkpoints, to be adapted from https://github.com/huggingface/accelerate/issues/914).

jpc commented 1 year ago

Hey, sorry to say this, but I am not sure #1668 is actually an improvement for this use case. :/

I think the use case was to be able to use the checkpointed models only to generate images (to quickly do a sweep over different training lengths and compare the images). I'm not sure users care about resuming training (fine-tuning with DreamBooth is pretty fast anyway).

With the change, the checkpoints are 14 GB and cannot be used with StableDiffusionPipeline.from_pretrained, which kind of defeats the purpose. When I reverted the change, my checkpoints were 5 GB and I could easily generate images from the intermediate models and review them.

patrickvonplaten commented 1 year ago

@jpc the size of the saved checkpoints shouldn't change - it's just that by default more checkpoints are saved. This means that from_pretrained(...) works on each saved checkpoint.

You can however decide to only save the last checkpoint via: https://github.com/huggingface/accelerate/issues/914 if you'd like to save memory.

I also do think that #1668 solves exactly what was asked for by @DominikDoom, no?

@pcuenca maybe we could by default only save the last checkpoint?

DominikDoom commented 1 year ago

@patrickvonplaten

I also do think that https://github.com/huggingface/diffusers/pull/1668 exactly solves what was asked for by @DominikDoom no?

Yes, that was what I asked for. Going back to previous checkpoints to continue training, or a proper implementation of resuming in general, was what I had in mind, especially to enable multi-session training on Colab without losing progress and to make use of higher step counts. Comparing different models was part of my initial question, but notably with the requirement of "no full reruns", which of course also requires intermediate states and the ability to resume.

jpc commented 1 year ago

@DominikDoom Thanks for the explanation and sorry for misunderstanding your needs.

@patrickvonplaten Ok, I missed https://github.com/huggingface/diffusers/blob/main/docs/source/en/training/dreambooth.mdx#performing-inference-using-a-saved-checkpoint which explains why it was not working for me – you need to manually convert each accelerate checkpoint to an inference model. But I think it also means that my observation that from_pretrained does not work on checkpoints is, strictly speaking, correct (one needs to do the conversion first).
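
For later readers, the conversion amounts to loading the fine-tuned sub-models out of the checkpoint folder and into a regular pipeline. A sketch, assuming a checkpoint layout with a unet subfolder (as in more recent versions of the script; the exact layout depends on the diffusers and accelerate versions, and the paths and base model id are placeholders):

import torch
from diffusers import StableDiffusionPipeline, UNet2DConditionModel

# Load the fine-tuned UNet saved inside the intermediate checkpoint folder.
unet = UNet2DConditionModel.from_pretrained("dreambooth-output/checkpoint-500/unet")  # hypothetical path

# Plug it into the base pipeline and save the result as a regular diffusers model.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",  # example base model; use whichever base you trained from
    unet=unet,
    torch_dtype=torch.float16,
)
pipe.save_pretrained("dreambooth-output/step-500-inference")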

I don't want to sound demanding, and I am very grateful for your excellent work. I just wanted to explain myself better (feel free to ignore this; I am not trying to make a feature request):

I think for the size part we are talking about different things. The old-style snapshots are a lot smaller than the full checkpoints because they do not contain any of the optimizer state that is needed to resume training. I've switched back to the old implementation (the one that creates a new pipeline and calls pipeline.save_pretrained on it instead of accelerator.save_state) and the snapshots are a lot smaller and directly usable as input to from_pretrained.

I understand that this offers limited functionality, but I just wanted to note that I find it very useful: in my case I only use the snapshots of the model to figure out when it started to overfit. That lets me easily find the best snapshot and use it as my final model without retraining. This use case is of course also supported with checkpoints, but it requires more disk space (which I believe is not fixable) and involves the manual conversion step (which maybe could be automated away inside from_pretrained).

patrickvonplaten commented 1 year ago

I see now where the misunderstanding comes from! I think it makes perfect sense to use save_pretrained(...) instead if it better fits your use case :-)

martin-haynes commented 1 year ago

This discussion is very helpful. Big thank you to the sponsors and contributors for taking the time to discuss and explain.

Since the history is still fairly recent, I'll risk a bit of noise to call attention to the fix that @pcuenca applied in commit 31336dae3. Not that my investigation wasn't valuable for understanding the code base, but a better use of time for anyone arriving here is likely to just pull the repo and use it as the source of truth for their local build environment. :)

Keeping with SO etiquette, I'll explain briefly: prior to the fix, first_epoch was defined in terms of gradient_accumulation_steps, while num_update_steps_per_epoch was initialized with that factor divided out.

# We need to recalculate our total training steps as the size of the training dataloader may have changed.
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)

As a result, when resuming from a checkpoint with --gradient_accumulation_steps > 1, the starting epoch would be off by a factor equal to that argument, and training, not surprisingly, would begin, load, and exit prematurely without any corresponding error message. 😞
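
A worked example with hypothetical numbers (following the pre-fix definitions described above) makes the off-by-factor visible:

# Hypothetical run: 100 batches per epoch, gradient_accumulation_steps = 2,
# so 50 optimizer updates per epoch, and 400 updates (8 epochs) completed so far.
gradient_accumulation_steps = 2
num_update_steps_per_epoch = 50  # ceil(100 / 2), accumulation already factored out
global_step = 400                # optimizer updates completed before resuming

resume_global_step = global_step * gradient_accumulation_steps  # 800, counted in raw batches

first_epoch_broken = resume_global_step // num_update_steps_per_epoch  # 16 -> resumes far too late
first_epoch_fixed = global_step // num_update_steps_per_epoch          # 8  -> the correct epoch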

Hope this helps someone.

patrickvonplaten commented 1 year ago

Thanks a lot for your clear message @martin-haynes!

I think we've recently done some fixes to exactly this problem. @pcuenca do you maybe have an idea or could you investigate what is going on there? :-)

martin-haynes commented 1 year ago

It should be fixed, @patrickvonplaten, though I had already patched my local copy independently. I've copied the relevant commit above. The broken first_epoch calculation on line 766 was fixed in the commit @pcuenca introduced on Jan 24, 2023.

<<<  first_epoch = resume_global_step // num_update_steps_per_epoch
>>>  first_epoch = global_step // num_update_steps_per_epoch

My callout was mostly to assist folks who land on this issue thread without the context that the commit history provides. Since Pedro's PR dealt with a number of other issues relevant to resuming from a checkpoint, I thought it would be valuable to add a bit of voiceover to this specific one, especially since it resulted in a silent failure, sure to leave many confused about what went wrong and how to resolve it.

tl;dr -> git pull