Closed: VincentLu91 closed this issue 1 year ago.
That is strange after 13 epochs.
Did you reach your max train iters? I also tend to pass `--no_test`, but I'm not sure it's used in this version of TI.
I suggest training with e.g. `repeats: 1` in `v1-finetune.yaml` so you get to the error earlier (not after 505 * 13 steps). In that case 1 epoch will be 5 steps I think (well, 10 with the DDIM steps); it won't learn, but it'll be better for debugging.
I had set `max_steps` to 7000, so would that reach 13 epochs (or maybe close to 13 epochs)? 505 * 13 would equal 6565 steps I think. Now I set `max_steps` to 1000 and `repeats` to 1.
Command is now:
main.py --base ./configs/stable-diffusion/v1-finetune.yaml -t --actual_resume ./models/ldm/stable-diffusion-v1/model.ckpt -n n_name --gpus 0, --data_root /data/path/ --no_test
After 199 epochs, the training failed with the same error.
DDIMSampler: 100%|███████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:06<00:00, 7.93it/s]
DDIMSampler: 100%|███████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:11<00:00, 4.18it/s]
Epoch 199: 100%|█| 10/10 [31:53<00:00, 191.38s/it, loss=0.0806, v_num=0, train/loss_simple_step=0.00786, train/loss_vlb_step=3.75e-5, tra
Average Epoch time: 153.65 seconds
Average Peak memory 9198.27MiB
Traceback (most recent call last):
File "/opt/trainml/models/main.py", line 978, in <module>
trainer.test(model, data)
File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in test
return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _test_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1161, in _run
verify_loop_configurations(self)
File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 46, in verify_loop_configurations
__verify_eval_loop_configuration(trainer, model, "test")
File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 197, in __verify_eval_loop_configuration
raise MisconfigurationException(f"No `{loader_name}()` method defined to run `Trainer.{trainer_method}`.")
pytorch_lightning.utilities.exceptions.MisconfigurationException: No `test_dataloader()` method defined to run `Trainer.test`.
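For context, the check that raises this exception lives in Lightning's configuration validator, which runs before `Trainer.test` executes. Here is a dependency-free sketch of that logic (class and function names are mine, not pytorch-lightning internals; the real validator uses `is_overridden` rather than `hasattr`, since `LightningModule` defines placeholder hooks):

```python
# Simplified model of Lightning's eval-loop check (names are illustrative):
# Trainer.test first verifies the model defines test_dataloader(), and raises
# MisconfigurationException otherwise -- which is why main.py only fails after
# training completes and trainer.test(model, data) is called.
class MisconfigurationException(Exception):
    pass

class TextualInversionModel:
    def train_dataloader(self):
        return ["batch"]  # training data exists
    # note: no test_dataloader() defined

def verify_eval_loop_configuration(model, loader_name="test_dataloader"):
    if not hasattr(model, loader_name):
        raise MisconfigurationException(
            f"No `{loader_name}()` method defined to run `Trainer.test`."
        )

try:
    verify_eval_loop_configuration(TextualInversionModel())
except MisconfigurationException as exc:
    print(exc)  # No `test_dataloader()` method defined to run `Trainer.test`.
```

Passing `--no_test` (when the script honors it) or defining a trivial `test_dataloader()` would both sidestep this check.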
I get

`Trainer.fit` stopped: `max_steps=100` reached.
raise MisconfigurationException(f"No `{loader_name}()` method defined to run `Trainer.{trainer_method}`.")
pytorch_lightning.utilities.exceptions.MisconfigurationException: No `test_dataloader()` method defined to run `Trainer.test`.

after reaching `max_steps`.
Epoch 0: 100% 100/100 [02:53<00:00, 1.73s/it
Epoch 1: 100% 100/100 [03:03<00:00, 1.84s/it
I'm using pytorch-lightning 1.7.5 and CSVLogger (a recent PR in the dev branch changed TestTubeLogger to CSVLogger for cuda too).
Edit: looking at when it stops for you, Epoch 199 is really Epoch number 200 (starts at Epoch 0), so you've done 200 x 10 = 2000 steps. But of those, half are with DDIM (each Epoch you do 5 learning steps and 5 DDIM steps), so it's really stopping at (learning) step 1000, just as you set it to.
And in the case of Epoch 13 (it's really the 14th), you're doing 14 x 505 steps = 7070, but 5 (images) * 14 = 70 are DDIM steps, so it stops at (learning) step 7000 (as you specified in `max_steps`).
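The accounting above can be sketched as follows (the helper name is mine; this just restates the arithmetic, assuming only learning steps count toward `max_steps`):

```python
# Each epoch runs some learning steps plus n_images DDIM/sampling steps; only
# the learning steps count toward max_steps, per the discussion above.
def learning_steps(n_epochs, learning_steps_per_epoch):
    return n_epochs * learning_steps_per_epoch

# "Epoch 199" case: 200 epochs (counting from 0) x 5 learning steps per epoch
print(learning_steps(200, 5))   # 1000 -> matches max_steps=1000
# "Epoch 13" case: 14 epochs x 500 learning steps per epoch
print(learning_steps(14, 500))  # 7000 -> matches max_steps=7000
```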
I set it to `max_steps: 1000000000` so it does not bother me. I suggest you do something similar.
Let me know if it solves it!
Besides changing `max_steps`, do I keep the `repeats: 1` values the same?
You could experiment, but I'm not sure it'll learn with `repeats: 1`. I have it set to `repeats: 100`, e.g.
params:
  size: 512
  set: train
  per_image_tokens: false
  repeats: 100 # images in folder (e.g. 50 images) * repeats
And then it'll do an additional n steps (for n images in training folder) per Epoch. So e.g. for 50 training images and repeats: 100 -> 50 * 100 + 50 = 5050 per epoch.
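That per-epoch arithmetic, sketched (the function name is mine):

```python
# Steps per epoch = images_in_folder * repeats (learning steps) plus
# images_in_folder extra DDIM/validation steps, per the example above.
def steps_per_epoch(n_images, repeats):
    return n_images * repeats + n_images

print(steps_per_epoch(50, 100))  # 5050, as in the 50-image example
print(steps_per_epoch(5, 100))   # 505, matching the earlier 505-steps-per-epoch runs
```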
Also, I have `num_vectors_per_token: 6` (unrelated to this). In case it doesn't learn well with `num_vectors_per_token: 1`, give that change a try.
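If you want to try that, `num_vectors_per_token` typically lives under the personalization config in `v1-finetune.yaml`. A hedged fragment (check the exact nesting against your copy of the config):

```yaml
model:
  params:
    personalization_config:
      params:
        num_vectors_per_token: 6  # default is 1; more vectors can capture more detail
```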
I guess for you, the exception was reached when you finished training after max steps? It felt misleading; when I got the exception I thought it was an error or something I did wrong.
Regarding DDIM steps and max_steps, what I'm getting is: if the step count after an epoch exceeds max_steps (including the DDIM steps = n_epochs * n_images), that's when training ends, hence the exception appears at the end of the output.
I'll set the `repeats` values back to the defaults, as I think one was 100 and the other was 10.
To sum it up, is this not an error? Is the exception just a message to signal the end of training?
Yes, it is not an error (it's a weird way to end the training, I agree). Training goes on until it reaches `max_steps` (or it is stopped via Control+C). I suggest setting a large `max_steps` so it never reaches it, and manually stopping training when you are satisfied (I often look at the images in the `images/val` folder; if they are good for that epoch, you can stop training after the embedding for the epoch is saved).
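For reference, that setting would go in the lightning/trainer section of `v1-finetune.yaml`. A hedged fragment (verify the nesting in your copy of the config):

```yaml
lightning:
  trainer:
    max_steps: 1000000000  # effectively "never stop"; interrupt training manually instead
```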
> Regarding DDIM steps and max_steps, what I'm getting is: if the step count after an epoch exceeds max_steps (including the DDIM steps = n_epochs * n_images), that's when training ends, hence the exception appears at the end of the output.
DDIM steps are a bit murky. I think there are two types (in Textual Inversion training): DDIM for training and DDIM for validation. The DDIM steps for validation (the ones that end up creating images in `images/val`) are not counted towards `max_steps` I think (speaking from memory, could be wrong), but DDIM steps for training (the ones that end up creating images in `images/train`) may be.
-> Example: `max_steps: 18`, and it ran 6 epochs of 6 steps before stopping:

`Trainer.fit` stopped: `max_steps=18` reached.
Epoch 5: 100%|█| 6/6 [00:32<00:00

That'd be 6 x 6 = 36 steps (double of what I set in `max_steps`). But that's why I say I think only `n_images` steps may be counted towards `max_steps` each epoch (in this case, 6 epochs x 3 images = 18 steps).
But mainly, `n_epochs * n_images` are what matter most (e.g. with 50 images and 100 repeats, that's 5000 steps, while 50 validation steps would only take it to 5050 steps).
Thanks for explaining. Yeah I'm trying to make sense of all this and I was quite alarmed for the longest time when I got an exception. Good to know it was not an error at all.
I saw some of the images in the training and validation sets, and they look more or less plausible. It would be great to next figure out best practices for fine-tuning training to generate objects that are as photo-realistic as possible.
Again, thanks for the help @Any-Winter-4079 !
Describe the bug
I was custom training my model using textual inversion on a dataset with 5 images. At epoch 13 I got: "No `test_dataloader()` method defined to run `Trainer.test`". Python version is 3.9; pytorch-lightning's version is 1.6, as TestTubeLogger was removed in 1.7 onward.

To Reproduce
Steps to reproduce the behavior:
python main.py --base ./configs/stable-diffusion/v1-finetune.yaml -t --actual_resume ./models/ldm/stable-diffusion-v1/model.ckpt -n name_of_run --gpus 0, --data_root /path/to/your/images
Expected behavior
Training should complete after all epochs, and all loaders should be present throughout training.