invoke-ai / InvokeAI

Invoke is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, and serves as the foundation for multiple commercial products.
https://invoke-ai.github.io/InvokeAI/
Apache License 2.0

No `test_dataloader()` method defined to run `Trainer.test` in epoch 13 #1162

Closed VincentLu91 closed 1 year ago

VincentLu91 commented 1 year ago

Describe the bug

I was custom-training my model using textual inversion on a dataset of 5 images. At epoch 13 I got: "No `test_dataloader()` method defined to run `Trainer.test`". Python version is 3.9; pytorch_lightning version is 1.6, since TestTubeLogger was removed from 1.7 onward.

To Reproduce

Steps to reproduce the behavior:

  1. Run python main.py --base ./configs/stable-diffusion/v1-finetune.yaml -t --actual_resume ./models/ldm/stable-diffusion-v1/model.ckpt -n name_of_run --gpus 0, --data_root /path/to/your/images
  2. At epoch 13, see error

Expected behavior

Training should complete after all epochs, and all loaders should be present throughout training.

Screenshots

Epoch 13:  99%|▉| 499/505 [45:38<00:32,  5.49s/it, loss=0.265, v_num=0, train/loss_simple_step=0.0114, train/loss_vlb_step=5.12e-5]
pop from empty list
DDIMSampler: 100%|██████████| 50/50 [00:03<00:00, 14.00it/s]
DDIMSampler: 100%|██████████| 50/50 [00:06<00:00,  8.13it/s]
Epoch 13:  99%|▉| 500/505 [45:51<00:27,  5.50s/it, loss=0.248, v_num=0, train/loss_simple_step=0.064, train/loss_vlb_step=0.000221]
Epoch 13, global step 7000: 'val/loss_simple_ema' was not in top 1
pop from empty list
DDIMSampler: 100%|██████████| 50/50 [00:03<00:00, 13.87it/s]
DDIMSampler: 100%|██████████| 50/50 [00:06<00:00,  8.13it/s]
[steps 501-504 repeat the same pattern: "pop from empty list", then two DDIMSampler passes, while the validation loader advances from 1/5 to 4/5]
Epoch 13: 100%|█| 505/505 [46:55<00:00,  5.58s/it, loss=0.248, v_num=0, train/loss_simple_step=0.064, train/loss_vlb_step=0.000221]
Average Epoch time: 201.32 seconds
Average Peak memory 9198.52MiB
Traceback (most recent call last):
  File "/opt/trainml/models/main.py", line 978, in <module>
    trainer.test(model, data)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in test
    return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _test_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1161, in _run
    verify_loop_configurations(self)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 46, in verify_loop_configurations
    __verify_eval_loop_configuration(trainer, model, "test")
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 197, in __verify_eval_loop_configuration
    raise MisconfigurationException(f"No `{loader_name}()` method defined to run `Trainer.{trainer_method}`.")
pytorch_lightning.utilities.exceptions.MisconfigurationException: No `test_dataloader()` method defined to run `Trainer.test`.
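For context, the check that produces this message can be sketched roughly like this (a simplified plain-Python stand-in, not the actual Lightning source; the class names here are hypothetical):

```python
# Illustrative sketch of Lightning's pre-run eval-loop check, which raises
# when Trainer.test is called on a model without a test_dataloader().
# Simplified stand-ins, not the real pytorch_lightning implementation.

class MisconfigurationException(Exception):
    pass

def verify_eval_loop_configuration(model, trainer_method="test"):
    # Lightning verifies that the model (or an attached datamodule)
    # can supply a dataloader for the requested stage.
    loader_name = f"{trainer_method}_dataloader"
    if getattr(model, loader_name, None) is None:
        raise MisconfigurationException(
            f"No `{loader_name}()` method defined to run `Trainer.{trainer_method}`."
        )

class TextualInversionModel:
    # The training script's model defines train/val loaders only,
    # so the "test" stage has nothing to run.
    def train_dataloader(self): ...
    def val_dataloader(self): ...
    test_dataloader = None

try:
    verify_eval_loop_configuration(TextualInversionModel(), "test")
except MisconfigurationException as exc:
    print(exc)  # No `test_dataloader()` method defined to run `Trainer.test`.
```

So the exception comes from `main.py` unconditionally calling `trainer.test(model, data)` after training, not from the training loop itself.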
Any-Winter-4079 commented 1 year ago

That is strange after 13 epochs. Did you reach your max train iters? I also tend to pass --no_test, but I'm not sure it's used in this version of TI. I suggest training with e.g. repeats: 1 in v1-finetune.yaml so you hit the error earlier (not after 505 * 13 steps). In that case 1 epoch will be 5 steps, I think (well, 10 with the DDIM steps); it won't learn anything, but it'll be better for debugging.

VincentLu91 commented 1 year ago

In v1-finetune.yaml I had set max_steps to 7000, so would that reach 13 epochs (or close to 13 epochs)? 505 * 13 works out to 6565 steps, I think.

Now I've set max_steps to 1000 and repeats to 1.

Command is now:

main.py --base ./configs/stable-diffusion/v1-finetune.yaml -t --actual_resume ./models/ldm/stable-diffusion-v1/model.ckpt -n n_name --gpus 0, --data_root /data/path/ --no_test

After 199 epochs, the training failed with the same error.

DDIMSampler: 100%|██████████| 50/50 [00:06<00:00,  7.93it/s]
DDIMSampler: 100%|██████████| 50/50 [00:11<00:00,  4.18it/s]
Epoch 199: 100%|█| 10/10 [31:53<00:00, 191.38s/it, loss=0.0806, v_num=0, train/loss_simple_step=0.00786, train/loss_vlb_step=3.75e-5]
Average Epoch time: 153.65 seconds
Average Peak memory 9198.27MiB
Traceback (most recent call last):
  File "/opt/trainml/models/main.py", line 978, in <module>
    trainer.test(model, data)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 939, in test
    return self._call_and_handle_interrupt(self._test_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _test_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1161, in _run
    verify_loop_configurations(self)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 46, in verify_loop_configurations
    __verify_eval_loop_configuration(trainer, model, "test")
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/configuration_validator.py", line 197, in __verify_eval_loop_configuration
    raise MisconfigurationException(f"No `{loader_name}()` method defined to run `Trainer.{trainer_method}`.")
pytorch_lightning.utilities.exceptions.MisconfigurationException: No `test_dataloader()` method defined to run `Trainer.test`.
Any-Winter-4079 commented 1 year ago

I get

`Trainer.fit` stopped: `max_steps=100` reached.
raise MisconfigurationException(f"No `{loader_name}()` method defined to run `Trainer.{trainer_method}`.")
pytorch_lightning.utilities.exceptions.MisconfigurationException: No `test_dataloader()` method defined to run `Trainer.test`.

after reaching max_steps.

Epoch 0: 100% 100/100 [02:53<00:00,  1.73s/it
Epoch 1: 100% 100/100 [03:03<00:00,  1.84s/it

I'm using pytorch-lightning 1.7.5 and CSVLogger (a recent PR in the dev branch changed TestTubeLogger to CSVLogger for cuda too).

Edit: looking at when it stops for you, Epoch 199 is really epoch number 200 (counting starts at Epoch 0), so you've done 200 x 10 = 2000 steps. But half of those are DDIM steps (each epoch you do 5 learning steps and 5 DDIM steps), so it's really stopping at (learning) step 1000, just as you set it.

Any-Winter-4079 commented 1 year ago

And in the case of Epoch 13 (really the 14th), you're doing 14 x 505 = 7070 steps, but 5 (images) * 14 = 70 of those are DDIM steps, so it stops at (learning) step 7000, as you specified in max_steps.
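The step accounting for the two runs above can be checked with quick arithmetic (a sketch; epoch counts are 0-indexed as in the logs):

```python
# First run: 505 steps/epoch, 5 training images, max_steps=7000.
epochs = 14                          # "Epoch 13" is the 14th epoch
total_steps = epochs * 505           # 7070 progress-bar steps
ddim_steps = epochs * 5              # 70 DDIM steps (5 images per epoch)
print(total_steps - ddim_steps)      # 7000 learning steps == max_steps

# Second run: 10 steps/epoch (5 learning + 5 DDIM), max_steps=1000.
epochs = 200                         # "Epoch 199" is the 200th epoch
print(epochs * 5)                    # 1000 learning steps == max_steps
```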

I set it to max_steps: 1000000000 so it does not bother me. I suggest you do something similar. Let me know if it solves it!

VincentLu91 commented 1 year ago

Besides changing max_steps, do I keep the repeats: 1 value the same?

Any-Winter-4079 commented 1 year ago

You could experiment, but I'm not sure it'll learn with repeats: 1. I have it set to repeats: 100, e.g.

params:
        size: 512
        set: train
        per_image_tokens: false
        repeats: 100 # images in folder (e.g. 50 images) * repeats

It'll then do an additional n steps (for n images in the training folder) per epoch. So e.g. for 50 training images and repeats: 100 -> 50 * 100 + 50 = 5050 steps per epoch.
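In other words, steps per epoch works out to images * repeats + images; checking with the numbers above:

```python
images = 50
repeats = 100
# repeated learning steps plus one extra step per image each epoch
steps_per_epoch = images * repeats + images
print(steps_per_epoch)  # 5050
```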

Any-Winter-4079 commented 1 year ago

Also, I have num_vectors_per_token: 6 (unrelated to this issue). If it doesn't learn well with num_vectors_per_token: 1, give that change a try.

VincentLu91 commented 1 year ago


I guess for you, too, the exception was raised when training finished after reaching max steps? It felt misleading: when I got an exception, I thought it was an error or that I'd done something wrong.

Regarding DDIM steps and max_steps, what I'm getting is: once the steps taken in an epoch push the total past max_steps (including the DDIM steps = n_epochs * n_images), training ends, hence the exception appears at the end of the output.

I'll set the repeats values back to the defaults; I think one was 100 and the other was 10.

To sum it up, is this not an error? Is the exception just a message to signal the end of training?

Any-Winter-4079 commented 1 year ago

Yes, it is not an error (it's a weird way to end training, I agree). Training goes on until it reaches max_steps (or is stopped via Control+C). I suggest setting a large max_steps so it never reaches it, and manually stopping training when you are satisfied. I often look at the images in the images/val folder: if they look good for that epoch, you can stop training after the embedding for that epoch is saved.

> Regarding DDIM steps and max_steps, what I'm getting is: once the steps taken in an epoch push the total past max_steps (including the DDIM steps = n_epochs * n_images), training ends, hence the exception appears at the end of the output.

DDIM steps are a bit murky. I think there are two types (in textual inversion training): DDIM for training and DDIM for validation. The DDIM steps for validation (the ones that end up creating images in images/val) are not counted towards max_steps, I think (speaking from memory, could be wrong), but the DDIM steps for training (the ones that end up creating images in images/train) may be.

-> Example: max_steps: 18, and it ran 6 epochs of 6 steps before stopping:

`Trainer.fit` stopped: `max_steps=18` reached.
Epoch 5: 100%|█| 6/6 [00:32<00:00

That'd be 6 x 6 = 36 steps (double what I set in max_steps). But that's why I say I think only n_images steps per epoch may be counted towards max_steps (in this case, 6 epochs x 3 images = 18 steps).

But mainly, n_epochs * n_images is what matters most (e.g. with 50 images and 100 repeats, that's 5000 steps, while the 50 validation steps would only take it to 5050).
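Under that reading (only n_images steps per epoch count towards max_steps, a hypothesis from the comment above, not confirmed behavior), the 18-step example works out as:

```python
# Hypothesis: not all steps shown in the progress bar count toward max_steps;
# only n_images steps per epoch do.
epochs = 6
bar_steps_per_epoch = 6
images = 3
print(epochs * bar_steps_per_epoch)  # 36 steps shown in the progress bar
print(epochs * images)               # 18 counted steps -> hits max_steps=18
```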

VincentLu91 commented 1 year ago

Thanks for explaining. Yeah, I'm trying to make sense of all this, and I was quite alarmed for the longest time when I got the exception. Good to know it wasn't an error at all.

I saw some of the images in the training and validation sets, and they look more or less plausible. Next, it would be great to figure out best practices for fine-tuning so the generated objects are as photorealistic as possible.

Again, thanks for the help @Any-Winter-4079 !