invoke-ai / InvokeAI

Invoke is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, and serves as the foundation for multiple commercial products.
https://invoke-ai.github.io/InvokeAI/
Apache License 2.0

batch_size changes messed with training #386

Closed david-ford closed 2 years ago

david-ford commented 2 years ago

```
Epoch 0:   3%| | 20/606 [00:14<07:00, 1.39it/s, loss=0.142, v_num=0, train/loss_simple_step=0.0314, train/loss_vlb_ste
C:\Users\Tyrant\anaconda3\envs\ldm\lib\site-packages\pytorch_lightning\utilities\data.py:56: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 20. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
  warning_cache.warn(
```

Just FYI, this is what happens now when trying to train models. There's no longer a way to set batch size, so it just figures it out on its own. My batch size was set to 1. There were many levels of change to batch_size from multiple people so I have no idea when/where it broke. :)
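For anyone wondering where that "batch size we found is 20" number comes from: when `batch_size` isn't passed to `self.log`, Lightning walks the logged collection and takes the length of the first sized thing it finds. A rough pure-Python sketch of that inference (heavily simplified; the real logic lives in `pytorch_lightning/utilities/data.py` and also handles tensors and strings):

```python
def infer_batch_size(batch):
    """Roughly mimic how pytorch_lightning guesses batch_size when none is
    given: recurse into dicts, and take len() of the first list/tuple leaf.
    Simplified sketch, not the library's actual implementation."""
    if isinstance(batch, dict):
        for value in batch.values():
            size = infer_batch_size(value)
            if size is not None:
                return size
        return None
    if isinstance(batch, (list, tuple)):
        # a sized collection: its length is taken as the batch size
        return len(batch)
    return None


# an "ambiguous collection" with 20 entries is read as batch_size=20,
# even if the dataloader batch size was actually set to 1
print(infer_batch_size({"images": [0] * 20}))  # -> 20
```

This is why a collection with 20 entries gets reported as batch size 20 regardless of what the dataloader was configured with.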

lstein commented 2 years ago

That's most unfortunate. We removed batch size from the higher level modules used for inference, and didn't intend to impact training. Are you using main.py for training, or some other entry point? I'm sure this can be fixed.

bakkot commented 2 years ago

> There were many levels of change to batch_size from multiple people so I have no idea when/where it broke.

@david-ford git bisect is a good way to track down this sort of thing. First find an old commit which works for you, then bisect to find where it broke. If you'd like this fixed, that's a good place to start.
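A self-contained sketch of the workflow bakkot describes, using a throwaway repo so the whole session is reproducible (the file name, commit messages, and the `grep` health check are invented for the demo; in practice the `git bisect run` command would be whatever training invocation reproduces the bug):

```shell
#!/bin/sh
set -e
# build a throwaway repo with a known regression buried in the history
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email demo@example.com
git config user.name demo
echo ok > app.txt
git add app.txt
git commit -qm "good: initial state"
git commit -q --allow-empty -m "good: unrelated commit"
echo broken > app.txt
git commit -aqm "bad: the regression lands here"
git commit -q --allow-empty -m "later commit on top"

# bad = HEAD, good = three commits back; `git bisect run` replays the
# check at each candidate and prints the first bad commit when done
git bisect start HEAD HEAD~3
result=$(git bisect run sh -c 'grep -q ok app.txt')
git bisect reset >/dev/null
echo "$result"
```

`git bisect run` marks a commit good when the check exits 0 and bad otherwise, so any script that reproduces the batch_size problem (even a slow one) can drive the whole search automatically.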

david-ford commented 2 years ago

I have a little more time to dig in today and see what I can find. My knowledge of Python is fairly limited though; I've got about six hours of actually working with the language under my belt 😂

I think all it needs is to have the arg passed in on main.py and then fed into the proper method; I don't get the impression it's a complicated fix. That said, I also need to double-check my version of pytorch, because passing empty outputs to the epoch_end method causes it to crash. I think in 1.11 the argument was required, but in 1.12 it's not.
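One defensive way to handle the signature change mentioned above (the `outputs` argument to the epoch-end hooks was dropped between Lightning versions) is to accept optional positional args, so the same hook works on either side of the change. A minimal sketch with a stand-in class, not the project's actual code:

```python
class TrainLogger:
    """Stand-in for a LightningModule; only the hook signature matters here."""

    def __init__(self):
        self.calls = []

    def on_train_epoch_end(self, *args, **kwargs):
        # newer pytorch_lightning calls this hook with no `outputs` argument,
        # while older versions pass one; *args absorbs both conventions
        self.calls.append(len(args))


logger = TrainLogger()
logger.on_train_epoch_end()        # new-style call, no outputs
logger.on_train_epoch_end([])      # old-style call with (empty) outputs
print(logger.calls)  # -> [0, 1]
```

This avoids having to delete the argument outright and then break on the other Lightning version.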

lstein commented 2 years ago

I haven't experimented with the training functionality at all at this point. You're using main.py to train, is that right? As far as I can see, we have not changed batch_size in this script, or in fact changed the script at all. The only dependency on changed code is ldm/util.py, which contains only cosmetic changes and doesn't affect batch_size.

I suspect that you're seeing effects from a change in an external library such as pytorch. This fork may be bringing in a different version of the library that exhibits the bug.

david-ford commented 2 years ago

Here's some more information to help with narrowing down where the issue might be. You are probably right, which is why I decided to wipe out my env and start over from step 1. That's a whole process on its own, lemme tell ya. One of these days I'll actually note all the extra steps that are missing for the windows side >.>

Here's a list of all the args that main.py thinks it supports, though many of them appear not to be actually wired up to anything.

```
[-h] [-n [NAME]] [-r [RESUME]] [-b [base_config.yaml [base_config.yaml ...]]] [-t [TRAIN]] [-p PROJECT] [-d [DEBUG]]
[-s SEED] [-f POSTFIX] [-l LOGDIR] [--no-test [NO_TEST]] [--scale_lr [SCALE_LR]] [--datadir_in_name [DATADIR_IN_NAME]]
[--actual_resume ACTUAL_RESUME] [--data_root DATA_ROOT] [--embedding_manager_ckpt EMBEDDING_MANAGER_CKPT]
[--placeholder_tokens PLACEHOLDER_TOKENS [PLACEHOLDER_TOKENS ...]] [--init_word INIT_WORD] [--logger [LOGGER]]
[--checkpoint_callback [CHECKPOINT_CALLBACK]] [--enable_checkpointing [ENABLE_CHECKPOINTING]]
[--default_root_dir DEFAULT_ROOT_DIR] [--gradient_clip_val GRADIENT_CLIP_VAL]
[--gradient_clip_algorithm GRADIENT_CLIP_ALGORITHM] [--process_position PROCESS_POSITION] [--num_nodes NUM_NODES]
[--num_processes NUM_PROCESSES] [--devices DEVICES] [--gpus GPUS] [--auto_select_gpus [AUTO_SELECT_GPUS]]
[--tpu_cores TPU_CORES] [--ipus IPUS] [--log_gpu_memory LOG_GPU_MEMORY]
[--progress_bar_refresh_rate PROGRESS_BAR_REFRESH_RATE] [--enable_progress_bar [ENABLE_PROGRESS_BAR]]
[--overfit_batches OVERFIT_BATCHES] [--track_grad_norm TRACK_GRAD_NORM]
[--check_val_every_n_epoch CHECK_VAL_EVERY_N_EPOCH] [--fast_dev_run [FAST_DEV_RUN]]
[--accumulate_grad_batches ACCUMULATE_GRAD_BATCHES] [--max_epochs MAX_EPOCHS] [--min_epochs MIN_EPOCHS]
[--max_steps MAX_STEPS] [--min_steps MIN_STEPS] [--max_time MAX_TIME] [--limit_train_batches LIMIT_TRAIN_BATCHES]
[--limit_val_batches LIMIT_VAL_BATCHES] [--limit_test_batches LIMIT_TEST_BATCHES]
[--limit_predict_batches LIMIT_PREDICT_BATCHES] [--val_check_interval VAL_CHECK_INTERVAL]
[--flush_logs_every_n_steps FLUSH_LOGS_EVERY_N_STEPS] [--log_every_n_steps LOG_EVERY_N_STEPS]
[--accelerator ACCELERATOR] [--strategy STRATEGY] [--sync_batchnorm [SYNC_BATCHNORM]] [--precision PRECISION]
[--enable_model_summary [ENABLE_MODEL_SUMMARY]] [--weights_summary WEIGHTS_SUMMARY]
[--weights_save_path WEIGHTS_SAVE_PATH] [--num_sanity_val_steps NUM_SANITY_VAL_STEPS]
[--resume_from_checkpoint RESUME_FROM_CHECKPOINT] [--profiler PROFILER] [--benchmark [BENCHMARK]]
[--deterministic [DETERMINISTIC]] [--reload_dataloaders_every_n_epochs RELOAD_DATALOADERS_EVERY_N_EPOCHS]
[--reload_dataloaders_every_epoch [RELOAD_DATALOADERS_EVERY_EPOCH]] [--auto_lr_find [AUTO_LR_FIND]]
[--replace_sampler_ddp [REPLACE_SAMPLER_DDP]] [--detect_anomaly [DETECT_ANOMALY]]
[--auto_scale_batch_size [AUTO_SCALE_BATCH_SIZE]] [--prepare_data_per_node [PREPARE_DATA_PER_NODE]]
[--plugins PLUGINS] [--amp_backend AMP_BACKEND] [--amp_level AMP_LEVEL] [--move_metrics_to_cpu [MOVE_METRICS_TO_CPU]]
[--multiple_trainloader_mode MULTIPLE_TRAINLOADER_MODE] [--stochastic_weight_avg [STOCHASTIC_WEIGHT_AVG]]
[--terminate_on_nan [TERMINATE_ON_NAN]]
```

I dug through main.py for a few hours trying to figure out how to fix that, with no luck. I went as far as hard-coding batch_size to 1 at every point it's used in main.py, and still no luck. That means it's getting overridden somewhere else along the chain.
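If the goal is just to expose batch size on the command line again, the usual pattern is a parser flag whose value is pushed into the data-module config before the Trainer is built. The flag name and config key below are assumptions for illustration, not the script's real options:

```python
import argparse


def parse_training_args(argv):
    # `--batch_size` is a hypothetical flag here; main.py would need to
    # feed its value into the data config (e.g. data.params.batch_size)
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=1)
    # parse_known_args leaves the script's many other flags untouched
    opt, _unknown = parser.parse_known_args(argv)
    return opt


data_config = {"params": {"batch_size": None}}
opt = parse_training_args(["--batch_size", "2", "-t"])
data_config["params"]["batch_size"] = opt.batch_size
print(data_config)  # -> {'params': {'batch_size': 2}}
```

The key point is that the value has to reach the object that actually builds the dataloaders; hard-coding it inside main.py does nothing if the dataloader batch size is taken from the YAML config instead.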

The command line that I was attempting to run is structured as follows:

```
python main.py -b configs\stable-diffusion\v1-finetune.yaml -t --actual_resume models\ldm\stable-diffusion-v1\model.ckpt -n scythe --gpus 0, --data_root ..\resources\images\processed\01 --init_word photograph --placeholder_tokens tool
```

Following along to this point, you will probably start down the same rabbit hole I did. You'll see a massive wall of warning logs about deprecations and such, and then you get to an error about on_train_epoch_end() not receiving the required param "outputs". After a quick search around the web, it turns out the fix is to just delete that argument.

On the next run you will get a new error: Trainer.test fails because there's no test method defined. This is where I'm stuck, because it happens right before output and crashes the whole process.

lstein commented 2 years ago

I may be able to help. Give me a quick synopsis for how to reproduce the bug. eg what do i need to do to kick off a training?

david-ford commented 2 years ago

> I may be able to help. Give me a quick synopsis for how to reproduce the bug. eg what do i need to do to kick off a training?

First, create a folder that you can easily reference from the project dir. I use a sibling folder at ..\resources\images\{img-topic}

Since I'm on windows, it makes me use "\" but if you're running a unix based OS it will probably want "/". You likely know this, but I'm adding it here for any lurkers that might be following along. :)

Add a few images to this folder that you are going to use for training. For the purpose of testing this they can be anything you want, just make sure they're the right size: 256x256 or 512x512, depending on your settings in configs\stable-diffusion\v1-finetune.yaml

Once you have done this part, all that remains is running this line from the console:

```
python main.py -t -b configs\stable-diffusion\v1-finetune.yaml --actual_resume models\ldm\stable-diffusion-v1\model.ckpt -n scythe --gpus 0, --data_root ..\resources\images\processed\01 --init_word photograph --placeholder_tokens tool
```

To break this down:

- `python main.py`: the script we need to run.
- `-t`: tells it that we're running this as a training session and flips the "train" flag to true so it will create a Trainer.
- `-b`: the path to the model config with all the training args.
- `--actual_resume`: the path to the existing model you're training off of.
- `-n`: the name for this training batch. It's used by the system for lookup later if you do a `-r` to resume training.
- `--gpus 0,`: I believe this is required to make it use CUDA, and to tell it which GPUs in a multi-GPU system to use.
- `--data_root`: the path to the folder containing the images you will be using to train.
- `--init_word`: a word that is used by the trainer to initialize the training model. This appears to be ignored if you have it set in your config, so I'm not sure why it's here.
- `--placeholder_tokens`: can be multiple words; it's used to help the trainer relate the images to existing tokens. Each word gets used in training as a substitute in the strings provided in `ldm\data\personalized.py`.
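On the `--gpus 0,` quirk: Lightning treats a bare number as a GPU count but a comma-separated string as a list of device indices, so the trailing comma is what selects "the GPU with index 0". A rough, simplified approximation of that parsing, not Lightning's actual code:

```python
def parse_gpus(value):
    """Approximate Lightning's --gpus string handling (simplified sketch):
    "2"   -> 2        a count: use any 2 GPUs
    "0,"  -> [0]      an index list: use GPU 0
    "0,1" -> [0, 1]   an index list: use GPUs 0 and 1
    """
    if "," in value:
        # comma present: interpret each non-empty part as a device index
        return [int(part) for part in value.split(",") if part.strip()]
    # no comma: interpret the whole string as a device count
    return int(value)


print(parse_gpus("0,"))   # -> [0]
print(parse_gpus("2"))    # -> 2
print(parse_gpus("0,1"))  # -> [0, 1]
```

Without the trailing comma, `--gpus 0` would mean "use zero GPUs", which is why training silently falls back to CPU.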

The first crash you will likely encounter is related to on_train_epoch_end(), located in main.py. Remove outputs from the list of args, then run again. It should get past this part, but the next crash comes when attempting to run Trainer.test, because no method was defined for self.test_dataloader, even though in main.py it gets defined on lines 255 and 308. I tried modifying this section to directly override these methods with defined ones on the class, but I had the same result. It appears that --no-test also gets ignored.

For the batch issue, you will see it reported when epoch 0 starts: multiple lines will report that it couldn't find the batch_size and had to infer it.

lstein commented 2 years ago

Hi David, just to let you know that I haven't forgotten about this and will get to work on it as soon as I complete the work needed to bring in memory optimizations and inpainting.

david-ford commented 2 years ago

No worries. I was able to download and set up the diffusers repo and get it working there in the meantime.