kohya-ss / sd-scripts

Textual Inversion not working with Prodigy #980

Open Poiuytrezay1 opened 9 months ago

Poiuytrezay1 commented 9 months ago

The LR gets stuck below 1e-6 when training textual inversion with Prodigy as the optimizer. Tested with both the caption and object templates. Switching the optimizer to AdamW with an LR of 1e-3 seemed to produce correct results. This is the command I ran:

accelerate launch --num_cpu_threads_per_process=8 "./train_textual_inversion.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=1024 --pretrained_model_name_or_path="I:\A1111\checkpoints\animefull-final-pruned-fp16.safetensors" --train_data_dir="I:\A1111\dataset\test_pivotal\img" --resolution="512,512" --output_dir="I:\A1111\dataset\test_pivotal\model" --logging_dir="I:\A1111\dataset\test_pivotal\logs" --save_model_as=safetensors --output_name="test_pivotal" --lr_scheduler_num_cycles="30" --max_token_length=225 --max_train_epochs="30" --max_data_loader_n_workers="0" --no_half_vae --learning_rate="1.0" --lr_scheduler="cosine" --lr_warmup_steps="14" --train_batch_size="4" --max_train_steps="465" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="fp16" --caption_extension=".txt" --cache_latents --optimizer_type="Prodigy" --optimizer_args "decouple=True" "weight_decay=0.01" "d_coef=0.8" "use_bias_correction=True" "safeguard_warmup=True" "betas=0.9,0.99" --max_train_epochs=30 --max_data_loader_n_workers="0" --max_token_length=225 --clip_skip=2 --keep_tokens="1" --bucket_reso_steps=64 --min_snr_gamma=5 --shuffle_caption --gradient_checkpointing --xformers --bucket_no_upscale --multires_noise_iterations="8" --multires_noise_discount="0.45" --token_string="jinxcat" --init_word="girl" --num_vectors_per_token=1 --use_object_template --sample_sampler=k_dpm_2_a --sample_prompts="I:\A1111\dataset\test_pivotal\model\sample\prompt.txt" --sample_every_n_epochs="1"

EDIT: I managed to get the LR past 1e-6 by disabling warmup and using a batch size of 1 (though it should also work with warmup).
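
For context: with Prodigy the base --learning_rate stays at 1.0 and the effective step size comes from the optimizer's internal d estimate, so "stuck below 1e-6" here means d never seems to grow past its d0 default (1e-6 in the prodigyopt implementation, as far as I can tell). A quick way to watch it, assuming prodigyopt keeps d in each param group (log_prodigy_d is just a throwaway helper name):

# throwaway helper to watch Prodigy's distance estimate during training;
# assumes the prodigyopt implementation, which stores "d" in each param group
def log_prodigy_d(optimizer, step):
    for i, group in enumerate(optimizer.param_groups):
        d = group.get("d")
        if d is not None:
            # effective step size is roughly d * lr (lr is 1.0 here)
            print(f"step {step} group {i}: d={d:.3e} effective_lr={d * group['lr']:.3e}")

# e.g. call log_prodigy_d(optimizer, global_step) right after optimizer.step()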

feffy380 commented 7 months ago

Maybe Prodigy doesn't react well to how most of the embeddings are overwritten with their original copies after each optimizer step?
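
Toy version of the pattern I mean, outside of sd-scripts (everything here is made up for illustration and uses prodigyopt directly): every embedding row gets stepped, then all rows except the trained one are copied back to their originals, so most of what the optimizer just did is undone behind its back.

# toy illustration only: step all embedding rows, then undo everything except one token,
# mimicking how train_textual_inversion.py restores the untouched embeddings each step
import torch
from prodigyopt import Prodigy  # assumes the prodigyopt package

emb = torch.nn.Embedding(10, 4)
orig = emb.weight.detach().clone()
index_no_updates = torch.ones(10, dtype=torch.bool)
index_no_updates[3] = False  # pretend token 3 is the one being trained

opt = Prodigy(emb.parameters(), lr=1.0)
for step in range(5):
    loss = emb(torch.arange(10)).pow(2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)
    with torch.no_grad():
        # restore every row the run isn't supposed to touch
        emb.weight[index_no_updates] = orig[index_no_updates]
    print(step, opt.param_groups[0]["d"])  # watch whether d ever grows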

feffy380 commented 7 months ago

I can confirm that zeroing out the gradients of the other tokens before stepping the optimizer helps prevent DAdaptAdam from exploding. Maybe it'll help with Prodigy too. However, it still chose a learning rate that was far too high and quickly overfit. I'll try your embedding norm PR next (general idea sketched at the end of this comment).

The current code forcibly resets the values after stepping the optimizer, which seems wrong to me because we should be preventing unwanted gradient updates, not reversing them after the fact.

# zero out gradients for all tokens we aren't training, before stepping the optimizer
for text_encoder, index_no_updates in zip(text_encoders, index_no_updates_list):
    input_embeddings_weight = accelerator.unwrap_model(text_encoder).get_input_embeddings().weight
    input_embeddings_weight.grad[index_no_updates] = 0

optimizer.step()
lr_scheduler.step()
optimizer.zero_grad(set_to_none=True)

# with torch.no_grad():
    # Let's make sure we don't update any embedding weights besides the newly added token
    # for text_encoder, orig_embeds_params, index_no_updates in zip(
    #     text_encoders, orig_embeds_params_list, index_no_updates_list
    # ):
    #     # if full_fp16/bf16, input_embeddings_weight is fp16/bf16, orig_embeds_params is fp32
    #     input_embeddings_weight = accelerator.unwrap_model(text_encoder).get_input_embeddings().weight
    #     input_embeddings_weight[index_no_updates] = orig_embeds_params.to(input_embeddings_weight.dtype)[
    #         index_no_updates
    #     ]
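
For the record, I don't know the details of the embedding norm PR, but the general idea as I understand it is: after each step, rescale only the trained rows so their L2 norm stays bounded, which should keep an overly aggressive Prodigy step from blowing the embedding up. Rough sketch only, reusing the names from the snippet above and assuming index_no_updates is a boolean mask; max_embedding_norm is a made-up knob and the actual PR may do it differently.

# sketch, not the actual PR: clamp the norm of the trained embedding rows after each step
max_embedding_norm = 0.4  # hypothetical cap; tune for the model being trained
with torch.no_grad():
    for text_encoder, index_no_updates in zip(text_encoders, index_no_updates_list):
        weight = accelerator.unwrap_model(text_encoder).get_input_embeddings().weight
        trained_rows = weight[~index_no_updates]  # rows we are actually training
        norms = trained_rows.norm(dim=-1, keepdim=True)
        scale = (max_embedding_norm / (norms + 1e-12)).clamp(max=1.0)  # only ever scale down
        weight[~index_no_updates] = trained_rows * scale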