v_prediction in text_to_image weird images

BurgerAndreas commented 9 months ago

Describe the bug

Using --prediction_type="v_prediction" with the example text_to_image_lora.py script leads to very weird images:

With --prediction_type="epsilon" (default) the images turn out great:

Reproduction

git clone https://github.com/huggingface/diffusers.git
cd diffusers/examples/text_to_image

Same command as in the provided text_to_image_lora example, but with --mixed_precision="fp16" removed and --prediction_type="v_prediction" added:

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
accelerate launch train_text_to_image_lora.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --dataset_name=$DATASET_NAME --caption_column="text" \
    --resolution=512 --random_flip \
    --train_batch_size=1 \
    --num_train_epochs=100 --checkpointing_steps=5000 \
    --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
    --seed=42 \
    --output_dir="sd-pokemon-model-lora" \
    --validation_prompt="dragon" --report_to="wandb" \
    --prediction_type="v_prediction"

Other settings that do not work with --prediction_type="v_prediction"

rank=4,6,8
learning_rate=1e-4 to 1e-6
num_train_epochs=up to 200
DDPMScheduler and DDIMScheduler

Logs

No response

System Info

diffusers version: 0.27.0.dev0 (also tested on 0.26)
Platform: Linux-4.15.0-213-generic-x86_64-with-glibc2.27
Python version: 3.10.12
PyTorch version (GPU?): 2.2.0+cu118 (False)
Huggingface_hub version: 0.20.3
Transformers version: 4.37.2
Accelerate version: 0.26.1
xFormers version: 0.0.24+cu118
Using GPU in script?: Yes. Tested on: RTX6000 via sbatch (slurm cluster), RTX3060 (local workstation)
Using distributed or parallel set-up in script?: No

Who can help?

@sayakpaul @patrickvonplaten

sayakpaul commented 9 months ago

I don't think it's an issue candidate. It's better off in the "discussions".

You're trying to fine-tune a model that wasn't trained with v-prediction, so adapting it to that prediction objective would require a bit of experimentation, I imagine.
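To make the mismatch concrete, here is a minimal sketch of the two regression targets, using the standard definition v = sqrt(alpha_bar) * noise - sqrt(1 - alpha_bar) * x0 (as far as I know, this is what the scheduler's get_velocity computes in the training script). The scalar values and the function names are hypothetical, chosen only for illustration; real training works on latent tensors.

```python
import math

def epsilon_target(x0, noise, alpha_bar):
    """Target for prediction_type="epsilon": the added noise itself."""
    return noise

def v_target(x0, noise, alpha_bar):
    """Target for prediction_type="v_prediction" (velocity):
    v = sqrt(alpha_bar) * noise - sqrt(1 - alpha_bar) * x0
    """
    return math.sqrt(alpha_bar) * noise - math.sqrt(1.0 - alpha_bar) * x0

# Hypothetical scalar stand-ins for a clean latent and a noise sample.
x0, noise = 1.0, 0.5

# alpha_bar close to 1 means little noise, close to 0 means almost pure noise.
for alpha_bar in (0.999, 0.5, 0.001):
    eps = epsilon_target(x0, noise, alpha_bar)
    v = v_target(x0, noise, alpha_bar)
    print(f"alpha_bar={alpha_bar}: epsilon target={eps:.3f}, v target={v:.3f}")
```

At low noise the v target is close to the noise, so an epsilon-trained model starts near the right answer there. At high noise the v target approaches -x0, which is nothing like what the pretrained weights output, so a short LoRA run may not be enough to bridge the gap.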

Ccing @patil-suraj for further advice.

BurgerAndreas commented 8 months ago

I did not know - is changing the prediction objective known to create these issues?

Would love to hear your thoughts @patil-suraj and @sayakpaul

sayakpaul commented 8 months ago

If your model was trained using, say, the Karras-style objective, then fine-tuning it with "epsilon prediction" might have repercussions, because for the former the timesteps are continuous while for the latter they are usually discrete.
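As a rough illustration of the discrete side of that distinction, the sketch below maps the integer training timesteps of a DDPM-style schedule onto noise levels via sigma = sqrt((1 - alpha_bar) / alpha_bar). The schedule parameters (1000 steps, "scaled linear" betas from 0.00085 to 0.012) are an assumption based on the usual Stable Diffusion v1 scheduler config, and the helper names are hypothetical.

```python
import math

def scaled_linear_betas(n=1000, beta_start=0.00085, beta_end=0.012):
    """Betas spaced linearly in sqrt-space, then squared.
    The default values are assumed, not read from a model config."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (n - 1)) ** 2 for i in range(n)]

def sigmas_from_betas(betas):
    """Noise level for each discrete timestep:
    sigma_t = sqrt((1 - alpha_bar_t) / alpha_bar_t)."""
    sigmas, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta  # running product of (1 - beta)
        sigmas.append(math.sqrt((1.0 - alpha_bar) / alpha_bar))
    return sigmas

sigmas = sigmas_from_betas(scaled_linear_betas())
print(f"{len(sigmas)} discrete noise levels")
print(f"smallest sigma: {sigmas[0]:.4f}, largest sigma: {sigmas[-1]:.4f}")
```

A model trained on discrete timesteps only ever sees this fixed grid of noise levels, whereas a Karras-style setup draws sigma from a continuous distribution, so the two conditioning schemes do not line up automatically.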

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

sayakpaul commented 7 months ago

Closing due to inactivity.
