Thanks for reporting.
Meanwhile, training the instruct pix2pix model with rescale_betas_zero_snr enabled is important for helping the model preserve properties of the input image.
Can you elaborate this a bit more?
In any case, I will let @patil-suraj comment further here.
I have a similar problem when I train instruct pix2pix SDXL with fp16: the validation images generated during training come out black. If I change fp16 to fp32, the problem is solved.
Meanwhile, training the instruct pix2pix model with rescale_betas_zero_snr enabled is important for helping the model preserve properties of the input image.
Can you elaborate this a bit more?
So basically, the gap between training with non-zero SNR at the last timestep and inference starting from pure Gaussian noise matters more for instruct pix2pix. During training the model learns to edit the image based on both the input image (as a condition) and its diffused latent, which always retains some information about the original input; at inference time that piece of noisy information is not available. I find that passing a diffused input latent to the model, even with an alpha smaller than 0.0682, makes the model preserve properties of the input better than starting from pure Gaussian noise. A similar effect can be achieved by training the model with zero terminal SNR, in which case it doesn't matter that inference starts from Gaussian noise.
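To make the trick concrete, here is a minimal sketch of that latent initialization, assuming 0.0682 corresponds to sqrt(alpha_bar) at the final timestep of the default Stable Diffusion schedule; the function name and default alpha are illustrative, not taken from the training code:

```python
import torch

# Sketch of the idea above: instead of starting inference from pure Gaussian
# noise, seed the initial latent with a lightly diffused copy of the encoded
# input image so it still carries a faint trace of the original, as it did
# during training. `alpha` plays the role of sqrt(alpha_bar) at the terminal
# timestep (~0.0682 for the default SD schedule); smaller values also help.
def make_initial_latent(input_latent, alpha=0.05, generator=None):
    noise = torch.randn(
        input_latent.shape,
        generator=generator,
        device=input_latent.device,
        dtype=input_latent.dtype,
    )
    # x_T = alpha * x_0 + sqrt(1 - alpha**2) * noise
    return alpha * input_latent + (1.0 - alpha**2) ** 0.5 * noise
```

The result could then be passed as the pipeline's initial `latents` (the InstructPix2Pix pipeline exposes a `latents` argument), purely as an experiment.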
Did you try using a better VAE when doing the SDXL training?
No, I will try this. Thank you!
By the way, I have a related question about instruct pix2pix SDXL training. When I use one GPU to train instruct pix2pix SDXL it works, but when training with multiple GPUs it runs out of GPU memory. The following is my accelerate env and training config:
- `Accelerate` version: 0.27.0
- Platform: Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.10
- Python version: 3.8.12
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 1007.35 GB
- GPU type: NVIDIA A100-PCIE-40GB
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 3
- machine_rank: 0
- num_machines: 1
- gpu_ids: 0,1,2
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
accelerate launch --mixed_precision="fp16" --multi_gpu train_instruct_pix2pix_sdxl.py \
--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
--dataset_name=$DATASET_ID \
--use_ema \
--enable_xformers_memory_efficient_attention \
--resolution=512 --random_flip \
--train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
--max_train_steps=15000 \
--checkpointing_steps=5000 --checkpoints_total_limit=1 \
--learning_rate=5e-05 --lr_warmup_steps=0 \
--conditioning_dropout_prob=0.05 \
--seed=42 \
--val_image_url_or_path="https://datasets-server.huggingface.co/assets/fusing/instructpix2pix-1000-samples/--/fusing--instructpix2pix-1000-samples/train/23/input_image/image.jpg" \
--validation_prompt="make it in japan" \
--use_8bit_adam \
--report_to=wandb \
You can refer to https://github.com/sayakpaul/instructpix2pix-sdxl for better utilization of GPUs.
So basically, the gap between training with non-zero SNR at the last timestep and inference starting from pure Gaussian noise matters more for instruct pix2pix […]
Oh nice. Do you have some results for us to see? Just curious.
If upcast for the SDXL VAE is enabled, which defaults to true, and you still get black images, it is probably due to the same reason I mentioned.
Oh nice. Do you have some results for us to see? Just curious.
Sorry, this wasn't done for a personal project, so I guess I am not allowed to share.
You can refer to https://github.com/sayakpaul/instructpix2pix-sdxl for better utilization of GPUs.
Thank you!
hi @erliding:
I agree this code is unnecessary - you're welcome to open a PR to remove it :)
However, I don't think this is what caused the fp16 overflow - note that we do this step inside scheduler.step anyway: https://github.com/huggingface/diffusers/blob/3a7e481611bc299416aaeed4207086d9ddca5852/src/diffusers/schedulers/scheduling_euler_ancestral_discrete.py#L411
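For readers following along, here is a rough, self-contained paraphrase of the round trip being discussed; the function and variable names are simplified and this is not the exact pipeline source:

```python
import torch

# Paraphrase (not the exact pipeline source) of the round trip for sigma-space
# (Euler-style) schedulers: the epsilon prediction is mapped to a denoised
# sample, classifier-free guidance is applied in that space, and the result is
# mapped back to an epsilon before scheduler.step().
def to_denoised(latent_model_input: torch.Tensor, noise_pred: torch.Tensor, sigma: float) -> torch.Tensor:
    # epsilon prediction -> denoised-sample prediction
    return latent_model_input - sigma * noise_pred

def back_to_epsilon(guided_denoised: torch.Tensor, latents: torch.Tensor, sigma: float) -> torch.Tensor:
    # denoised-sample prediction -> epsilon, so scheduler.step() sees what it
    # expects; scheduler.step() then recomputes the denoised sample internally,
    # which is why the round trip looks redundant. With rescale_betas_zero_snr
    # the terminal sigma is around 4096, so doing these conversions in fp16 is
    # where the trouble is suspected.
    return (guided_denoised - latents) / (-sigma)
```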
You can refer to https://github.com/sayakpaul/instructpix2pix-sdxl for better utilization of GPUs.
Hi! I tried to train exactly according to the guidelines, which recommend using webdataset, but it still reports an out-of-memory error with multiple GPUs, while it works with a single GPU. It is very confusing!
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export MODEL_ID="/path/to/stable-diffusion-xl-base-1.0"
export DATASET_ID="/path/to/instructpix2pix-sdxl/upscale_swin2sr/instructpix2pix-clip-filtered_wds/{00000..00001}.tar"
export OUTPUT_DIR="sdxl-instructpix2pix"
export VAE_PATH="/path/to/sdxl-vae-fp16-fix"
accelerate launch --multi_gpu train_instruct_pix2pix_sdxl.py \
--pretrained_model_name_or_path=$MODEL_ID \
--pretrained_vae_model_name_or_path=$VAE_PATH \
--dataset_path="$DATASET_ID" \
--use_ema \
--enable_xformers_memory_efficient_attention \
--resolution=256 --random_flip \
--per_gpu_batch_size=1 --gradient_accumulation_steps=4 \
--num_workers=1 \
--max_train_steps=10000 \
--checkpointing_steps=2500 \
--learning_rate=1e-5 --lr_warmup_steps=0 \
--mixed_precision=fp16 \
--seed=42 \
--output_dir=$OUTPUT_DIR \
- `Accelerate` version: 0.27.2
- Platform: Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.10
- Python version: 3.8.12
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 1007.35 GB
- GPU type: NVIDIA A100-PCIE-40GB
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 3
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'gradient_accumulation_steps': 4, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero_stage': 2}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
We are digressing from the original thread, so let's not do that.
But what I can tell you is that the repository works perfectly fine on 8 80GB A100s. Other than that, I cannot help you much.
I understand, but it's really weird that even when I use a very small dataset (only 500 image pairs), it doesn't work with multiple GPUs due to running out of memory. However, if I use a large dataset with a single GPU, it works fine.
Anyway, that's just how it is. Another thing I wanted to mention is that I'm eagerly awaiting the release of instructpix2pix with Stable Cascade. I hope it comes out soon.
Thank you!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I think @erliding fixed this? So, I am closing the issue.
Describe the bug
The InstructPix2Pix pipeline has a flag
scheduler_is_in_sigma_space = hasattr(self.scheduler, "sigmas")
which results in the latent values being scaled back and forth from sigma space for the classifier-free-guidance calculation. This does not seem necessary, and in particular, when setting rescale_betas_zero_snr=True for e.g. EulerDiscreteScheduler or EulerAncestralDiscreteScheduler, the pipeline can generate black images at a high rate due to fp16 overflow. Meanwhile, training the instruct pix2pix model with rescale_betas_zero_snr enabled is important for helping the model preserve properties of the input image.
Reproduction
Overflow can happen quite often at https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py#L439, since rescale_betas_zero_snr scales the max sigma from 14.61 to 4096.
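As a quick, hedged illustration of where the 4096 comes from (the model id below is used only as an example of a standard SD-style schedule, and the exact clamp value depends on the diffusers version): zero-terminal-SNR rescaling pushes the terminal alpha_bar to (nearly) zero, so the terminal sigma = sqrt((1 - alpha_bar) / alpha_bar) explodes.

```python
from diffusers import EulerAncestralDiscreteScheduler

# Compare the largest sigma with and without zero-terminal-SNR rescaling.
# With rescaling, the terminal alpha_bar is clamped to roughly 2**-24, so
# sigma_max ≈ sqrt(2**24) ≈ 4096 instead of ~14.6 for the default schedule.
base = EulerAncestralDiscreteScheduler.from_pretrained(
    "timbrooks/instruct-pix2pix", subfolder="scheduler"
)
rescaled = EulerAncestralDiscreteScheduler.from_config(
    base.config, rescale_betas_zero_snr=True
)
print(base.sigmas.max())      # ~14.6
print(rescaled.sigmas.max())  # ~4096; fp16 intermediates at this scale quickly approach the fp16 max of 65504
```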
Logs
No response
System Info
main branch
Who can help?
@patil-suraj @sayakpaul