cc @sayakpaul here
Hi!
Here's what I did:
I cloned diffusers with git clone https://github.com/huggingface/diffusers and launched training with:
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5 \
--dataset_name=sayakpaul/instructpix2pix-1000-samples \
--use_ema \
--enable_xformers_memory_efficient_attention \
--resolution=512 --random_flip \
--train_batch_size=2 --gradient_accumulation_steps=4 --gradient_checkpointing \
--max_train_steps=20 \
--checkpointing_steps=10 --checkpoints_total_limit=1 \
--learning_rate=5e-05 --lr_warmup_steps=0 \
--conditioning_dropout_prob=0.05 \
--mixed_precision=fp16 \
--val_image_url="https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png" \
--validation_prompt="make the mountains snowy" \
--seed=42 \
--report_to=wandb
It worked on both single-GPU and multi-GPU machines. Notice that I didn't specify the --multi_gpu flag while launching training.
I installed diffusers from source by running pip install git+https://github.com/huggingface/diffusers.
With that, the training actually went fine and I didn't face any issues.
Could you help me reproduce the error?
Hi!
Thank you for the response. I found out that the problem is with the accelerate config: if you set the accelerate default_config.yaml as below, you can reproduce the error. It may be caused by FSDP.
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
dynamo_config:
dynamo_backend: INDUCTOR
dynamo_mode: default
dynamo_use_dynamic: true
dynamo_use_fullgraph: true
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch_policy: BACKWARD_PRE
fsdp_offload_params: false
fsdp_sharding_strategy: 2
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_transformer_layer_cls_to_wrap: ''
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
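Note that fsdp_transformer_layer_cls_to_wrap is left empty ('') in this config, which matches the exception quoted in the issue below. As a rough illustration of the failure mode, here is a minimal sketch (my own toy code, not accelerate's actual implementation) of a name-based auto-wrap lookup:

import torch.nn as nn

# Toy version of a name-based wrap-class lookup: FSDP's TRANSFORMER_BASED_WRAP
# policy needs a class name that actually occurs in the model, so an empty
# string can never match and the lookup fails.
def find_wrap_class(model: nn.Module, cls_name: str) -> type:
    for module in model.modules():
        if module.__class__.__name__ == cls_name:
            return module.__class__
    raise Exception("Could not find the transformer layer class to wrap in the model.")

toy = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
find_wrap_class(toy, "")  # raises, mirroring the traceback in this issue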
When I use the default config, which is:
{
"compute_environment": "LOCAL_MACHINE",
"distributed_type": "MULTI_GPU",
"downcast_bf16": false,
"machine_rank": 0,
"main_training_function": "main",
"mixed_precision": "no",
"num_machines": 1,
"num_processes": 4,
"rdzv_backend": "static",
"same_network": false,
"tpu_use_cluster": false,
"tpu_use_sudo": false,
"use_cpu": false
}
I got this error:
RuntimeError: [3]: params[0] in this process with sizes [320, 4, 3, 3] appears not to match sizes of the same param in process 0.
Then I changed "num_processes" to 1; it runs up to the optimization step but raises a CUDA out-of-memory error, even if I change the resolution to 16. Is there anything I can change? I am using a 3090 GPU with 24 GB of memory. BTW, this is the script I used:
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATASET_ID="fusing/instructpix2pix-1000-samples"
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_ID \
--use_ema \
--enable_xformers_memory_efficient_attention \
--resolution=16 --random_flip \
--train_batch_size=1 --gradient_accumulation_steps=4 --gradient_checkpointing \
--max_train_steps=20 \
--checkpointing_steps=10 --checkpoints_total_limit=1 \
--learning_rate=5e-05 --lr_warmup_steps=0 \
--conditioning_dropout_prob=0.05 \
--mixed_precision=fp16 \
--val_image_url="https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png" \
--validation_prompt="make the mountains snowy" \
--seed=42 \
--report_to=wandb
Maybe clear all the CUDA cache and restart the process? (A minimal sketch is below.)
You could also disable validation to deal with lower GPU memory.
Also, maybe try with Torch 1.13.1 if possible?
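For the cache-clearing suggestion above, a minimal sketch in plain PyTorch (nothing specific to this training script) is:

import gc
import torch

gc.collect()               # drop unreachable Python objects that still hold tensors
torch.cuda.empty_cache()   # return cached, unused blocks to the CUDA driver
# Optional: check what the allocator still holds after clearing.
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

Note that empty_cache() only releases memory the allocator has cached but is no longer using; it won't help if the model, the EMA copy (--use_ema), and the optimizer states themselves exceed the 3090's 24 GB.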
I cleared all the cache. What accelerate config and GPU do you use?
I am using default accelerate config with a single A100.
I see, thanks
Describe the bug
When I try to use accelerate launch train_instruct_pix2pix.py with one GPU, it reports the error below:
File "/home/xiangpeng.wan/miniconda3/envs/transformers/lib/python3.8/site-packages/accelerate/utils/dataclasses.py", line 836, in set_auto_wrap_policy raise Exception("Could not find the transformer layer class to wrap in the model.")
File "train_instruct_pix2pix.py", line 706, in main unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare Exception: Could not find the transformer layer class to wrap in the model.
I used the default accelerate config.
Reproduction
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATASET_ID="fusing/instructpix2pix-1000-samples"
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_ID \
--enable_xformers_memory_efficient_attention \
--resolution=256 --random_flip \
--train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
--max_train_steps=15000 \
--checkpointing_steps=5000 --checkpoints_total_limit=1 \
--learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
--conditioning_dropout_prob=0.05 \
--mixed_precision=fp16 \
--seed=42
Logs
No response
System Info
diffusers-0.15.0.dev0
python=3.8
torch=2.0.0
accelerate=0.18.0
ubuntu 20.04