clement-swk opened this issue 8 months ago
Does it happen without DeepSpeed? I am sadly not well-versed in DeepSpeed, so cannot help much.
https://github.com/huggingface/diffusers/pull/6628/files should fix the problem I think.
@sayakpaul I need DeepSpeed; otherwise training won't start (NVIDIA out-of-memory error).
I have edited my comment. See if that helps.
I have added the configuration to the command as follows:
accelerate launch --config_file $ACCELERATE_CONFIG_FILE train_text_to_image_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --pretrained_vae_model_name_or_path=$VAE_NAME \
  --dataset_name=$DATASET_NAME \
  --enable_xformers_memory_efficient_attention \
  --resolution=512 --center_crop --random_flip \
  --proportion_empty_prompts=0.2 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 --gradient_checkpointing \
  --max_train_steps=10000 \
  --use_8bit_adam \
  --learning_rate=1e-06 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --mixed_precision="fp16" \
  --validation_prompt="a cute Sundar Pichai creature" --validation_epochs 5 \
  --checkpointing_steps=5 \
  --output_dir="sdxl-pokemon-model"
but the same problem happens.
This was the config.yaml file I used:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 4
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
How about applying the changes from https://github.com/huggingface/diffusers/pull/6628/? More specifically, the changes introduced in examples/text_to_image/train_text_to_image_lora_sdxl.py?
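For reference, the helper that PR adds to the example scripts looks roughly like this (a minimal sketch, assuming the script's accelerator object is in scope; is_compiled_module comes from diffusers.utils.torch_utils):

from diffusers.utils.torch_utils import is_compiled_module

def unwrap_model(model):
    # Strip the Accelerate wrapper (e.g. the DeepSpeed engine) first,
    # then the torch.compile wrapper if the module was compiled.
    model = accelerator.unwrap_model(model)
    model = model._orig_mod if is_compiled_module(model) else model
    return model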
@sayakpaul Trying

if isinstance(unwrap_model(model), type(unwrap_model(unet))):
    model.save_pretrained(os.path.join(output_dir, "unet"))

in the code didn't change the error.
How about:
if isinstance(unwrap_model(model), type(unwrap_model(unet))):
+   unwrap_model(model).save_pretrained(os.path.join(output_dir, "unet"))
?
@sayakpaul The same error happens, even with
if isinstance(unwrap_model(model), type(unwrap_model(unet))):
+   unwrap_model(model).save_pretrained(os.path.join(output_dir, "unet"))
I have also tried running train_text_to_image_lora_sdxl.py to see if it worked, and got the same error as with train_text_to_image_sdxl.py.
Deactivating DeepSpeed makes train_text_to_image_lora_sdxl.py work fine.
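For reference, the save hook being discussed, with that change applied, would look roughly like this (a sketch rather than the exact script code; it assumes unet, accelerator, and an unwrap_model helper like the one above are in scope):

import os

def save_model_hook(models, weights, output_dir):
    for model in models:
        if isinstance(unwrap_model(model), type(unwrap_model(unet))):
            unwrap_model(model).save_pretrained(os.path.join(output_dir, "unet"))
        # Pop the weight so Accelerate does not serialize this model a second time.
        if weights:
            weights.pop()

# Registered once, before the training loop starts.
accelerator.register_save_state_pre_hook(save_model_hook)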
Cc: @HelloWorldBeginner. Could you help here if you have any pointers?
I haven't used CPU offload in DeepSpeed, but ZeRO-2 works fine on 8x A100s.
@clement-swk When using DeepSpeed, you can install apex, which DeepSpeed will use automatically. That works for me.
git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout 741bdf50825a97664db08574981962d66436d16a
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ --global-option="--cuda_ext" --global-option="--cpp_ext"
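If you go this route, a quick way to check that the C++/CUDA extensions actually built is the sketch below; amp_C and apex_C are the extension modules the --cuda_ext and --cpp_ext build options are expected to produce.

# If either import fails, the extension build above did not succeed.
import amp_C   # built with --cuda_ext
import apex_C  # built with --cpp_ext

print("apex C++/CUDA extensions are importable")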
@AoqunJin Thanks for your reply! I tried installing apex but the problem remains.
@clement-swk
You can also try removing the accelerator.is_main_process checks. This avoids calling save_model only in the main process, which cannot gather the states held by the other devices.
In

def save_model_hook(models, weights, output_dir):
    # if accelerator.is_main_process:

and

train_loss = 0.0
# if accelerator.is_main_process:
if global_step % args.checkpointing_steps == 0:

in train_text_to_image_lora_sdxl.py.
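Concretely, the checkpointing block then runs on every rank, roughly like this (a sketch that assumes the script's args, logger, global_step, and os import; with DeepSpeed, accelerator.save_state has to be called by all processes so the sharded optimizer and parameter states can be collected):

if global_step % args.checkpointing_steps == 0:
    # No accelerator.is_main_process guard here: with DeepSpeed, every rank must
    # enter save_state, otherwise the ranks that do call it hang waiting for the
    # ranks that skipped it.
    save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
    accelerator.save_state(save_path)
    logger.info(f"Saved state to {save_path}")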
@AoqunJin I tried it and the same error appeared.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Will give this a look.
I proposed a couple of fixes here: https://github.com/huggingface/accelerate/issues/2787. Does this help?
> @AoqunJin I tried it and the same error appeared.

Hello, have you solved this problem? I also encountered the same problem when using DeepSpeed.
@jyy-1998 In my case, it's caused by the fact that saving the ZeRO-partitioned model with DeepSpeed requires an all-gather of parameters that are held by the other processes.
You can try different diffusers versions, such as the earlier diffusers==0.11.1.
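If you need to write out the UNet weights yourself under ZeRO, one workaround along those lines is to let Accelerate perform the gather on every rank and only write from the main process. This is a sketch, with unet, accelerator, and output_dir taken from the training script:

import os
import torch

# Must run on every rank: with ZeRO-3, consolidating the partitioned parameters
# involves collectives across all processes (with ZeRO-2, parameters are not
# sharded, so this simply returns the model's state dict).
state_dict = accelerator.get_state_dict(unet)

if accelerator.is_main_process:
    save_dir = os.path.join(output_dir, "unet")
    os.makedirs(save_dir, exist_ok=True)
    # Plain torch checkpoint of the consolidated weights; adapt the file name
    # and format to whatever your loading code expects.
    torch.save(state_dict, os.path.join(save_dir, "pytorch_model.bin"))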
Describe the bug
I am trying to fine-tune SDXL, but the training script crashes when saving the model at a checkpoint. Training itself runs fine.
Reproduction
Here are my accelerate config choices:
Then I run this command, taken from the example in examples/text_to_image/README_sdxl.md.
Here I only modified checkpointing_steps to make the error happen faster.
Logs
System Info
diffusers version: 0.27.0.dev0
I have an RTX 4090.
Who can help?
@sayakpaul