pixeli99 / SVD_Xtend

Stable Video Diffusion Training Code and Extensions.

Multi-gpu training #24

Closed KKN18 closed 6 months ago

KKN18 commented 6 months ago

First off, thank you for providing the SVD training code.

I'm trying to train with multiple GPUs. Are there any changes I need to make to the code or to the launch command? For example, should I uncomment this part of the code, and is there anything else I need to do?

    # ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
    accelerator = Accelerator(
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        mixed_precision=args.mixed_precision,
        log_with=args.report_to,
        project_config=accelerator_project_config,
        # kwargs_handlers=[ddp_kwargs]
    )
pixeli99 commented 6 months ago

Oh, there's no need; you can just use the example in the README, similar to:

    accelerate launch train_svd.py \
        --pretrained_model_name_or_path=/path/to/weight \
        --per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
        --max_train_steps=50000 \
        --width=512 \
        --height=320 \
        --checkpointing_steps=1000 --checkpoints_total_limit=1 \
        --learning_rate=1e-5 --lr_warmup_steps=0 \
        --seed=123 \
        --mixed_precision="fp16" \
        --validation_steps=200
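
If you want to be explicit about how many GPUs are used, `accelerate launch` also accepts standard multi-GPU flags. A minimal sketch for two GPUs (the `--multi_gpu` and `--num_processes` options are general Accelerate launcher flags, not specific to this repo; most training arguments are omitted for brevity):

    # either run `accelerate config` once and choose multi-GPU,
    # or pass the flags explicitly on the command line
    CUDA_VISIBLE_DEVICES=0,1 accelerate launch --multi_gpu --num_processes=2 train_svd.py \
        --pretrained_model_name_or_path=/path/to/weight \
        --per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
        --mixed_precision="fp16"
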
KKN18 commented 6 months ago

Oh I see. Thank you!

chenbinghui1 commented 5 months ago

@pixeli99 Hello, I'm trying to use multiple GPUs, but the following error shows up (see the attached error screenshot).

The command is:

    CUDA_VISIBLE_DEVICES=0,1 accelerate launch ashui_train_svd.py \
        --pretrained_model_name_or_path="models/svd" \
        --pretrain_unet="models/svd_unet_11channels" \
        --gradient_checkpointing \
        --per_gpu_batch_size=1 --gradient_accumulation_steps=2 \
        --max_train_steps=400000 \
        --num_frames=25 \
        --width=512 \
        --height=896 \
        --checkpointing_steps=10000 --checkpoints_total_limit=100 \
        --learning_rate=1e-5 --lr_warmup_steps=0 \
        --seed=42 \
        --mixed_precision="fp16" \
        --validation_steps=200000 \
        --output_dir="./outputs/svd"

However, when using only one GPU (CUDA_VISIBLE_DEVICES=0), training runs correctly. Am I missing something?
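
Since the error screenshot is not preserved, the exact failure is unclear. One common reason a script trains fine on one GPU but fails under DDP is PyTorch's unused-parameter check; the commented-out lines in the snippet at the top of this issue are the standard workaround for that case. A minimal sketch of enabling them, assuming that is in fact the error:

    from accelerate import Accelerator, DistributedDataParallelKwargs

    # Assumption: the multi-GPU failure is DDP's unused-parameter error.
    # find_unused_parameters=True tells DDP to tolerate parameters that
    # receive no gradient in a given step (these are the lines commented
    # out in the snippet quoted at the top of this issue).
    ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
    accelerator = Accelerator(
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        mixed_precision=args.mixed_precision,
        log_with=args.report_to,
        project_config=accelerator_project_config,
        kwargs_handlers=[ddp_kwargs],
    )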