Open sonnv174 opened 6 months ago
My machine: 2x A5000 (2x 24 GB), train_batch_size = 1. Training was running and producing loss values (memory stable at ~15 GB per GPU), but then CUDA suddenly ran out of memory.
Accelerate config:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
Log: log.txt
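Not part of the original report, but since the config above uses DeepSpeed without any `deepspeed_config` section, one common mitigation for this kind of mid-training OOM is enabling ZeRO stage 2 with optimizer offload to CPU. A sketch of what that section could look like in the Accelerate config (the specific values here are assumptions, not taken from the reporter's setup):

```yaml
# Hypothetical addition to the accelerate config above:
# ZeRO stage 2 shards optimizer states and gradients across the 2 GPUs,
# and offloading the optimizer to CPU frees additional GPU memory.
deepspeed_config:
  zero_stage: 2
  offload_optimizer_device: cpu
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  zero3_init_flag: false
```

If memory still spikes, ZeRO stage 3 (`zero_stage: 3` with `offload_param_device: cpu`) shards parameters as well, at the cost of slower steps.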
I want to know the GPU configuration required for training, and where I can rent such GPUs.
@sonnv174 Same error here. Have you solved it?