Open sonnv174 opened 6 months ago
My machine: 2x A5000 (2x 24 GB), train_batch_size = 1. Training was running and producing loss values (memory stable at ~15 GB per GPU), but then CUDA suddenly ran out of memory.
Accelerate config:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
Log: log.txt
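Not part of the original report, but since the config above uses DeepSpeed without any `deepspeed_config` section, one common mitigation for this kind of mid-training OOM is enabling ZeRO stage 2 with optimizer offload to CPU. A sketch of what that section could look like in the Accelerate config (the specific values here are assumptions, not taken from the reporter's setup):

```yaml
# Hypothetical addition to the accelerate config above:
# ZeRO stage 2 shards optimizer states and gradients across the 2 GPUs,
# and offloading the optimizer to CPU frees additional GPU memory.
deepspeed_config:
  zero_stage: 2
  offload_optimizer_device: cpu
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  zero3_init_flag: false
```

If memory still spikes, ZeRO stage 3 (`zero_stage: 3` with `offload_param_device: cpu`) shards parameters as well, at the cost of slower steps.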
I want to know the GPU configuration required for training, and where I can rent such GPUs.
@sonnv174 Same error here. Have you solved it?