l294265421 / alpaca-rlhf

Finetuning LLaMA with RLHF (Reinforcement Learning with Human Feedback) based on DeepSpeed Chat
https://88aeeb3aef5040507e.gradio.live/
MIT License

GPU out of memory when training on V100 #3

Closed iamsile closed 1 year ago

iamsile commented 1 year ago

Hello, when I train both SFT and RM on V100 GPUs, training fails with an out-of-memory error. The exact message is: OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 1; 31.75 GiB total capacity; 29.88 GiB already allocated; 11.75 MiB free; 29.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory

I have already set per_device_train_batch_size and per_device_eval_batch_size to 1, but it still runs out of memory. Is there any way to solve this?
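The error message itself points at one more knob worth trying: PyTorch's caching-allocator option `max_split_size_mb`, which limits block splitting and can reduce fragmentation-related OOMs. A minimal sketch of setting it (the value 128 is an illustrative choice, not a recommendation from this repo, and it must be set before the first CUDA allocation):

```python
import os

# Configure PyTorch's caching allocator before any CUDA tensor is created.
# "max_split_size_mb:128" is an illustrative value; tune it for your workload.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

This only mitigates fragmentation; if the model and activations genuinely exceed 32 GiB, the flag changes below are still needed.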

l294265421 commented 1 year ago
  1. Reduce --max_seq_len
  2. Reduce --lora_dim, but keep it greater than 0
  3. Set --lora_module_name to q_proj only
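The three suggestions above can be combined into the training launch flags. A minimal sketch, assuming a DeepSpeed Chat-style entry point; the script name `main.py`, the concrete values, and everything except the three flags named above are assumptions, not the exact command from this repository:

```shell
# Hypothetical launch command illustrating the memory-saving flags above.
# Script path and values are assumptions; adapt to the repo's actual scripts.
deepspeed main.py \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --max_seq_len 256 \
  --lora_dim 8 \
  --lora_module_name q_proj
```

Shorter sequences shrink activation memory, a small non-zero --lora_dim keeps most base weights frozen, and restricting LoRA to q_proj minimizes the number of trainable parameters and their optimizer states.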
iamsile commented 1 year ago

Thanks a lot for the answer!