l294265421 / alpaca-rlhf

Finetuning LLaMA with RLHF (Reinforcement Learning with Human Feedback) based on DeepSpeed Chat
https://88aeeb3aef5040507e.gradio.live/
MIT License

GPU out of memory when training on V100 #3

Closed iamsile closed 1 year ago

iamsile commented 1 year ago

Hello, when I train both SFT and RM on V100 GPUs, training fails with an out-of-memory error. The exact message is: OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 1; 31.75 GiB total capacity; 29.88 GiB already allocated; 11.75 MiB free; 29.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory

I have already set per_device_train_batch_size and per_device_eval_batch_size to 1, but it still runs out of memory. Is there any way to solve this?
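The error message itself points at one more knob worth trying: PyTorch's caching-allocator option `max_split_size_mb`, which limits block splitting and can reduce fragmentation-related OOMs. A minimal sketch of setting it (the value 128 is an illustrative choice, not a recommendation from this repo, and it must be set before the first CUDA allocation):

```python
import os

# Configure PyTorch's caching allocator before any CUDA tensor is created.
# "max_split_size_mb:128" is an illustrative value; tune it for your workload.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

This only mitigates fragmentation; if the model and activations genuinely exceed 32 GiB, the flag changes below are still needed.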

l294265421 commented 1 year ago
  1. Reduce --max_seq_len
  2. Reduce --lora_dim, but keep it greater than 0
  3. Set --lora_module_name to q_proj only
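The three suggestions above can be combined into the training launch flags. A minimal sketch, assuming a DeepSpeed Chat-style entry point; the script name `main.py`, the concrete values, and everything except the three flags named above are assumptions, not the exact command from this repository:

```shell
# Hypothetical launch command illustrating the memory-saving flags above.
# Script path and values are assumptions; adapt to the repo's actual scripts.
deepspeed main.py \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --max_seq_len 256 \
  --lora_dim 8 \
  --lora_module_name q_proj
```

Shorter sequences shrink activation memory, a small non-zero --lora_dim keeps most base weights frozen, and restricting LoRA to q_proj minimizes the number of trainable parameters and their optimizer states.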
iamsile commented 1 year ago

Thanks a lot for the answer!