ssbuild / chatglm3_finetuning

Apache License 2.0

torch.cuda.OutOfMemoryError: CUDA out of memory. #31

Closed Essence9999 closed 11 months ago

Essence9999 commented 11 months ago

bash train_lora_int4.sh -m train

train_pl.yaml configuration file: [image]

I changed the model load path and loaded my own dataset. The program keeps failing with OOM (my hardware: A10 24G).

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 386.00 MiB. GPU 0 has a total capacty of 22.02 GiB of which 165.19 MiB is free. Including non-PyTorch memory, this process has 21.86 GiB memory in use. Of the allocated memory 21.13 GiB is allocated by PyTorch, and 458.98 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Epoch 0: 0%| | 0/64 [00:09<?, ?it/s]
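For readers trying to interpret the numbers in the error message above, here is a minimal sketch that parses the message and reports how the card's memory is split. The helper names `parse_oom` and `to_mib` are made up for illustration and are not part of PyTorch; the regular expressions are written only against the wording of this particular message and may not match other PyTorch versions.

```python
import re

# Error text copied from the report above (PyTorch's own spelling "capacty" is kept).
MESSAGE = (
    "CUDA out of memory. Tried to allocate 386.00 MiB. "
    "GPU 0 has a total capacty of 22.02 GiB of which 165.19 MiB is free. "
    "Including non-PyTorch memory, this process has 21.86 GiB memory in use. "
    "Of the allocated memory 21.13 GiB is allocated by PyTorch, "
    "and 458.98 MiB is reserved by PyTorch but unallocated."
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(value: str, unit: str) -> float:
    """Convert a (number, unit) pair from the message into MiB."""
    return float(value) * UNIT_TO_MIB[unit]


def parse_oom(message: str) -> dict:
    """Extract the sizes quoted in the OOM message, all expressed in MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"total capac\w* of ([\d.]+) (MiB|GiB)",
        "free": r"of which ([\d.]+) (MiB|GiB) is free",
        "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
        "reserved_unallocated": r"([\d.]+) (MiB|GiB) is reserved by PyTorch",
    }
    stats = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            stats[name] = to_mib(*match.groups())
    return stats


stats = parse_oom(MESSAGE)
allocated_share = stats["allocated"] / stats["total"]
reserved_share = stats["reserved_unallocated"] / stats["total"]

if __name__ == "__main__":
    print(f"allocated by PyTorch:     {allocated_share:.1%} of the card")
    print(f"reserved but unallocated: {reserved_share:.1%} of the card")
```

In this report the reserved-but-unallocated pool is only about 2% of the card, while live allocations account for roughly 96%, so the bulk of the memory is held by tensors that are actually in use.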

Essence9999 commented 11 months ago

I changed max_seq_length to 24; GPU memory still fills up completely and it still reports OOM.

ssbuild commented 11 months ago

In my tests with torch 2.1.0, QLoRA seems to run in about 16-18 GB! Run pip list and check your environment.
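As an alternative to scanning the full `pip list` output, a short script can print just the versions that matter. This is a sketch only: apart from torch, the package names in `PACKAGES` are assumptions about what a QLoRA setup typically depends on and are not taken from this thread; adjust the tuple to match the repository's requirements.

```python
from importlib import metadata

# Only "torch" is named in this thread; the other entries are assumed examples.
PACKAGES = ("torch", "transformers", "peft", "bitsandbytes")


def installed_versions(names=PACKAGES):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting this output into the issue gives the maintainer the same information as `pip list` in a shorter form.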

Essence9999 commented 11 months ago

> In my tests with torch 2.1.0, QLoRA seems to run in about 16-18 GB! Run pip list and check your environment.

Sorry, I just tested again and found that the best_ckpt and outputs_pl folders under the scripts/ directory need to be deleted; after that, training runs normally.
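The cleanup described above can be sketched as a small helper. The function name `remove_stale_outputs` and its `dry_run` flag are hypothetical, added here for illustration; only the folder names and the scripts/ location come from this thread, and the thread does not explain why the leftover folders caused the OOM. The helper defaults to a dry run so nothing is deleted until you opt in.

```python
import shutil
from pathlib import Path

# Folder names taken from the comment above.
STALE_DIRS = ("best_ckpt", "outputs_pl")


def remove_stale_outputs(scripts_dir, names=STALE_DIRS, dry_run=True):
    """List (and, unless dry_run is set, delete) leftover output folders.

    Returns the paths that exist and were, or would be, removed.
    """
    found = []
    for name in names:
        target = Path(scripts_dir) / name
        if target.is_dir():
            if not dry_run:
                shutil.rmtree(target)
            found.append(target)
    return found


if __name__ == "__main__":
    # Dry run first: prints what would be deleted without touching anything.
    for path in remove_stale_outputs("scripts"):
        print(f"would remove: {path}")
```

Once the dry-run listing looks right, calling `remove_stale_outputs("scripts", dry_run=False)` performs the deletion. Back up any checkpoint you still need first, since `shutil.rmtree` is not reversible.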