I got the following error, which is identical to the one reported in a previous issue, #1428:
```
/home/kun/anaconda3/envs/llm-fine-tune/lib/python3.9/site-packages/torch/distributed/fsdp/_state_dict_utils.py:312: UserWarning: Failed to clone() tensor with name lm_head.weight on rank 2. This may mean that this state_dict entry could point to invalid memory regions after returning from state_dict() call if this parameter is managed by FSDP. Please check clone implementation of lm_head.weight. Error: CUDA out of memory. Tried to allocate 500.00 MiB (GPU 2; 39.39 GiB total capacity; 37.52 GiB already allocated; 58.00 MiB free; 38.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  warnings.warn(
Traceback (most recent call last):
  File "/home/kun/llm_fine_tune/vicuna-finetune/fast-chat-training/FastChat/fastchat/train/train_mem.py", line 13, in <module>
    train()
  File "/home/kun/anaconda3/envs/llm-fine-tune/lib/python3.9/site-packages/fastchat/train/train.py", line 284, in train
    safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
  File "/home/kun/anaconda3/envs/llm-fine-tune/lib/python3.9/site-packages/fastchat/train/train.py", line 76, in safe_save_model_for_hf_trainer
    cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
  File "/home/kun/anaconda3/envs/llm-fine-tune/lib/python3.9/site-packages/fastchat/train/train.py", line 76, in <dictcomp>
    cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
While checking CUDA usage, I found that the memory usage distribution is highly skewed: GPU 7 (in my case) reached full capacity while the other GPUs were far from fully used. Any ideas?
The script used for fine-tuning is FastChat's `train_mem.py`, targeting `lmsys/vicuna-7b-v1.5`.
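
One workaround I am considering, based on PyTorch's FSDP state-dict API, is to gather the full state dict with CPU offload so the assembled tensors are never cloned on a single GPU during `state_dict()`. Below is a minimal sketch, not FastChat's actual code: the helper name `save_fsdp_model_cpu_offload` is mine, I assume `trainer.model` is the FSDP-wrapped module, and the final save call is a guess at how this would slot into `safe_save_model_for_hf_trainer`.

```python
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    FullStateDictConfig,
)


def save_fsdp_model_cpu_offload(trainer, output_dir: str):
    """Hypothetical replacement for safe_save_model_for_hf_trainer."""
    model = trainer.model  # assumed to be the FSDP-wrapped module
    # offload_to_cpu=True moves each gathered parameter to host memory;
    # rank0_only=True assembles the full dict only on rank 0, so the
    # other ranks never hold a full copy of the weights.
    save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_policy):
        cpu_state_dict = model.state_dict()
    if trainer.args.should_save:
        # Hand the CPU state dict to the Trainer's own save path; adjust
        # this to however your train.py actually writes the checkpoint.
        trainer._save(output_dir, state_dict=cpu_state_dict)
```

If this works, it would only address the OOM at save time; it would not explain the uneven per-GPU memory I see during training itself.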