ghost opened this issue 11 months ago
Hey @keval2415, I'm seeing the same thing on my end, except I'm running Llama 2 7B on 2 A100 40 GB GPUs. Have you been able to solve the issue?
Hi @keval2415, just posting this in case anyone else runs into this issue. I found that it was most likely related to checkpointing the optimizer states in FSDP (described in this issue and solved in this PR). I solved it by upgrading my PyTorch version from 2.0.1 to 2.1.0.
Thanks @sachalevy. Could you please share the complete working code with me? I am still getting other errors, such as LoRA not being supported with FSDP.
I am currently using the FSDP (Fully Sharded Data Parallelism) approach with the Llama 2 70B model. Training starts fine, but I encounter an error when attempting to save the checkpoint at each save_step. I have set save_step to 50.
System: 1 node with 2 A100 80 GB GPUs
Here are the supporting screenshots:
@pacman100