pacman100 / LLM-Workshop

LLM Workshop by Sourab Mangrulkar
Apache License 2.0

Error on save_steps using FSDP #6

Open ghost opened 11 months ago

ghost commented 11 months ago

I am currently using FSDP (Fully Sharded Data Parallelism) to train the Llama 2 70B model. Training starts fine, but I hit an error whenever a checkpoint is saved at a save_steps interval. I have set save_steps to 50.

System: 1 node with 2 A100 80 GB GPUs
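A minimal sketch of the kind of setup that triggers this (the model name, dummy dataset, and hyperparameters below are illustrative placeholders, not the poster's exact script):

```python
# Hypothetical reproduction sketch: HF Trainer with FSDP enabled,
# checkpointing every 50 steps. Launch with e.g.:
#   torchrun --nproc_per_node=2 train.py
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-70b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Tiny dummy dataset so the sketch is self-contained.
texts = ["hello world"] * 512
enc = tokenizer(texts, truncation=True, padding="max_length", max_length=64)
enc["labels"] = enc["input_ids"].copy()
train_dataset = Dataset.from_dict(dict(enc))

args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=1,
    save_strategy="steps",
    save_steps=50,                # checkpoint written every 50 steps
    fsdp="full_shard auto_wrap",  # enable FSDP inside Trainer
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()  # on torch 2.0.1 the optimizer-state save at step 50 can fail
```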

Supporting screenshots of the error traceback are attached: MicrosoftTeams-image (1), MicrosoftTeams-image (2).

@pacman100

sachalevy commented 11 months ago

Hey @keval2415, I'm seeing the same thing on my end, except I'm running Llama 2 7B on 2 A100 40 GB GPUs. Have you been able to solve the issue?

sachalevy commented 11 months ago

Hi @keval2415, just posting this in case anyone else runs into this issue. I found that it was most likely related to checkpointing the optimizer state in FSDP (described in this issue and fixed in this PR).

I solved it by upgrading my PyTorch version from 2.0.1 to 2.1.0.
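For anyone hitting this later, here is a hedged sketch of the save path that changed between those two versions, using the public `torch.distributed.fsdp` API (the helper below is illustrative, not code from this repo). Consolidating the sharded optimizer state onto rank 0 is the step that raised the error under 2.0.1:

```python
# Hedged sketch of FSDP full-state-dict checkpointing (illustrative helper,
# not code from this repo). Assumes torch.distributed is already initialized
# and `model` is an FSDP-wrapped module.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullOptimStateDictConfig,
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)

def save_fsdp_checkpoint(model: FSDP, optimizer: torch.optim.Optimizer, path: str) -> None:
    # Gather full (unsharded) states onto rank 0, offloaded to CPU.
    FSDP.set_state_dict_type(
        model,
        StateDictType.FULL_STATE_DICT,
        FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
        FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=True),
    )
    model_state = model.state_dict()
    # Consolidating the sharded optimizer state is the step that
    # errored on 2.0.1 and works on 2.1.0.
    optim_state = FSDP.optim_state_dict(model, optimizer)
    if dist.get_rank() == 0:
        torch.save({"model": model_state, "optim": optim_state}, path)
```

Upgrading is just `pip install --upgrade "torch>=2.1.0"`.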

kevaldekivadiya2415 commented 11 months ago

Thanks @sachalevy. Could you please share your complete working code? I am still getting other errors, such as LoRA not being supported with FSDP. (Screenshot attached: MicrosoftTeams-image (3).)