premmotgi opened this issue 1 month ago (status: Open)
Hey @premmotgi. You can use the `save_adapter_weights_only` config entry (i.e. `save_adapter_weights_only=True` from the CLI) to only save adapter weights when checkpointing, which should be a fraction of the size.
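For example, a minimal CLI sketch (assuming the 70B LoRA distributed recipe and stock config - adjust the recipe/config names to whatever you are actually running):

```bash
# Override the config entry from the command line so that only the LoRA
# adapter weights are written at each checkpoint, not the full model.
tune run --nproc_per_node 8 lora_finetune_distributed \
    --config llama3/70B_lora \
    save_adapter_weights_only=True
```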
Saving to an NFS share may also be a lot slower than a local drive - are you getting good transfer speeds to your NFS share?
@SalmanMohammadi I understood the use of `save_adapter_weights_only=True`. Does this apply only to LoRA, or does it apply to full fine-tuning as well?
Also, I still think it's worth having an option to enable/disable checkpointing: in the development phase, not everyone wants to save the tuned model after every epoch, since it takes up space. Alternatively, a checkpointing-interval config could control how often checkpoints are written.
Thanks for raising the issue @premmotgi. We currently don't have a flag for it, but I share your frustration. Usually I go to the recipe and just comment out the line. If you are interested in contributing, we could add a flag to control checkpointing.
If you do add the flag, here is a script to update all configs in bulk to include it: https://gist.github.com/felipemello1/5f2002433c6da3a21f33d6cdf82e702a
PyTorch also has functionality to save checkpoints asynchronously, but we haven't explored it, AFAIK.
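In case it's useful for whoever picks this up, here is a minimal sketch of that PyTorch functionality (`torch.distributed.checkpoint.async_save`, a beta API in recent releases). This is not how torchtune saves checkpoints today, just an illustration; the paths and the toy model below are placeholders:

```python
import torch
import torch.distributed.checkpoint as dcp

# Stand-ins for the model/optimizer a recipe would already hold.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

state_dict = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}

# async_save stages the state dict and returns a Future immediately; the
# actual write happens in the background, so the next epoch can start while
# the previous checkpoint is still being persisted.
future = dcp.async_save(state_dict, checkpoint_id="/tmp/ckpt/epoch_0")

# ... run the next epoch here ...

# Block before the next save (or at the end of training) to make sure the
# previous checkpoint finished writing.
future.result()
```

One caveat, if I remember correctly: in a multi-GPU run the process group needs a CPU backend (e.g. initialized with "cpu:gloo,cuda:nccl") so the staging step has somewhere to offload to.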
@felipemello1 No worries at all. I just figured it would be useful to many people experimenting with fine-tuning larger models such as 70B who want to save some storage space. I will add a flag and update the configs in bulk. Thanks for sharing the script.
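For context, roughly the shape I have in mind inside the recipe's training loop (the names below are placeholders I made up for illustration, not the final flag names):

```python
# Rough sketch of the proposed behaviour, not actual torchtune recipe code.
def train(recipe, cfg, total_epochs: int) -> None:
    # Hypothetical config entries; the names that land in the PR may differ.
    save_checkpoints: bool = cfg.get("save_checkpoints", True)
    checkpoint_every_n_epochs: int = cfg.get("checkpoint_every_n_epochs", 1)

    for curr_epoch in range(total_epochs):
        recipe.train_one_epoch(curr_epoch)  # stand-in for the existing epoch loop

        # Skip the slow checkpoint write entirely, or only write every N epochs.
        if save_checkpoints and (curr_epoch + 1) % checkpoint_every_n_epochs == 0:
            recipe.save_checkpoint(epoch=curr_epoch)
```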
I am trying to run fine-tuning for the Llama3 8B and Llama3 70B models, from single-GPU up to multi-node distributed setups.
Below is my training configuration:

SFT (Llama3 8B & 70B)

- Epochs: 3
- Gradient Accumulation Steps: 1
- Batch Size: 32
- Data Type: bf16
- Enable Activation Checkpointing: True
- Memory Efficient FSDP Wrap: True
- FSDP: Enabled

LoRA (Llama3 8B & 70B)

- Epochs: 3
- Gradient Accumulation Steps: 1
- Batch Size: 32
- Data Type: bf16
- Enable Activation Checkpointing: True
- Learning Rate: 3e-4
- FSDP: Enabled
Observation/Error:
I see that after each epoch the tune module saves a checkpoint to the NFS share. On average this takes around 13 minutes after each epoch before the next epoch starts. Because of this, I am not able to capture accurate training times for the various runs.
Is there any way I could disable checkpointing at all?