premmotgi opened this issue 1 month ago (status: Open)
Hey @premmotgi. You can use the `save_adapter_weights_only` config entry (i.e. `save_adapter_weights_only=True` from the CLI) to only save adapter weights when checkpointing, which should be a fraction of the size.
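For example, a minimal CLI sketch (assuming the 70B LoRA distributed recipe and stock config - adjust the recipe/config names to whatever you are actually running):

```bash
# Override the config entry from the command line so that only the LoRA
# adapter weights are written at each checkpoint, not the full model.
tune run --nproc_per_node 8 lora_finetune_distributed \
    --config llama3/70B_lora \
    save_adapter_weights_only=True
```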
Saving to an NFS share may also be a lot slower than a local drive - are you getting good transfer speeds to your NFS share?
@SalmanMohammadi I understood the use of `save_adapter_weights_only=True`. Does this apply only to LoRA, or does it apply to full fine-tuning as well?
Also, I still think it's worth having an option to enable/disable checkpointing: in the development phase, not everyone wants to save the tuned model after every epoch, since it takes up space. Alternatively, a checkpointing-interval config could control how often checkpoints are written.
Thanks for raising the issue @premmotgi. We currently don't have a flag for it, but I share your frustration. Usually I go to the recipe and just comment out the line. If you are interested in contributing, we could add a flag to control checkpointing.
If you do add the flag, here is a script to update all configs in bulk to include it: https://gist.github.com/felipemello1/5f2002433c6da3a21f33d6cdf82e702a
PyTorch also has functionality to save checkpoints asynchronously, but we haven't explored it, AFAIK.
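In case it's useful for whoever picks this up, here is a minimal sketch of that PyTorch functionality (`torch.distributed.checkpoint.async_save`, a beta API in recent releases). This is not how torchtune saves checkpoints today, just an illustration; the paths and the toy model below are placeholders:

```python
import torch
import torch.distributed.checkpoint as dcp

# Stand-ins for the model/optimizer a recipe would already hold.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

state_dict = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}

# async_save stages the state dict and returns a Future immediately; the
# actual write happens in the background, so the next epoch can start while
# the previous checkpoint is still being persisted.
future = dcp.async_save(state_dict, checkpoint_id="/tmp/ckpt/epoch_0")

# ... run the next epoch here ...

# Block before the next save (or at the end of training) to make sure the
# previous checkpoint finished writing.
future.result()
```

One caveat, if I remember correctly: in a multi-GPU run the process group needs a CPU backend (e.g. initialized with "cpu:gloo,cuda:nccl") so the staging step has somewhere to offload to.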
@felipemello1 No worries at all. I just figured it would be useful to many people experimenting with fine-tuning larger models such as 70B who want to save some storage space. I will add a flag and update the configs in bulk. Thanks for sharing the script.
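For context, roughly the shape I have in mind inside the recipe's training loop (the names below are placeholders I made up for illustration, not the final flag names):

```python
# Rough sketch of the proposed behaviour, not actual torchtune recipe code.
def train(recipe, cfg, total_epochs: int) -> None:
    # Hypothetical config entries; the names that land in the PR may differ.
    save_checkpoints: bool = cfg.get("save_checkpoints", True)
    checkpoint_every_n_epochs: int = cfg.get("checkpoint_every_n_epochs", 1)

    for curr_epoch in range(total_epochs):
        recipe.train_one_epoch(curr_epoch)  # stand-in for the existing epoch loop

        # Skip the slow checkpoint write entirely, or only write every N epochs.
        if save_checkpoints and (curr_epoch + 1) % checkpoint_every_n_epochs == 0:
            recipe.save_checkpoint(epoch=curr_epoch)
```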
I am trying to run fine-tuning for the Llama3 8B and Llama3 70B models, from single-GPU up to multi-node distributed setups.
Below is my training configuration:

SFT (Llama3 8B & 70B)

- Epochs: 3
- Gradient Accumulation Steps: 1
- Batch Size: 32
- Data Type: bf16
- Enable Activation Checkpointing: True
- Memory Efficient FSDP Wrap: True
- FSDP: Enabled

LoRA (Llama3 8B & 70B)

- Epochs: 3
- Gradient Accumulation Steps: 1
- Batch Size: 32
- Data Type: bf16
- Enable Activation Checkpointing: True
- Learning Rate: 3e-4
- FSDP: Enabled
Observation/Error:
I see that after each epoch the tune module saves a checkpoint to the NFS share. On average this takes around 13 minutes after each epoch before the next epoch starts. Because of this, I am not able to capture accurate training times for the various runs.
Is there any way I could disable checkpointing at all?