huggingface / notebooks

Notebooks using the Hugging Face libraries 🤗
Apache License 2.0

FSDP training: best checkpoint is not saved and checkpoints cannot be loaded #472

Open BSharmi opened 5 months ago

BSharmi commented 5 months ago

Hi there!

I trained a T5 model with FSDP on SageMaker, following the example at https://github.com/huggingface/notebooks/blob/main/sagemaker/25_pytorch_fsdp_model_parallelism/scripts/run_clm.py

I noticed that the example disables checkpointing by setting save_strategy="no". Is that intentional (see https://github.com/huggingface/notebooks/blob/main/sagemaker/25_pytorch_fsdp_model_parallelism/scripts/run_clm.py#L93)? In my training I changed it to save_strategy="steps" and ran into two issues:

  1. The best checkpoint (the one with the lowest validation loss) is not kept. If I set the checkpoint limit to 2, for example, only the last 2 checkpoints are saved.
  2. I was not able to load the trained model from a checkpoint. It fails with an error that is also reported in other issues: RuntimeError: Trying to resize storage that is not resizable. This does not happen when I load the final model. It still makes training hard, because I need to know in advance when to stop so that the final saved model is the one with the minimum loss.

I tried two version combinations and see the same checkpoint-loading error with both:

  - PyTorch 1.13, Transformers 4.26
  - PyTorch 2.0.0, Transformers 4.28.1
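
For context on the first issue, here is a minimal sketch of the Trainer settings that, as far as I understand, normally make the checkpoint rotation keep the best checkpoint in addition to the most recent one. The output directory, step counts, and limit are placeholder values, not taken from the example script, and I have not confirmed that this behaves the same way under FSDP with the versions listed above.

```python
# Sketch only: all values below are hypothetical placeholders.
BEST_CHECKPOINT_KWARGS = {
    "output_dir": "./checkpoints",       # placeholder path
    # Named `evaluation_strategy` in Transformers 4.26 / 4.28;
    # newer releases call this argument `eval_strategy`.
    "evaluation_strategy": "steps",
    "eval_steps": 500,
    "save_strategy": "steps",            # must match the evaluation strategy
    "save_steps": 500,                   # must be a multiple of eval_steps
    "save_total_limit": 2,               # the "limit" from issue 1
    # With this flag set, the Trainer tracks the best checkpoint and
    # should exclude it from deletion when rotating old checkpoints.
    "load_best_model_at_end": True,
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,          # lower validation loss is better
}


def make_training_args(**overrides):
    """Build TrainingArguments from the sketch above (needs transformers)."""
    # Imported lazily so the settings can be inspected without the library.
    from transformers import TrainingArguments

    return TrainingArguments(**{**BEST_CHECKPOINT_KWARGS, **overrides})
```

Calling make_training_args() and passing the result to the Trainer would be the intended use. Without load_best_model_at_end=True, keeping only the most recent checkpoints is the behavior I would expect from save_total_limit.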

I would appreciate any pointers.

Thank you!