I followed the example at https://github.com/huggingface/notebooks/blob/main/sagemaker/25_pytorch_fsdp_model_parallelism/scripts/run_clm.py to train a T5 model with FSDP on SageMaker.
I noticed that checkpointing is disabled with save_strategy="no". Is that intentional (https://github.com/huggingface/notebooks/blob/main/sagemaker/25_pytorch_fsdp_model_parallelism/scripts/run_clm.py#L93)? In my training I changed it to save_strategy="steps" and noticed two issues:
1. The best checkpoint (by minimum validation loss) is not saved. If I set the checkpoint limit to 2, for example, only the last 2 checkpoints are kept.
2. I was not able to load the trained model from a checkpoint and got the error mentioned elsewhere in the issues: RuntimeError: Trying to resize storage that is not resizable. This does not happen when loading the final model, but it makes training hard, since I need to know in advance when to stop training so that the final model is the one with the minimum loss. I tried different versions:
- PyTorch 1.13, Transformers 4.26
- PyTorch 2.0.0, Transformers 4.28.1
and see the same issue with loading a model from a checkpoint.
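To make the setup concrete, here is a minimal sketch of the change I made and how I try to reload; the output path, step counts, and checkpoint name are illustrative, not the exact values from my job:

```python
from transformers import T5ForConditionalGeneration, TrainingArguments

# Checkpointing settings I switched to (the example script uses save_strategy="no").
training_args = TrainingArguments(
    output_dir="/opt/ml/checkpoints",   # illustrative path
    save_strategy="steps",
    save_steps=500,
    evaluation_strategy="steps",
    eval_steps=500,
    save_total_limit=2,                 # expected the 2 best checkpoints; the last 2 are kept instead
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# Loading the final model works, but loading an intermediate checkpoint fails with:
# RuntimeError: Trying to resize storage that is not resizable
model = T5ForConditionalGeneration.from_pretrained("/opt/ml/checkpoints/checkpoint-500")
```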
Would appreciate any pointers.
Thank you!