Hello @shrinath-suresh, this issue has to be fixed on the PyTorch side. The issue raised with PyTorch has been linked above.
Also, when using auto_wrap, please specify either --fsdp_transformer_layer_cls_to_wrap <value> or --fsdp_min_num_params <number> as part of the command-line arguments. This is what enables sharding of parameters, gradients and optimizer state across GPUs, so that peak memory usage drops further and you get the most out of FSDP. For more details, please refer to https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html and https://pytorch.org/docs/1.12/fsdp.html?highlight=fsdp#module-torch.distributed.fsdp.
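As a rough sketch of the rule above, the helper below assembles the FSDP-related command-line arguments and rejects an auto_wrap configuration that names no wrapping criterion. The helper build_fsdp_args is hypothetical and not part of transformers; the --fsdp value "full_shard auto_wrap" is assumed for illustration, and reading "either ... or" as "exactly one of the two" is also an assumption.

```python
from typing import List, Optional


def build_fsdp_args(
    fsdp_mode: str,
    layer_cls_to_wrap: Optional[str] = None,
    min_num_params: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: build Trainer CLI arguments for FSDP."""
    args = ["--fsdp", fsdp_mode]
    if "auto_wrap" in fsdp_mode.split():
        # Assumption: exactly one wrapping criterion should be supplied.
        if (layer_cls_to_wrap is None) == (min_num_params is None):
            raise ValueError(
                "with auto_wrap, pass either layer_cls_to_wrap or min_num_params"
            )
        if layer_cls_to_wrap is not None:
            args += ["--fsdp_transformer_layer_cls_to_wrap", layer_cls_to_wrap]
        else:
            args += ["--fsdp_min_num_params", str(min_num_params)]
    return args
```

For example, build_fsdp_args("full_shard auto_wrap", min_num_params=2000) yields a list that can be appended to the training script invocation; the value 2000 is only a placeholder and should be tuned to the model.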
🤗 Trainer FSDP integration doc is being updated to reflect the recent updates in this PR https://github.com/huggingface/transformers/pull/18521. Please refer to it for more details.
Thanks for raising this issue! I responded in PT: https://github.com/pytorch/pytorch/issues/82963. I'm not sure, though, whether HF uses nightlies/latest PT or a stable version. If we can't get PyTorch updated in HF to include the fix, could we work around this by changing
model.load_state_dict(state_dict, strict=False)
to
model.load_state_dict(state_dict, False)
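To illustrate why the positional form could help, here is a toy sketch in plain Python. ToyWrapper is a hypothetical stand-in, not the real FSDP class, and its signature (extra options accepted only through *args) is an assumption about the affected wrapper rather than a confirmed copy of it.

```python
from typing import Any, Dict


class ToyWrapper:
    """Toy stand-in for a wrapper module; not the real FSDP class."""

    def __init__(self) -> None:
        self.loaded: Dict[str, Any] = {}
        self.strict = True

    # Assumed signature: extra options are accepted positionally only.
    def load_state_dict(self, state_dict: Dict[str, Any], *args: Any) -> None:
        self.strict = args[0] if args else True
        self.loaded = dict(state_dict)


wrapper = ToyWrapper()
state_dict = {"weight": 1.0}

# Positional form works with the assumed signature.
wrapper.load_state_dict(state_dict, False)

# Keyword form raises TypeError because no parameter is named `strict`.
try:
    wrapper.load_state_dict(state_dict, strict=False)
    keyword_call_failed = False
except TypeError:
    keyword_call_failed = True
```

Under that assumption, passing False positionally reaches the wrapper while strict=False is rejected before any loading happens, which is the behaviour the proposed workaround relies on.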
@rohan-varma Thank you very much. I applied the fix as given in the screenshot and compiled from source. The model is now getting saved in FSDP mode.
Attached image and logs for the same
This should be fixed in PyTorch nightly now: https://github.com/pytorch/pytorch/pull/83309
System Info
Who can help?
@sgugger
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Steps to reproduce the behaviour:
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install .
cd examples/pytorch/image-classification
Expected behavior
Model should get finetuned and saved successfully.
However, the following error is produced
Full example log - fsdp_error.txt
Torch environment details:
The issue seems to have appeared after this commit.