OhadRubin opened this issue 2 years ago
Once the next release of PyTorch XLA is out, we'll start taking a look at this
Hey @muellerzr, is there ongoing work for adding XLA support to FSDP? We, on the AWS SageMaker training compiler side, have started looking into XLA-FSDP and might be able to contribute to adding such support to accelerate.
@Vatshank not yet! It's the next thing on my list to get to after TPU pod support, so would love the help if you guys can! 🙏
Okay cool @muellerzr! Although our focus is on GPUs, I am sure there will be significant overlap in the code for adding support for either device type.
What do you think would be a good way to discuss some of these implementation details? A shared Slack group for development, for instance, if you have one. Also happy to continue to bug you on GitHub, if that's preferred :)
@Vatshank this gh issue should be fine!
@AlexWertheim With your recent pr can we call this request done?
Yeah, I think so. For reference, the PR in question can be seen here. @muellerzr can say better than I can whether this fulfills all requirements where accelerate is concerned.
A recent contribution to the pytorch_xla repo allows using FSDP in PyTorch XLA to shard module parameters across data-parallel workers: https://github.com/pytorch/xla/pull/3431. Some motivation behind this: it may be possible to perform inference with OPT-30B on Google Colab without needing a Pro subscription, which I think many people will appreciate. What will be needed to add it to accelerate?
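For context, here is a minimal sketch of how the XLA FSDP wrapper from that PR might be used directly (i.e., without accelerate). The import path and class name `torch_xla.distributed.fsdp.XlaFullyShardedDataParallel` are assumptions based on the linked PR, so please check them against the torch_xla version you have installed:

```python
# Hedged sketch: direct use of the XLA FSDP wrapper introduced in pytorch/xla#3431.
# Assumes torch_xla is installed and exposes XlaFullyShardedDataParallel.
import torch
import torch_xla.core.xla_model as xm
from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP

device = xm.xla_device()
model = torch.nn.Linear(1024, 1024).to(device)

# Shard the module's parameters across data-parallel XLA workers.
model = FSDP(model)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Toy training step. With FSDP the gradients are already reduced inside the
# wrapper's backward pass, so call optimizer.step() directly instead of
# xm.optimizer_step(optimizer), which would reduce gradients a second time.
inputs = torch.randn(8, 1024, device=device)
loss = model(inputs).sum()
loss.backward()
optimizer.step()
xm.mark_step()  # materialize the lazily traced XLA graph
```

An accelerate integration would presumably hide this wrapping behind its existing plugin/config machinery, which is what this issue is asking about.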