minimaxir / aitextgen

A robust Python tool for text-based AI training and generation using GPT-2.
https://docs.aitextgen.io
MIT License
1.84k stars 218 forks

DeepSpeed + TPU support via transformers' Trainer #97

Open minimaxir opened 3 years ago

minimaxir commented 3 years ago

Currently, training via pytorch-lightning's implementation of DeepSpeed/TPUs is not working, and it's impossible to debug where the issues lie (i.e. within aitextgen, transformers, pytorch-lightning, or pytorch-xla) since the entire ecosystem is very fragile and error messages are unhelpful.

A short-term workaround is to use transformers' native Trainer for DeepSpeed + TPUs (and only those specific use cases for now), as it limits potential breakage and also serves as a baseline for pytorch-lightning's approach once that has stabilized.

The downside is that Trainer is not as good as pytorch-lightning UX-wise, but given that DeepSpeed + TPUs are a niche use case for power users, that's acceptable.
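A minimal sketch of what that workaround might look like. The helper name, output directory, and config path below are hypothetical placeholders; the real hook is the `deepspeed` parameter of transformers' `TrainingArguments`, which accepts a path to a DeepSpeed JSON config (or a dict):

```python
# Sketch only: build keyword arguments for transformers' TrainingArguments
# with DeepSpeed enabled. The function name, output directory, and config
# path are illustrative assumptions, not part of aitextgen.
def deepspeed_training_args(output_dir="outputs", ds_config="ds_config.json"):
    """Return kwargs for TrainingArguments; `deepspeed` is the switch
    that hands optimization over to the DeepSpeed engine."""
    return dict(
        output_dir=output_dir,
        deepspeed=ds_config,            # path to DeepSpeed JSON config
        per_device_train_batch_size=1,
        fp16=True,                      # DeepSpeed setups generally run fp16
    )

# Usage (requires `transformers` and `deepspeed` installed):
# from transformers import Trainer, TrainingArguments
# args = TrainingArguments(**deepspeed_training_args())
# Trainer(model=model, args=args, train_dataset=dataset).train()
```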

minimaxir commented 3 years ago

ZeRO-3 Offload is now available, which in theory should be easier to implement (once it works with base Transformers).
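For reference, ZeRO-3 Offload is driven entirely by the DeepSpeed JSON config; the sketch below builds a minimal one with the stdlib `json` module. The keys follow DeepSpeed's documented `zero_optimization` schema, while the batch size and file name are illustrative assumptions:

```python
import json

# Minimal DeepSpeed config enabling ZeRO stage 3 with CPU offload.
# Keys follow DeepSpeed's `zero_optimization` schema; batch size and
# file name are illustrative placeholders.
zero3_offload_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # ZeRO-3: partition parameters too
        "offload_optimizer": {"device": "cpu"},  # optimizer state -> CPU RAM
        "offload_param": {"device": "cpu"},      # parameters -> CPU RAM
    },
}

# Serialized to disk, this is what would be passed along as
# TrainingArguments(deepspeed="ds_zero3_offload.json").
config_json = json.dumps(zero3_offload_config, indent=2)
```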

williamFalcon commented 3 years ago

thanks for highlighting this!

@SeanNaren can help get this solved on the PL side.

SeanNaren commented 3 years ago

hey @minimaxir what issues are you running into? If you're able to point to issues I can help escalate/resolve them for PL!

ZeRO-3 Offload has its own quirks that HuggingFace Transformers and we will both need to figure out, so it may take a bit longer to integrate; however, we're working together on this where we can. We do have experimental support in place, and can give some pointers if you're keen to try :)

tchaton commented 3 years ago

Dear @minimaxir,

Would you mind joining the PyTorch Lightning Slack? I sent you an invitation. We can coordinate efforts there with Sean and me to resolve your issues.

Best, T.C