microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/

[BUG] I can't run fp8 with pipeline parallel #5760

Open exnx opened 2 months ago

exnx commented 2 months ago

Hi, I am trying to use fp8 with TransformerEngine. I am using a version of the GPT-NeoX repo, which uses DeepSpeed.

I can get fp8 to run in my MLPs with model parallelism, but when I use pipeline parallelism it hangs with no errors: training just freezes as soon as it starts (inside the DeepSpeed library call).

I was wondering if anyone else ran into this issue. Thanks!
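Roughly what the fp8 path looks like, for reference (a simplified sketch, not the exact GPT-NeoX code; the layer sizes and recipe settings below are placeholders):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Delayed-scaling recipe: E4M3 in the forward pass, E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# Stand-in for one MLP block; the real model builds these inside the
# GPT-NeoX transformer layers.
mlp = torch.nn.Sequential(
    te.Linear(4096, 16384, bias=True),
    te.Linear(16384, 4096, bias=True),
).cuda()

x = torch.randn(8, 4096, device="cuda")

# Inside this context the GEMMs in te.Linear run in fp8, with scaling
# factors tracked by the recipe above.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = mlp(x)
```

(As far as I understand, `te.fp8_autocast` also accepts an `fp8_group` argument when the fp8 scaling state needs to be reduced across a specific process group.)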

loadams commented 2 months ago

Hi @exnx - can you share your output so we can see where it is hung? Could you also share your DeepSpeed config and any repro steps you have?
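If it helps, a quick way to see where each rank is frozen is the standard-library `faulthandler` (a generic sketch, not specific to GPT-NeoX):

```python
import faulthandler

# Early in the training script: once 10 minutes have passed, print every
# thread's stack to stderr, and keep doing so every 10 minutes. A hung
# pipeline stage will show which DeepSpeed / communication call it is
# blocked in.
faulthandler.dump_traceback_later(600, repeat=True)
```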

exnx commented 2 months ago

Hello! Sure, I put the error outputs and config (from wandb) in Google Docs. Let me know if this is helpful or if you need more info. Thanks so much!