hpcaitech / ColossalAI-Examples

Examples of training models with hybrid parallelism using ColossalAI
Apache License 2.0

Too large training loss #155

Open · qyc-98 opened this issue 1 year ago

qyc-98 commented 1 year ago

🐛 Describe the bug

Hi,

I'm training BERT with sequence parallelism in Colossal-AI, following this link. However, my training loss is too large, and it seems to grow roughly in proportion to the sequence parallel size.

When my setting is `parallel = dict(pipeline=1, tensor=dict(size=8, mode='sequence'))`, the training loss after 2330 steps is 13.044.

When my setting is `parallel = dict(pipeline=1, tensor=dict(size=2, mode='sequence'))`, the training loss after 2330 steps is 13.044.

When my setting is `parallel = dict(pipeline=1, tensor=dict(size=1, mode='sequence'))`, the training loss after 2330 steps is 6.5549.
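
A pattern like this would be consistent with per-rank losses being summed rather than averaged over the sequence parallel group before logging. Below is a minimal sketch of the kind of reduction I would expect; `seq_group` is assumed to be a handle to the sequence parallel process group, and the helper name is hypothetical rather than part of Colossal-AI's API:

```python
import torch
import torch.distributed as dist

def average_loss_over_seq_group(loss: torch.Tensor, seq_group) -> torch.Tensor:
    """Average a scalar loss over the sequence parallel process group.

    If each rank holds the loss for only its own chunk of the sequence,
    summing across ranks without dividing makes the reported loss scale
    with the sequence parallel size. `seq_group` is an assumed handle to
    the sequence parallel process group, not taken from this issue.
    """
    reduced = loss.detach().clone()
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM, group=seq_group)
    return reduced / dist.get_world_size(group=seq_group)
```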

Environment

After running `colossalai check -i`, I got the output shown in the attached screenshot.

My devices are 8 RTX 3090 GPUs, and the global training batch size is 128 across all three sequence parallel settings.

My training config is shown in the attached screenshot.
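
For reference, here is a minimal sketch of a legacy Colossal-AI config file with these settings; only the `parallel` dict is taken from this issue, while the other names and values are illustrative placeholders rather than a copy of the screenshot:

```python
# config.py -- minimal sketch; only `parallel` is taken verbatim from
# this issue, the remaining values are illustrative placeholders.
GLOBAL_BATCH_SIZE = 128   # total batch size across all settings (see above)
SEQ_LENGTH = 512          # assumed BERT sequence length
NUM_EPOCHS = 1            # placeholder

parallel = dict(
    pipeline=1,
    tensor=dict(size=8, mode='sequence'),  # sequence parallel size under test
)
```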

Thanks!

binmakeswell commented 1 year ago

Hi @qyc-98, thank you for your feedback. We will try to reproduce your issue.

By the way, we are restructuring the documentation and examples; the new examples will be provided at the following link: https://github.com/hpcaitech/ColossalAI/tree/main/examples