hpcaitech / ColossalAI-Examples

Examples of training models with hybrid parallelism using ColossalAI
Apache License 2.0

Too large training loss #155

Open · qyc-98 opened this issue 1 year ago

qyc-98 commented 1 year ago

🐛 Describe the bug

Hi,

I'm training BERT with sequence parallelism in Colossal-AI, following this link. However, my training loss is too large, and it seems to grow roughly in proportion to the sequence parallel size.

When my setting is `parallel = dict(pipeline=1, tensor=dict(size=8, mode='sequence'))`, the training loss after 2330 steps is 13.044.

When my setting is `parallel = dict(pipeline=1, tensor=dict(size=2, mode='sequence'))`, the training loss after 2330 steps is 13.044.

When my setting is `parallel = dict(pipeline=1, tensor=dict(size=1, mode='sequence'))`, the training loss after 2330 steps is 6.5549.
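
A pattern like this would be consistent with per-rank losses being summed rather than averaged over the sequence parallel group before logging. Below is a minimal sketch of the kind of reduction I would expect; `seq_group` is assumed to be a handle to the sequence parallel process group, and the helper name is hypothetical rather than part of Colossal-AI's API:

```python
import torch
import torch.distributed as dist

def average_loss_over_seq_group(loss: torch.Tensor, seq_group) -> torch.Tensor:
    """Average a scalar loss over the sequence parallel process group.

    If each rank holds the loss for only its own chunk of the sequence,
    summing across ranks without dividing makes the reported loss scale
    with the sequence parallel size. `seq_group` is an assumed handle to
    the sequence parallel process group, not taken from this issue.
    """
    reduced = loss.detach().clone()
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM, group=seq_group)
    return reduced / dist.get_world_size(group=seq_group)
```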

Environment

After running `colossalai check -i`, I got the output shown in the attached screenshot.

My devices are 8 RTX 3090 GPUs, and the global training batch size is 128 across all three sequence parallel settings.

My training config is shown in the attached screenshot.
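
For reference, here is a minimal sketch of a legacy Colossal-AI config file with these settings; only the `parallel` dict is taken from this issue, while the other names and values are illustrative placeholders rather than a copy of the screenshot:

```python
# config.py -- minimal sketch; only `parallel` is taken verbatim from
# this issue, the remaining values are illustrative placeholders.
GLOBAL_BATCH_SIZE = 128   # total batch size across all settings (see above)
SEQ_LENGTH = 512          # assumed BERT sequence length
NUM_EPOCHS = 1            # placeholder

parallel = dict(
    pipeline=1,
    tensor=dict(size=8, mode='sequence'),  # sequence parallel size under test
)
```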

Thanks!

binmakeswell commented 1 year ago

Hi @qyc-98, thank you for your feedback. We will try to reproduce your issue.

By the way, we are restructuring the documentation and examples; the new examples will be provided at the following link: https://github.com/hpcaitech/ColossalAI/tree/main/examples