Open qyc-98 opened 1 year ago
Hi @qyc-98, thank you for your feedback. We will try to reproduce your issue.
By the way, we are restructuring the documentation and examples; the new examples will be provided at the following link: https://github.com/hpcaitech/ColossalAI/tree/main/examples
🐛 Describe the bug
Hi,
I'm training BERT with sequence parallelism in Colossal-AI, following this link. My training loss is abnormally large, and it appears to grow linearly with the sequence-parallel size.
- With `parallel = dict(pipeline=1, tensor=dict(size=8, mode='sequence'))`, the training loss in the beginning was … and after 2330 steps the training loss is 13.044.
- With `parallel = dict(pipeline=1, tensor=dict(size=2, mode='sequence'))`, after 2330 steps the training loss is 13.044.
- With `parallel = dict(pipeline=1, tensor=dict(size=1, mode='sequence'))`, after 2330 steps the training loss is 6.5549.
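Not part of the original report, but one hypothesis worth checking against the numbers above: if the per-rank losses are combined with an all-reduce SUM instead of a MEAN across sequence-parallel ranks, the *reported* loss scales with the sequence-parallel size even though training itself is unaffected. A minimal, self-contained sketch (the function name and numbers are illustrative, not from Colossal-AI):

```python
def reported_loss(per_rank_loss: float, sp_size: int, reduce: str) -> float:
    """Simulate combining identical per-rank losses across `sp_size`
    sequence-parallel ranks. An all-reduce SUM inflates the reported
    value by a factor of sp_size; a MEAN leaves it unchanged.
    """
    total = per_rank_loss * sp_size  # all-reduce SUM over sp_size ranks
    return total if reduce == "sum" else total / sp_size

# Hypothetical check using the size=1 loss from the report (6.5549):
base = 6.5549
for sp in (1, 2, 8):
    print(f"sp={sp}  sum-reduced={reported_loss(base, sp, 'sum'):.4f}  "
          f"mean-reduced={reported_loss(base, sp, 'mean'):.4f}")
```

With a SUM reduction, size=2 would report roughly 2 × 6.5549 ≈ 13.11, close to the observed 13.044; the size=8 value does not fit an 8× scaling, so this sketch only explains part of the symptom and is offered as a debugging direction, not a diagnosis.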
Environment
After running `colossalai check -i`, I got: ![image](https://user-images.githubusercontent.com/38046403/178687051-45e3b5a9-4c0d-4b44-ba11-65b248f4b54d.png)
My devices are 8 RTX 3090 GPUs, and the training batch size is 128 across all three sequence-parallel settings.
My training config is:
Thanks!