imgaojun opened this issue 1 year ago (status: Open)
Same error. I ran into this problem when I set TP=1 and bf16. By the way, do you encounter `assert all_groups_norm > 0` when using bf16?
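For context, that assertion fires when the computed global gradient norm comes out non-positive (e.g. all-zero or NaN/Inf gradients). A minimal sanity check you could run before the optimizer step is sketched below; the helper name and reporting are illustrative, not part of Megatron-DeepSpeed:

```python
import math
import torch

def check_global_grad_norm(model):
    """Illustrative helper (not part of Megatron-DeepSpeed): reports the
    global grad norm so you can see whether it is zero, NaN, or Inf
    before an `assert all_groups_norm > 0`-style check trips downstream."""
    total_sq = 0.0
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        n = p.grad.data.float().norm().item()
        if math.isnan(n) or math.isinf(n):
            print(f"bad gradient in {name}: norm={n}")
        total_sq += n * n
    print(f"global grad norm: {math.sqrt(total_sq)}")
```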
I'm hitting the same problem, too.
A quick but ugly workaround is to comment out the Apex Adam and use the nightly-build PyTorch Adam with the fused option set to True in this file: https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/optimizer/__init__.py.
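Roughly, the change in that file looks like the sketch below. The surrounding names (`param_groups`, `args.*`) are assumptions about the local code, so treat this as a sketch rather than a drop-in patch:

```python
import torch
# from apex.optimizers import FusedAdam as Adam  # comment out the Apex optimizer

# Use PyTorch's native Adam with the fused CUDA kernel instead; fused=True
# requires a sufficiently recent (e.g. nightly) PyTorch build.
optimizer = torch.optim.Adam(
    param_groups,            # assumed: the parameter groups built in __init__.py
    lr=args.lr,
    weight_decay=args.weight_decay,
    betas=(args.adam_beta1, args.adam_beta2),
    eps=args.adam_eps,
    fused=True,
)
```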
I tried this workaround: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/249
> Same error. I ran into this problem when I set TP=1 and bf16. By the way, do you encounter `assert all_groups_norm > 0` when using bf16?

Yes, I hit the all_groups_norm assertion when training Llama 7B. @KenwayZZZ
> Same error. I ran into this problem when I set TP=1 and bf16. By the way, do you encounter `assert all_groups_norm > 0` when using bf16?
>
> Yes, I hit the all_groups_norm assertion when training Llama 7B. @KenwayZZZ

Have you found a workaround for this? I get the same error when using GPT-J-6B.
I'd suggest replacing the Adam implementation with the PyTorch one, which uses 64-bit integer indexing.
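For background, as I understand it Apex's fused optimizer kernels have historically used 32-bit indexing, so a flattened parameter or optimizer-state tensor with more than 2^31 - 1 elements can trigger an illegal memory access, whereas PyTorch's implementation handles such sizes. A quick illustrative check (the helper is hypothetical, not from either library):

```python
INT32_MAX = 2**31 - 1  # largest offset a 32-bit signed index can address

def find_oversized_tensors(model):
    # Flag parameters (and hence their optimizer states) whose element
    # count exceeds what 32-bit indexing can address.
    for name, p in model.named_parameters():
        if p.numel() > INT32_MAX:
            print(f"{name}: {p.numel():,} elements exceeds 32-bit indexing")
```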
> Same error. I ran into this problem when I set TP=1 and bf16. By the way, do you encounter `assert all_groups_norm > 0` when using bf16?
>
> Yes, I hit the all_groups_norm assertion when training Llama 7B. @KenwayZZZ
>
> Have you found a workaround for this? I get the same error when using GPT-J-6B.

Not yet. I've tried `torch.AdamW(fused=True)`, but without luck; the all_groups_norm assertion still occurs after dozens of steps. @au-revoir @Godricly
Which version of torch are you using? I tried with a nightly-build 2.2.0 version before and it worked.
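For reference, a quick way to confirm which build you are on (the version string in the comment below is only an example of what a nightly looks like):

```python
import torch

print(torch.__version__)          # e.g. '2.2.0.dev20231001+cu121' for a nightly
print(torch.cuda.is_available())  # fused Adam/AdamW requires CUDA tensors
```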
While running a llama2 pretraining script with specific configurations, I encountered an illegal memory access error. The detailed error message is as follows:
Configuration Details