Describe the bug: Merging the optimizer state from TP=4 to TP=2 fails intermittently: out of about three runs, roughly two raise the error shown below. After I sorted the names, it worked: I reran it around ten times and the error no longer occurred.
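To illustrate the workaround, here is a minimal sketch of what "sorting the names" could look like. It is not the library's actual merging code: `merge_tp_shards`, the shard layout (one dict per TP rank mapping a name to a list of values), and the assumption that the names are parameter names are all hypothetical. The only point is that iterating over `sorted(names)` gives the same order on every run, whereas iterating over a `set` of strings can differ between Python processes.

```python
from typing import Dict, List


def merge_tp_shards(shards: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Hypothetical helper: concatenate per-rank shards for each name.

    Names are visited in sorted order so the merge order is identical
    on every run, independent of set/hash ordering.
    """
    # Collect every name that appears in any rank's shard.
    names = set()
    for shard in shards:
        names.update(shard.keys())

    merged: Dict[str, List[float]] = {}
    for name in sorted(names):  # deterministic order (the workaround)
        # Concatenate this name's values rank by rank.
        merged[name] = [x for shard in shards for x in shard.get(name, [])]
    return merged


# Two hypothetical TP ranks, each holding a slice of "w" and "b".
example = merge_tp_shards([
    {"w": [1.0], "b": [0.0]},
    {"w": [2.0], "b": [0.5]},
])
```

If the intermittent failure really does come from an unordered collection of names, a change along these lines would explain why the error disappears; the report itself does not confirm the root cause.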
Reproduce:
Resume training from an existing checkpoint, merging the optimizer state from TP=4 to TP=2:
The error: