I've noticed that the `--use_transformer_engine2` flag is disabled in the configurations for multi-node training with more than 8 nodes. I've also noticed that training is slower when I enable Transformer Engine in this case. Can anyone point out why FP8 training is slower here?
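For reference, here is a minimal sketch of how FP8 is typically enabled with Transformer Engine, using the standard `transformer_engine.pytorch` API (the layer sizes and scaling recipe below are illustrative, not the repo's actual configuration):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# A single TE layer stands in for the real model here (illustrative sizes).
model = te.Linear(768, 768, bias=True).cuda()
inp = torch.randn(32, 768, device="cuda")

# Delayed-scaling recipe: HYBRID uses E4M3 forward / E5M2 backward.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# FP8 is active only inside this context: GEMMs run in FP8 while
# amax statistics are tracked to update the scaling factors.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)
out.sum().backward()
```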