I've noticed that the `--use_transformer_engine2` flag is disabled in the configurations for multi-node training with more than 8 nodes. I've also noticed that training is slower when I enable Transformer Engine in this case. Can anyone point out why FP8 training is slower here?
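For reference, here is a minimal sketch of how FP8 is typically enabled with Transformer Engine, using the standard `transformer_engine.pytorch` API (the layer sizes and scaling recipe below are illustrative, not the repo's actual configuration):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# A single TE layer stands in for the real model here (illustrative sizes).
model = te.Linear(768, 768, bias=True).cuda()
inp = torch.randn(32, 768, device="cuda")

# Delayed-scaling recipe: HYBRID uses E4M3 forward / E5M2 backward.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

# FP8 is active only inside this context: GEMMs run in FP8 while
# amax statistics are tracked to update the scaling factors.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)
out.sum().backward()
```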