microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

BF16 training benchmarks. #4904

Open · ethansmith2000 opened this issue 5 months ago

ethansmith2000 commented 5 months ago

I couldn't find any documentation on benchmarks around bf16/fp16 training. It caught me a bit off guard, as I noticed the weights themselves are put in that precision, which is different from the usual mixed-precision schemes I've worked with.
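To illustrate the difference being described here, a small plain-PyTorch sketch (not DeepSpeed internals; requires a CUDA device): under torch AMP the parameters stay in fp32 and only the compute inside the autocast region runs in bf16, whereas casting the module itself stores the weights in bf16, which is what the commenter observed DeepSpeed's bf16 mode doing.

```python
import torch

model = torch.nn.Linear(16, 16).cuda()
x = torch.randn(4, 16, device="cuda")

# Usual torch AMP: weights stay fp32, only the autocast region computes in bf16.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)
print(model.weight.dtype, y.dtype)   # torch.float32 torch.bfloat16

# What was observed with DeepSpeed's bf16 mode: the weights themselves are
# stored in bf16 (shown here by casting the module directly).
model_bf16 = torch.nn.Linear(16, 16).bfloat16().cuda()
print(model_bf16.weight.dtype)       # torch.bfloat16
```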

I benchmarked the BingBertSquad example with the following config:

args = { "seed": 42, "train_batch_size": 3, "gradient_accumulation_steps": 1, "do_lower_case": True, "bert_model": "bert-base-uncased", "dropout_p": 0.1, "train_file": "/efs/squad/train-v1.1.json", "predict_file": "/efs/squad/dev-v1.1.json", "num_train_epochs": 1, "output_dir": "/efs/squad/output", "max_seq_length": 384, "doc_stride": 128, "max_query_length": 64, "loss_plot_alpha": 0.9, "warmup_proportion": 0.1, "learning_rate": 3e-5, "print_steps": 100, "predict_batch_size": 8, "n_best_size": 20, "max_answer_length": 30, "verbose_logging": 1, "job_name": "squad", "max_steps": 99999999999, "max_steps_per_epoch": 99999999999, }

and ds_config
```python
ds_config = {
    "gradient_accumulation_steps": 1,
    "steps_per_print": 2000,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.0001,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01,
        },
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 0.00001,
            "warmup_num_steps": 1000,
        },
    },
    "gradient_clipping": 1.0,
    "prescale_gradients": False,
    "bf16": {
        "enabled": True,
    },
    "wall_clock_breakdown": False,
}
```

changing only the precision settings between runs (see the sketch below); everything else was kept the same.
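Roughly, the precision toggle between the runs looks like this (sketch only; the fp16 section is assumed to follow DeepSpeed's documented schema with its defaults, and all other ds_config keys stay unchanged):

```python
# BF16 run
ds_config["bf16"] = {"enabled": True}

# FP16 run (DeepSpeed uses dynamic loss scaling by default)
ds_config["bf16"] = {"enabled": False}
ds_config["fp16"] = {"enabled": True}

# FP32 run: neither precision section enabled
ds_config["bf16"] = {"enabled": False}
ds_config["fp16"] = {"enabled": False}
```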

These are the scores:

| Precision | exact_match | f1 |
|-----------|-------------|----|
| BF16 | 64.38032166508988 | 74.72591312745307 |
| FP32 | 77.09555345316934 | 85.53713988174498 |
| FP16 | 77.12393566698202 | 85.45339512076741 |

I see that AMP (automatic mixed precision) in the DeepSpeed config is not compatible with ZeRO, but is that a hard limitation? That is, if I were to manually cast everything, would it work?

stas00 commented 5 months ago

> I see that AMP (automatic mixed precision) in the DeepSpeed config is not compatible with ZeRO, but is that a hard limitation? That is, if I were to manually cast everything, would it work?

DeepSpeed implements its own variation of AMP, so if you look at the integration libraries (Accelerate or the HF Trainer), they skip torch's AMP when DeepSpeed is in use.
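An illustrative sketch of that pattern (not the actual Accelerate/Trainer code; names are hypothetical): the forward pass is wrapped in torch.autocast only when DeepSpeed is not already handling precision itself.

```python
import contextlib
import torch

def forward_context(using_deepspeed: bool):
    if using_deepspeed:
        # The DeepSpeed engine already runs the model in bf16/fp16 per ds_config,
        # so the extra torch.autocast wrapper is skipped.
        return contextlib.nullcontext()
    return torch.autocast(device_type="cuda", dtype=torch.bfloat16)

with forward_context(using_deepspeed=True):
    ...  # model forward
```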

You shouldn't need to do any manual casting; just train w/o AMP.
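A minimal sketch of what that looks like with the DeepSpeed engine API (assuming `engine` came out of `deepspeed.initialize(...)` with the bf16 config above, and `dataloader` yields appropriately-typed tensors on `engine.device`; these names are illustrative):

```python
import torch.nn.functional as F

for batch, labels in dataloader:
    outputs = engine(batch)   # forward runs in the precision set by ds_config; no torch.autocast wrapper
    loss = F.cross_entropy(outputs, labels)
    engine.backward(loss)     # the engine handles gradients per ds_config
    engine.step()             # optimizer + WarmupLR scheduler step
```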