microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

BF16 training benchmarks. #4904

Open · ethansmith2000 opened this issue 5 months ago

ethansmith2000 commented 5 months ago

I couldn't find any documentation on benchmarks around bf16/fp16 training. It caught me a bit off guard, as I noticed the weights themselves are put in that precision, which is different from the usual mixed-precision schemes I've worked with.
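To illustrate the difference being described here, a small plain-PyTorch sketch (not DeepSpeed internals; requires a CUDA device): under torch AMP the parameters stay in fp32 and only the compute inside the autocast region runs in bf16, whereas casting the module itself stores the weights in bf16, which is what the commenter observed DeepSpeed's bf16 mode doing.

```python
import torch

model = torch.nn.Linear(16, 16).cuda()
x = torch.randn(4, 16, device="cuda")

# Usual torch AMP: weights stay fp32, only the autocast region computes in bf16.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)
print(model.weight.dtype, y.dtype)   # torch.float32 torch.bfloat16

# What was observed with DeepSpeed's bf16 mode: the weights themselves are
# stored in bf16 (shown here by casting the module directly).
model_bf16 = torch.nn.Linear(16, 16).bfloat16().cuda()
print(model_bf16.weight.dtype)       # torch.bfloat16
```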

I benchmarked the BingBertSquad example with the following config:

args = { "seed": 42, "train_batch_size": 3, "gradient_accumulation_steps": 1, "do_lower_case": True, "bert_model": "bert-base-uncased", "dropout_p": 0.1, "train_file": "/efs/squad/train-v1.1.json", "predict_file": "/efs/squad/dev-v1.1.json", "num_train_epochs": 1, "output_dir": "/efs/squad/output", "max_seq_length": 384, "doc_stride": 128, "max_query_length": 64, "loss_plot_alpha": 0.9, "warmup_proportion": 0.1, "learning_rate": 3e-5, "print_steps": 100, "predict_batch_size": 8, "n_best_size": 20, "max_answer_length": 30, "verbose_logging": 1, "job_name": "squad", "max_steps": 99999999999, "max_steps_per_epoch": 99999999999, }

and ds_config
```python
ds_config = {
    "gradient_accumulation_steps": 1,
    "steps_per_print": 2000,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.0001,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01,
        },
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 0.00001,
            "warmup_num_steps": 1000,
        },
    },
    "gradient_clipping": 1.0,
    "prescale_gradients": False,
    "bf16": {
        "enabled": True,
    },
    "wall_clock_breakdown": False,
}
```

changing only the precision settings between runs (see the sketch below); everything else was kept the same.
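Roughly, the precision toggle between the runs looks like this (sketch only; the fp16 section is assumed to follow DeepSpeed's documented schema with its defaults, and all other ds_config keys stay unchanged):

```python
# BF16 run
ds_config["bf16"] = {"enabled": True}

# FP16 run (DeepSpeed uses dynamic loss scaling by default)
ds_config["bf16"] = {"enabled": False}
ds_config["fp16"] = {"enabled": True}

# FP32 run: neither precision section enabled
ds_config["bf16"] = {"enabled": False}
ds_config["fp16"] = {"enabled": False}
```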

These are the scores:

| Precision | exact_match | f1 |
|-----------|-------------|----|
| BF16 | 64.38032166508988 | 74.72591312745307 |
| FP32 | 77.09555345316934 | 85.53713988174498 |
| FP16 | 77.12393566698202 | 85.45339512076741 |

I see that AMP (automatic mixed precision) in the DeepSpeed config is not compatible with ZeRO, but is that a hard limitation? That is, if I were to manually cast everything, would it work?

stas00 commented 5 months ago

> I see that AMP (automatic mixed precision) in the DeepSpeed config is not compatible with ZeRO, but is that a hard limitation? That is, if I were to manually cast everything, would it work?

DeepSpeed implements its own variation of AMP, so if you look at the integration libraries (Accelerate or the HF Trainer), they skip torch's AMP when DeepSpeed is in use.
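An illustrative sketch of that pattern (not the actual Accelerate/Trainer code; names are hypothetical): the forward pass is wrapped in torch.autocast only when DeepSpeed is not already handling precision itself.

```python
import contextlib
import torch

def forward_context(using_deepspeed: bool):
    if using_deepspeed:
        # The DeepSpeed engine already runs the model in bf16/fp16 per ds_config,
        # so the extra torch.autocast wrapper is skipped.
        return contextlib.nullcontext()
    return torch.autocast(device_type="cuda", dtype=torch.bfloat16)

with forward_context(using_deepspeed=True):
    ...  # model forward
```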

You shouldn't need to do any manual casting; just train w/o AMP.
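A minimal sketch of what that looks like with the DeepSpeed engine API (assuming `engine` came out of `deepspeed.initialize(...)` with the bf16 config above, and `dataloader` yields appropriately-typed tensors on `engine.device`; these names are illustrative):

```python
import torch.nn.functional as F

for batch, labels in dataloader:
    outputs = engine(batch)   # forward runs in the precision set by ds_config; no torch.autocast wrapper
    loss = F.cross_entropy(outputs, labels)
    engine.backward(loss)     # the engine handles gradients per ds_config
    engine.step()             # optimizer + WarmupLR scheduler step
```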