mlcommons / training_results_v1.0

This repository contains the results and code for the MLPerf™ Training v1.0 benchmark.
https://mlcommons.org/en/training-normal-10/
Apache License 2.0

mlcommons training_results_v1.0 pytorch bert model fails to run on V100 multi-GPU #3

Closed terU3760 closed 2 years ago

terU3760 commented 3 years ago

Hi, all. I tried to run the mlcommons training_results_v1.0 PyTorch BERT model on multiple V100 GPUs, but it failed. I modified the run_test.sh script as follows:

#!/bin/bash

python -m torch.distributed.launch --nproc_per_node=2 \
    /workspace/bert/run_pretraining.py \
    --seed=42 \
    --do_train \
    --target_mlm_accuracy=0.714 \
    --skip_checkpoint \
    --output_dir=/results \
    --fp16 \
    --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
    --gradient_accumulation_steps=1 \
    --log_freq=1 \
    --train_batch_size=4 \
    --learning_rate=4e-5 \
    --warmup_proportion=1.0 \
    --input_dir=/data/2048_shards_uncompressed  \
    --phase2 \
    --max_seq_length=512 \
    --max_predictions_per_seq=76 \
    --max_steps=100 \
    --init_checkpoint=/data/model.ckpt-28252.pt \
    --bert_config_path=/data/bert_config.json \
    --distributed_lamb   --dwu-num-rs-pg=1 --dwu-num-ar-pg=1 --dwu-num-blocks=1  \
    --eval_iter_start_samples=100000 --eval_iter_samples=100000 \
    --eval_batch_size=16 --eval_dir=/data/2048_shards_uncompressed \
    --fused_gelu_bias --fused_mha --dense_seq_output --unpad --unpad_fmha --exchange_padding

When run, it reports the following error (both worker processes print the same traceback):

......
Torch distributed is available.
Torch distributed is initialized.
Traceback (most recent call last):
  File "/workspace/bert/run_pretraining.py", line 1592, in <module>
    args, final_loss, train_time_raw = main()
  File "/workspace/bert/run_pretraining.py", line 1141, in main
    model = fwd_loss_bwd_trainer.capture_bert_model_segment_graph(model, use_cuda_graph)
  File "/workspace/bert/fwd_loss_bwd_trainer.py", line 43, in capture_bert_model_segment_graph
    bert_model_segment = graph(bert_model_segment,
  File "/workspace/bert/function.py", line 66, in graph
    outputs = func_or_module(*sample_args)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 1009, in forward
    sequence_output, pooled_output = self.bert(input_ids, token_type_ids, attention_mask,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 901, in forward
    encoded_layers = self.encoder(embedding_output,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 577, in forward
    hidden_states = layer_module(hidden_states, cu_seqlens, actual_seqlens, maxseqlen_in_batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 500, in forward
    attention_output = self.attention(hidden_states, attention_mask, seqlen, batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 424, in forward
    self_output = self.self(input_tensor, attention_mask, seqlen, batch, is_training = self.training)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/fmha.py", line 161, in forward
    ctx = FMHAFun.apply(qkv.view(-1, 3, self.h, self.d), cu_seqlens, seqlens, p_dropout, max_s, is_training)
  File "/opt/conda/lib/python3.8/site-packages/apex/contrib/fmha/fmha.py", line 36, in forward
    context, S_dmask = mha.fwd(qkv, cu_seqlens, seqlens, p_dropout, max_s, is_training, None)
RuntimeError: Expected dprops->major == 8 && dprops->minor == 0 to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
......

What could be the cause?

jqueguiner commented 2 years ago

Hi @terU3760, this is related to dprops = device properties/capabilities of the CUDA device you are running the benchmark on.

Basically it is expected to be 8 (major compute capability) and 0 (minor compute capability).

If I remember correctly, the V100 is 7.0 while the A100 is 8.0.

I wrote a small tool to get the device props; I will share it later.

image
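
For reference, a minimal sketch of such a check (not the actual tool from the screenshot; it just assumes PyTorch is installed) could look like:

import torch

# Print the CUDA compute capability of every visible device.
# The fused FMHA path used by this benchmark asserts (major, minor) == (8, 0), i.e. A100.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    print(f"GPU {i}: {name}, compute capability {major}.{minor}")

A V100 reports 7.0 here, which is why the dprops->major == 8 check in apex's fmha extension fails.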

jqueguiner commented 2 years ago

Also be aware that it's very likely you'll have to change the batch size to fit into the V100 memory (V100 = 16 GB; V100S = 32 GB).
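
Something like the following (again just an illustrative sketch assuming PyTorch) prints the total memory per GPU, which helps when picking a per-GPU --train_batch_size:

import torch

# Report total memory per visible GPU so the batch size can be sized accordingly.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.1f} GiB")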

jqueguiner commented 2 years ago
image

terU3760 commented 2 years ago

Hi @jqueguiner, thanks a lot! I have resolved it myself.

jqueguiner commented 2 years ago

Out of curiosity, how did you solve it?

terU3760 commented 2 years ago

@jqueguiner Spend more money on both sides!

jqueguiner commented 2 years ago

--verbose ?

terU3760 commented 2 years ago

@jqueguiner As I said: spend more money on both sides. After we replaced the V100s with A100s, the error never occurred again and the problem was solved.

jqueguiner commented 2 years ago

Yes, so moving to the A100 gets you compute capability 8.0 ;-) Thanks a lot for the reply!