mlcommons / training_results_v1.1

This repository contains the results and code for the MLPerf™ Training v1.1 benchmark.
https://mlcommons.org/en/training-normal-11/
Apache License 2.0

FMHA error when reproducing DELL BERT benchmark #4

Open elimkwan opened 2 years ago

elimkwan commented 2 years ago

We tried to follow the Dell example to reproduce the BERT training benchmark on a server with 2 GPUs. We encountered an error when running the model's encoder layer, related to the fmhalib.fwd function: Expected dprops->major == 8 && dprops->minor == 0 to be true, but got false.

The error is raised on the last line of the following snippet:

import torch
import fmhalib as mha

class FMHAFun(torch.autograd.Function):
    @staticmethod
    def forward(ctx, qkv, cu_seqlens, p_dropout, max_s, is_training):
        b = cu_seqlens.numel() - 1  # number of sequences in the packed batch

        if b < 4:
            max_s = 512
            context, S_dmask = mha.fwd_nl(qkv, cu_seqlens, p_dropout, max_s, is_training, None)
        else:
            # the RuntimeError above is raised here
            context, S_dmask = mha.fwd(qkv, cu_seqlens, p_dropout, max_s, is_training, None)

It seems related to the error mentioned here, but I am not entirely sure how to apply their fix (unpadding the qkv).


System Used

CPU

CPU(s):                          48
On-line CPU(s) list:             0-47
Thread(s) per core:              2
Core(s) per socket:              24
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz
NUMA node0 CPU(s):               0-47

GPU

Driver Version: 510.47.03
CUDA Version: 11.6
NVIDIA RTX A5000 x2

Software

PyTorch v1.10.1

Error Reproduction

To reproduce the error, we used the following settings. We created two config files (config_SUT.sh and config_SUT_common.sh) and ran the code interactively inside a Docker container.

Configs in config_SUT.sh

## DL params
export BATCHSIZE=64
export GRADIENT_STEPS=1
export LR=3.5e-4
export MAX_SAMPLES_TERMINATION=4500000
export MAX_STEPS=7100
export OPT_LAMB_BETA_1=0.9
export OPT_LAMB_BETA_2=0.999
export START_WARMUP_STEP=0
export WARMUP_PROPORTION=0.0
export EXTRA_PARAMS="--dense_seq_output --unpad --unpad_fmha --exchange_padding"
export PHASE=2
export EVAL_ITER_START_SAMPLES=150000
export EVAL_ITER_SAMPLES=150000

## System run parms
export DGXNNODES=1
export DGXSYSTEM=$(basename $(readlink -f ${BASH_SOURCE[0]}) | sed 's/^config_//' | sed 's/\.sh$//' )
export WALLTIME=01:15:00

## System config params
source config_SUT_common.sh

Configs in config_SUT_common.sh

## System config params
export DGXNGPU=2
export DGXSOCKETCORES=24
export DGXNSOCKET=1
export DGXHT=2
export SLURM_NTASKS=${DGXNGPU}

After building the Docker image mlperf-nvidia:language_model, enter the container with the following command:

nvidia-docker run -it --privileged --network host \
--ipc=host -v /data/bert/phase1:/workspace/phase1 \
-v /data/bert/hdf5/training-4320/hdf5_4320_shards_varlength:/workspace/data_phase2 \
--name language_model mlperf-nvidia:language_model

Running the program:

export CUDA_VISIBLE_DEVICES=0,1
export NEXP=1
source config_SUT.sh
./run_and_time.sh

Error Log:

##binding cmd: ['/usr/bin/numactl', '--physcpubind=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46', 
'--membind=0', '/opt/conda/bin/python', '-u', '/workspace/bert/run_pretraining.py', '--local_rank=0', 
'--train_batch_size=64', '--learning_rate=3.5e-4', '--opt_lamb_beta_1=0.9', '--opt_lamb_beta_2=0.999',
'--warmup_proportion=0.0', '--warmup_steps=0.0', '--start_warmup_step=0', '--max_steps=7100', '--phase2',
'--max_seq_length=512', '--max_predictions_per_seq=76', '--input_dir=/workspace/data_phase2',
'--init_checkpoint=/workspace/phase1/model.ckpt-28252.pt', '--do_train', '--skip_checkpoint',
'--train_mlm_accuracy_window_size=0', '--target_mlm_accuracy=0.720', '--weight_decay_rate=0.01',
'--max_samples_termination=4500000', '--eval_iter_start_samples=150000', '--eval_iter_samples=150000',
'--eval_batch_size=16', '--eval_dir=/workspace/evaldata', '--num_eval_examples', '10000', 
'--cache_eval_data','--output_dir=/results', '--fp16', '--fused_bias_fc', '--fused_bias_mha', 
'--fused_dropout_add', '--distributed_lamb','--dwu-num-rs-pg=1', '--dwu-num-ar-pg=1', '--dwu-num-ag-pg=1',
'--dwu-num-blocks=1', '--gradient_accumulation_steps=1', '--log_freq=0', 
'--bert_config_path=/workspace/phase1/bert_config.json', '--dense_seq_output', '--unpad', '--unpad_fmha',
'--exchange_padding', '--allreduce_post_accumulation', '--allreduce_post_accumulation_fp16', '--seed=15572']
##local_rank: 0
...
Traceback (most recent call last):
  File "/workspace/bert/run_pretraining.py", line 1744, in <module>
    args, final_loss, train_time_raw = main()
  File "/workspace/bert/run_pretraining.py", line 1237, in main
    model = fwd_loss_bwd_trainer.capture_bert_model_segment_graph(model, use_cuda_graph)
  File "/workspace/bert/fwd_loss_bwd_trainer.py", line 99, in capture_bert_model_segment_graph
    bert_model_segment = graph(bert_model_segment,
  File "/workspace/bert/function.py", line 73, in graph
    outputs  = func_or_module(*sample_args)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 1095, in forward
    sequence_output, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, position_ids,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 987, in forward
    encoded_layers = self.encoder(embedding_output,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 674, in forward
    hidden_states = layer_module(hidden_states, cu_seqlens, maxseqlen_in_batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 605, in forward
    attention_output = self.attention(hidden_states, attention_mask, seqlen, batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 494, in forward
    self_output = self.self(input_tensor, cu_seqlens, max_s, is_training=self.training)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/fmha.py", line 213, in forward
    ctx = FMHAFun.apply(qkv.contiguous().view(-1, 3, self.h, self.d), cu_seqlens, p_dropout, max_s, is_training)
  File "/workspace/bert/fmha.py", line 32, in forward
    context, S_dmask = mha.fwd(qkv, cu_seqlens, p_dropout, max_s, is_training, None)
RuntimeError: Expected dprops->major == 8 && dprops->minor == 0 to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
ENDING TIMING RUN AT 2022-04-20 10:46:14 AM
RESULT,bert,15572,13,,2022-04-20 10:46:01 AM
erichan1 commented 2 years ago

I believe the issue is that you need to pack your input sequences together, i.e. instead of your input looking like (batch_size, max_seq_len, ...), it should be (batch_size * seq_len, ...), with seq_len variable. So you remove all the zero padding tokens and concatenate all your sequences together. Then you can pass the sequence lengths through cu_seqlens so the MHA knows where each sequence starts and stops.

See qkv and qkv_vs: https://github.com/NVIDIA/apex/blob/master/apex/contrib/test/fmha/test_fmha.py
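
As a rough illustration of that packing (a minimal sketch, assuming an attention_mask of 1s for real tokens and 0s for padding; unpad_and_pack is an illustrative name, not the benchmark's actual helper):

import torch

def unpad_and_pack(hidden, attention_mask):
    # hidden:         (batch_size, max_seq_len, hidden_dim)
    # attention_mask: (batch_size, max_seq_len), 1 for real tokens, 0 for padding
    seqlens = attention_mask.sum(dim=1, dtype=torch.int32)   # per-sequence lengths
    cu_seqlens = torch.zeros(seqlens.numel() + 1, dtype=torch.int32, device=hidden.device)
    cu_seqlens[1:] = torch.cumsum(seqlens, dim=0)             # cumulative offsets marking sequence boundaries
    packed = hidden[attention_mask.bool()]                    # (total_tokens, hidden_dim), padding removed
    return packed, cu_seqlens, int(seqlens.max())             # max_s is the longest real sequence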

elimkwan commented 2 years ago

Hi Eric, thanks for the prompt reply, but I believe the issue hasn't been resolved. The input has already been densely packed as

qkv.view(-1, 3, self.h, self.d)

I have checked the first dimension of the input; it is equal to batch_size * seq_len, and unfortunately the issue persists. I believe dprops in the error message refers to the device's compute capability: the major and minor versions must be 8 and 0 respectively. According to the Apex source code:

auto dprops = at::cuda::getCurrentDeviceProperties();
TORCH_CHECK(dprops->major == 8 && dprops->minor == 0);
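
For what it is worth, the same condition can be checked from the Python side before enabling the FMHA flags (a minimal sketch; fmha_supported is an illustrative name, not part of the benchmark or Apex):

import torch

def fmha_supported(device_index: int = 0) -> bool:
    # The Apex FMHA kernels above require compute capability 8.0 (sm_80, i.e. GA100/A100).
    major, minor = torch.cuda.get_device_capability(device_index)
    return (major, minor) == (8, 0)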

There has been discussion about relaxing the check, but it didn't go through. I have tried disabling the check and rebuilding the package from source; however, it seems the package only supports sm80, as shown in their src folder. This is the error I encountered after manually disabling the dprops check:

CUDA error (apex/contrib/csrc/fmha/src/fmha_dgrad_fp16_512_64_kernel.sm80.cu:97): invalid argument
seryilmaz commented 2 years ago

The error is coming from a kernel used in the FMHA module in Apex. That kernel in particular is intended to run on the GA100 architecture, whereas A5000 cards use the GA102 architecture. GA102 has less shared memory than GA100, and less than the amount required by that kernel. That is why you are getting the last error you posted: the code requests a maximum size (in bytes) of dynamically allocated shared memory that is not supported by the architecture you are running on. You can disable the use of the Apex FMHA module in the code by removing --pad_fmha or --unpad_fmha (whichever you currently pass as an argument). A different code path will then be used for multi-head attention; it is not as performant, but it should run fine on your card. With this change, you will also need to comment out this section of the code, or place it under a guard like if args.pad_fmha or args.unpad_fmha: as sketched below.
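
A minimal sketch of that guard (the wrapper function name is ours, and the body is only a placeholder for the linked section of run_pretraining.py):

def maybe_run_fmha_only_setup(args):
    # Illustrative only: "args" is the parsed argparse Namespace used by run_pretraining.py;
    # the ellipsis stands in for the FMHA-only section referenced above.
    if args.pad_fmha or args.unpad_fmha:
        ...  # FMHA-only setup from the linked section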

Please let us know if you have more issues running the code.

elimkwan commented 2 years ago

Many thanks for the reply, @seryilmaz; your advice was useful in resolving the error mentioned above. However, another error emerges after that. Do you have any insight into this one (CUDA error: invalid configuration argument)?

Error Log:

:::MLLOG {"namespace": "", "time_ms": 1652282257010, "event_type": "INTERVAL_END", "key": "init_stop", "value": null, "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1370}}
:::MLLOG {"namespace": "", "time_ms": 1652282257026, "event_type": "INTERVAL_START", "key": "run_start", "value": null, "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1371}}
:::MLLOG {"namespace": "", "time_ms": 1652282257108, "event_type": "INTERVAL_START", "key": "epoch_start", "value": null, "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1382, "epoch_num": 1}}
:::MLLOG {"namespace": "", "time_ms": 1652282257109, "event_type": "INTERVAL_START", "key": "block_start", "value": null, "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1384, "first_epoch_num": 1, "epoch_count": 1}}
parsed args:
Namespace(allreduce_post_accumulation=True, allreduce_post_accumulation_fp16=True, bert_config_path='/workspace/phase1/bert_config.json', bert_model='bert-large-uncased', bypass_amp=False, cache_eval_data=True, checkpoint_activations=False, cuda_graph_mode='segmented', ddp_type='apex', dense_seq_output=True, device=device(type='cuda', index=0), disable_apex_softmax=False, disable_fuse_mask=False, disable_fuse_qkv=False, disable_fuse_scale=False, distributed_lamb=True, do_train=True, dwu_e5m2_allgather=False, dwu_group_size=0, dwu_num_ag_pg=1, dwu_num_ar_pg=1, dwu_num_blocks=1, dwu_num_chunks=1, dwu_num_rs_pg=1, dwu_overlap_reductions=False, enable_fuse_dropout=False, enable_stream=False, eval_batch_size=16, eval_dir='/workspace/evaldata', eval_iter_samples=150000, eval_iter_start_samples=150000, exchange_padding=True, fp16=True, fused_bias_fc=True, fused_bias_mha=True, fused_dropout_add=True, fused_gelu_bias=False, fused_mha=False, gradient_accumulation_steps=1, init_checkpoint='/workspace/phase1/model.ckpt-28252.pt', init_tf_checkpoint=None, input_dir='/workspace/data_phase2', keep_n_most_recent_checkpoints=20, learning_rate=0.00035, local_rank=0, log_freq=0.0, loss_scale=0.0, max_iterations_per_graph=4, max_predictions_per_seq=76, max_samples_termination=4500000.0, max_seq_length=512, max_steps=7100.0, min_samples_to_start_checkpoints=3000000, n_gpu=2, num_epochs_to_generate_seeds_for=2, num_eval_examples=10000, num_samples_per_checkpoint=500000, opt_lamb_beta_1=0.9, opt_lamb_beta_2=0.999, output_dir='/results', pad=False, pad_fmha=False, phase2=True, resume_from_checkpoint=False, resume_step=0, seed=7023, skip_checkpoint=True, start_warmup_step=0.0, target_mlm_accuracy=0.72, train_batch_size=16, train_mlm_accuracy_window_size=0, unpad=True, unpad_fmha=False, use_cuda_graph=False, use_ddp=False, use_env=False, use_gradient_as_bucket_view=False, warmup_proportion=0.0, warmup_steps=0.0, weight_decay_rate=0.01)
epoch: 1
/workspace/bert/model/layers/activations.py:98: UserWarning: FALLBACK path has been taken. This is an indication that codegenFailed for some reason. To debug try disable codegen fallback pathvia setting the env variable`export PYTORCH_NVFUSER_DISABLE_FALLBACK=1` (Triggered internally at  /opt/pytorch/pytorch/torch/csrc/jit/codegen/cuda/manager.cpp:305.)
  return gelu_fwd(input)
/workspace/bert/model/layers/activations.py:98: UserWarning: FALLBACK path has been taken. This is an indication that codegenFailed for some reason. To debug try disable codegen fallback pathvia setting the env variable`export PYTORCH_NVFUSER_DISABLE_FALLBACK=1` (Triggered internally at  /opt/pytorch/pytorch/torch/csrc/jit/codegen/cuda/manager.cpp:305.)
  return gelu_fwd(input)
Traceback (most recent call last):
  File "/workspace/bert/run_pretraining.py", line 1744, in <module>
    args, final_loss, train_time_raw = main()
  File "/workspace/bert/run_pretraining.py", line 1511, in main
    loss, mlm_acc, sbridge = fwd_loss_bwd_trainer.step(step,
  File "/workspace/bert/fwd_loss_bwd_trainer.py", line 149, in step
    loss, mlm_acc, _ = model(*batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 1152, in forward
    return self.heads_only_segment(sequence_output, pooled_output, masked_lm_labels, next_sentence_label)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 1107, in forward
    prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output, masked_lm_labels)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 792, in forward
    prediction_scores = self.predictions(sequence_output)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 753, in forward
    hidden_states = self.transform(hidden_states)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 735, in forward
    hidden_states = self.LayerNorm(hidden_states)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/apex/contrib/layer_norm/layer_norm.py", line 44, in forward
    return FastLayerNormFN.apply(x, self.weight, self.bias, self.epsilon)
  File "/opt/conda/lib/python3.8/site-packages/apex/contrib/layer_norm/layer_norm.py", line 14, in forward
    ymat, mu, rsigma = fast_layer_norm.ln_fwd(xmat, gamma, beta, epsilon)
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
  File "/workspace/bert/run_pretraining.py", line 1744, in <module>
    args, final_loss, train_time_raw = main()
  File "/workspace/bert/run_pretraining.py", line 1511, in main
    loss, mlm_acc, sbridge = fwd_loss_bwd_trainer.step(step,
  File "/workspace/bert/fwd_loss_bwd_trainer.py", line 149, in step
    loss, mlm_acc, _ = model(*batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 1152, in forward
    return self.heads_only_segment(sequence_output, pooled_output, masked_lm_labels, next_sentence_label)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 1107, in forward
    prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output, masked_lm_labels)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 792, in forward
    prediction_scores = self.predictions(sequence_output)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 753, in forward
    hidden_states = self.transform(hidden_states)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 735, in forward
    hidden_states = self.LayerNorm(hidden_states)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/apex/contrib/layer_norm/layer_norm.py", line 44, in forward
    return FastLayerNormFN.apply(x, self.weight, self.bias, self.epsilon)
  File "/opt/conda/lib/python3.8/site-packages/apex/contrib/layer_norm/layer_norm.py", line 14, in forward
    ymat, mu, rsigma = fast_layer_norm.ln_fwd(xmat, gamma, beta, epsilon)
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ENDING TIMING RUN AT 2022-05-11 03:17:39 PM
RESULT,bert,7023,17,,2022-05-11 03:17:22 PM
hanyunfan commented 2 years ago

Just a side note: reproducibility only applies with the same hardware, and you have a different configuration, e.g. GPU: 2x NVIDIA RTX A5000 vs. 4x A100. I don't think the A5000 is really supported by the code or has ever been verified by anyone.

seryilmaz commented 2 years ago

I am not able to reproduce this error with the container used in v1.1. @elimkwan, can you try running with CUDA_LAUNCH_BLOCKING=1 to get a better idea of where the error might be coming from? As a side note, I would use a smaller batch size; batch 64 might be too large for the A5000, which only has 24 GB. I recommend starting with batch 16.

arjunsuresh commented 1 year ago

I was trying the BERT implementation on an RTX 4090 and, after following the suggestions here, I'm able to get a run going, but the accuracy of the model is not improving :) Is it possible to train BERT on a single RTX 4090?

:::MLLOG {"namespace": "", "time_ms": 1684433934190, "event_type": "POINT_IN_TIME", "key": "data_file", "value": "/workspace/data_phase2/part_04253_of_04320.hdf", "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1569}}
:::MLLOG {"namespace": "", "time_ms": 1684436040878, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.34087124466896057, "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1722, "epoch_num": 150848}}
{'global_steps': 2344, 'eval_loss': 4.732451915740967, 'eval_mlm_accuracy': 0.34087124466896057}

:::MLLOG {"namespace": "", "time_ms": 1684438136054, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.34087124466896057, "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1722, "epoch_num": 301056}}
{'global_steps': 4688, 'eval_loss': 4.732451915740967, 'eval_mlm_accuracy': 0.34087124466896057}
arjunsuresh commented 1 year ago

This suggestion fixed the issue for us. We are seeing sensible output now:

:::MLLOG {"namespace": "", "time_ms": 1684447292996, "event_type": "POINT_IN_TIME", "key": "data_file", "value": "/workspace/data_phase2/part_03578_of_04320.hdf", "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1569}}
:::MLLOG {"namespace": "", "time_ms": 1684449399584, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.3959695100784302, "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1722, "epoch_num": 150656}}
{'global_steps': 2344, 'eval_loss': 4.141561508178711, 'eval_mlm_accuracy': 0.3959695100784302}
xihajun commented 1 year ago

Hello @arjunsuresh, thank you for bringing that up. Would it be possible to know which specific base image you used when building the Docker image for the pytorch-preview implementation here: https://github.com/mlcommons/training_results_v2.1/blob/158189d4cbfbee366c10da1f0f086c85d8f15b5f/NVIDIA/benchmarks/bert/implementations/pytorch-preview/Dockerfile#L17

It looks like some of the information has been removed from the Dockerfile.

arjunsuresh commented 1 year ago

You're welcome @xihajun. Actually, I'm using this Dockerfile. I changed this assignment to false.