Open elimkwan opened 2 years ago
I believe the issue is that you need to pack your input sequences together, i.e., instead of your input looking like (batch_size, max_seq_len, ...), it should be (batch_size * seq_len, ...), with seq_len variable per sequence. So you remove all the 0 padding tokens and concatenate all your sequences together. Then you can pass the sequence lengths in through cu_seqlens so the MHA knows where each sequence starts and stops.
See qkv and qkv_vs: https://github.com/NVIDIA/apex/blob/master/apex/contrib/test/fmha/test_fmha.py
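To make the packing idea concrete, here is a minimal pure-Python sketch. It is not the Apex API: `pack_sequences` is a hypothetical helper, and plain lists stand in for tensors. It only shows how the padding is dropped and how cumulative offsets of the kind passed as cu_seqlens can be derived from the per-sequence lengths.

```python
from itertools import accumulate


def pack_sequences(padded, seq_lens):
    """Drop padding and concatenate all sequences into one flat list.

    padded   : list of equal-length token lists (batch_size x max_seq_len)
    seq_lens : number of real (non-padding) tokens in each sequence
    Returns (packed, cu_seqlens): sequence i occupies
    packed[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    packed = []
    for row, n in zip(padded, seq_lens):
        packed.extend(row[:n])  # keep only the first n real tokens
    cu_seqlens = [0] + list(accumulate(seq_lens))  # cumulative offsets
    return packed, cu_seqlens


# Three sequences padded with 0 to max_seq_len = 5 (illustrative data).
padded = [
    [11, 12, 13, 0, 0],
    [21, 22, 0, 0, 0],
    [31, 32, 33, 34, 35],
]
packed, cu_seqlens = pack_sequences(padded, [3, 2, 5])
print(packed)      # [11, 12, 13, 21, 22, 31, 32, 33, 34, 35]
print(cu_seqlens)  # [0, 3, 5, 10]
```

The first dimension of the packed input is then the total number of real tokens (10 here) rather than batch_size * max_seq_len (15 here).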
Hi Eric, thanks for the prompt reply, but I believe the issue hasn't been resolved. The input has already been densely packed as
qkv.view(-1, 3, self.h, self.d)
I have checked the first dimension of the input; it is equal to batch_size * seq_len, and unfortunately the issue persists. I believe the dprops in the error message refers to the device properties, and the check requires the device's major and minor compute capability to be 8 and 0 respectively. According to the Apex source code:
auto dprops = at::cuda::getCurrentDeviceProperties();
TORCH_CHECK(dprops->major == 8 && dprops->minor == 0);
There has been discussion about relaxing the check, but the change didn't go through. I have tried disabling the check and rebuilding the package from source; however, it seems that the package only supports sm80, as shown in its src folder. This is the error I encountered after disabling the dprops check manually:
CUDA error (apex/contrib/csrc/fmha/src/fmha_dgrad_fp16_512_64_kernel.sm80.cu:97): invalid argument
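For readers who want to see what the quoted TORCH_CHECK amounts to, here is a small Python sketch of the same condition. `fmha_check_passes` is a hypothetical helper, not part of Apex. The capability pairs are hard-coded for illustration; with PyTorch installed, the pair for the current device could be read with `torch.cuda.get_device_capability()`.

```python
def fmha_check_passes(major, minor):
    """Same condition as the quoted TORCH_CHECK: only compute capability 8.0 passes."""
    return major == 8 and minor == 0


# Compute capabilities hard-coded for illustration:
# A100 (GA100) reports 8.0, RTX A5000 (GA102) reports 8.6.
devices = {"A100": (8, 0), "RTX A5000": (8, 6)}
results = {name: fmha_check_passes(*cc) for name, cc in devices.items()}
print(results)  # {'A100': True, 'RTX A5000': False}
```

This only reproduces the guard; as the error above suggests, bypassing the guard does not make the sm80 kernels run on other hardware.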
The error is coming from a kernel used in the FMHA module in Apex. That code in particular is intended to run on the GA100 architecture, whereas A5000 cards use the GA102 architecture. GA102 has less shared memory than GA100, and less than the amount of shared memory required by that kernel. That's why you are getting the last error you posted: the code requests a maximum size in bytes of dynamically allocated shared memory that is not supported by the architecture you are running on.
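As a rough illustration of the shared-memory limit, the sketch below compares a requested dynamic shared memory size against the per-block maximum for each architecture. The limits are the per-thread-block figures listed in the CUDA programming guide for compute capability 8.0 and 8.6. The requested size is a made-up number chosen for illustration, not the actual requirement of the FMHA kernel, and `smem_request_fits` is a hypothetical helper.

```python
KB = 1024

# Maximum shared memory a single thread block can opt in to, per the
# compute-capability table in the CUDA programming guide.
MAX_SMEM_PER_BLOCK = {
    (8, 0): 163 * KB,  # GA100, e.g. A100
    (8, 6): 99 * KB,   # GA102, e.g. RTX A5000
}


def smem_request_fits(capability, requested_bytes):
    """Return True if a kernel's dynamic shared memory request is within the limit."""
    return requested_bytes <= MAX_SMEM_PER_BLOCK[capability]


# Hypothetical request size, NOT the real requirement of the FMHA kernel.
requested = 128 * KB
print(smem_request_fits((8, 0), requested))  # True
print(smem_request_fits((8, 6), requested))  # False
```

A request that fits on GA100 but exceeds the GA102 limit is rejected when the kernel is configured, which is consistent with the "invalid argument" error above.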
You can disable the usage of Apex FMHA module in the code by removing --pad_fmha or --unpad_fmha (whichever you currently have as an argument). Then, a different code path will be used for multi-head attention, which is not as performant but should run fine on your card.
With this change, you will also need to comment out this section of the code (or place it under a guard like if args.pad_fmha or args.unpad_fmha:).
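A minimal sketch of such a guard is shown below. The flag names come from this thread, but declaring them as `store_true` options is an assumption, and `setup_fmha` is a hypothetical placeholder for the FMHA-only section rather than code from run_pretraining.py.

```python
import argparse

parser = argparse.ArgumentParser()
# Flag names taken from the thread; store_true is an assumption.
parser.add_argument("--pad_fmha", action="store_true")
parser.add_argument("--unpad_fmha", action="store_true")


def setup_fmha():
    # Hypothetical placeholder for the section that only makes sense with FMHA.
    return "fmha enabled"


def configure(argv):
    args = parser.parse_args(argv)
    # Run the FMHA-specific section only when one of the flags is set.
    if args.pad_fmha or args.unpad_fmha:
        return setup_fmha()
    return "default attention path"


print(configure([]))                # default attention path
print(configure(["--unpad_fmha"]))  # fmha enabled
```

With the guard in place, dropping both flags from the command line skips the FMHA-specific section without further code changes.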
Please let us know if you have more issues running the code.
Many thanks for the reply @seryilmaz, your advice was useful in resolving the error mentioned above. However, another error emerged after that (CUDA error: invalid configuration argument). Do you have any insight on this?
Error Log:
:::MLLOG {"namespace": "", "time_ms": 1652282257010, "event_type": "INTERVAL_END", "key": "init_stop", "value": null, "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1370}}
:::MLLOG {"namespace": "", "time_ms": 1652282257026, "event_type": "INTERVAL_START", "key": "run_start", "value": null, "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1371}}
:::MLLOG {"namespace": "", "time_ms": 1652282257108, "event_type": "INTERVAL_START", "key": "epoch_start", "value": null, "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1382, "epoch_num": 1}}
:::MLLOG {"namespace": "", "time_ms": 1652282257109, "event_type": "INTERVAL_START", "key": "block_start", "value": null, "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1384, "first_epoch_num": 1, "epoch_count": 1}}
parsed args:
Namespace(allreduce_post_accumulation=True, allreduce_post_accumulation_fp16=True, bert_config_path='/workspace/phase1/bert_config.json', bert_model='bert-large-uncased', bypass_amp=False, cache_eval_data=True, checkpoint_activations=False, cuda_graph_mode='segmented', ddp_type='apex', dense_seq_output=True, device=device(type='cuda', index=0), disable_apex_softmax=False, disable_fuse_mask=False, disable_fuse_qkv=False, disable_fuse_scale=False, distributed_lamb=True, do_train=True, dwu_e5m2_allgather=False, dwu_group_size=0, dwu_num_ag_pg=1, dwu_num_ar_pg=1, dwu_num_blocks=1, dwu_num_chunks=1, dwu_num_rs_pg=1, dwu_overlap_reductions=False, enable_fuse_dropout=False, enable_stream=False, eval_batch_size=16, eval_dir='/workspace/evaldata', eval_iter_samples=150000, eval_iter_start_samples=150000, exchange_padding=True, fp16=True, fused_bias_fc=True, fused_bias_mha=True, fused_dropout_add=True, fused_gelu_bias=False, fused_mha=False, gradient_accumulation_steps=1, init_checkpoint='/workspace/phase1/model.ckpt-28252.pt', init_tf_checkpoint=None, input_dir='/workspace/data_phase2', keep_n_most_recent_checkpoints=20, learning_rate=0.00035, local_rank=0, log_freq=0.0, loss_scale=0.0, max_iterations_per_graph=4, max_predictions_per_seq=76, max_samples_termination=4500000.0, max_seq_length=512, max_steps=7100.0, min_samples_to_start_checkpoints=3000000, n_gpu=2, num_epochs_to_generate_seeds_for=2, num_eval_examples=10000, num_samples_per_checkpoint=500000, opt_lamb_beta_1=0.9, opt_lamb_beta_2=0.999, output_dir='/results', pad=False, pad_fmha=False, phase2=True, resume_from_checkpoint=False, resume_step=0, seed=7023, skip_checkpoint=True, start_warmup_step=0.0, target_mlm_accuracy=0.72, train_batch_size=16, train_mlm_accuracy_window_size=0, unpad=True, unpad_fmha=False, use_cuda_graph=False, use_ddp=False, use_env=False, use_gradient_as_bucket_view=False, warmup_proportion=0.0, warmup_steps=0.0, weight_decay_rate=0.01)
epoch: 1
/workspace/bert/model/layers/activations.py:98: UserWarning: FALLBACK path has been taken. This is an indication that codegenFailed for some reason. To debug try disable codegen fallback pathvia setting the env variable`export PYTORCH_NVFUSER_DISABLE_FALLBACK=1` (Triggered internally at /opt/pytorch/pytorch/torch/csrc/jit/codegen/cuda/manager.cpp:305.)
return gelu_fwd(input)
/workspace/bert/model/layers/activations.py:98: UserWarning: FALLBACK path has been taken. This is an indication that codegenFailed for some reason. To debug try disable codegen fallback pathvia setting the env variable`export PYTORCH_NVFUSER_DISABLE_FALLBACK=1` (Triggered internally at /opt/pytorch/pytorch/torch/csrc/jit/codegen/cuda/manager.cpp:305.)
return gelu_fwd(input)
Traceback (most recent call last):
File "/workspace/bert/run_pretraining.py", line 1744, in <module>
args, final_loss, train_time_raw = main()
File "/workspace/bert/run_pretraining.py", line 1511, in main
loss, mlm_acc, sbridge = fwd_loss_bwd_trainer.step(step,
File "/workspace/bert/fwd_loss_bwd_trainer.py", line 149, in step
loss, mlm_acc, _ = model(*batch)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/bert/modeling.py", line 1152, in forward
return self.heads_only_segment(sequence_output, pooled_output, masked_lm_labels, next_sentence_label)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/bert/modeling.py", line 1107, in forward
prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output, masked_lm_labels)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/bert/modeling.py", line 792, in forward
prediction_scores = self.predictions(sequence_output)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/bert/modeling.py", line 753, in forward
hidden_states = self.transform(hidden_states)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/bert/modeling.py", line 735, in forward
hidden_states = self.LayerNorm(hidden_states)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/apex/contrib/layer_norm/layer_norm.py", line 44, in forward
return FastLayerNormFN.apply(x, self.weight, self.bias, self.epsilon)
File "/opt/conda/lib/python3.8/site-packages/apex/contrib/layer_norm/layer_norm.py", line 14, in forward
ymat, mu, rsigma = fast_layer_norm.ln_fwd(xmat, gamma, beta, epsilon)
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "/workspace/bert/run_pretraining.py", line 1744, in <module>
args, final_loss, train_time_raw = main()
File "/workspace/bert/run_pretraining.py", line 1511, in main
loss, mlm_acc, sbridge = fwd_loss_bwd_trainer.step(step,
File "/workspace/bert/fwd_loss_bwd_trainer.py", line 149, in step
loss, mlm_acc, _ = model(*batch)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/bert/modeling.py", line 1152, in forward
return self.heads_only_segment(sequence_output, pooled_output, masked_lm_labels, next_sentence_label)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/bert/modeling.py", line 1107, in forward
prediction_scores, seq_relationship_score = self.cls(sequence_output, pooled_output, masked_lm_labels)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/bert/modeling.py", line 792, in forward
prediction_scores = self.predictions(sequence_output)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/bert/modeling.py", line 753, in forward
hidden_states = self.transform(hidden_states)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/bert/modeling.py", line 735, in forward
hidden_states = self.LayerNorm(hidden_states)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/apex/contrib/layer_norm/layer_norm.py", line 44, in forward
return FastLayerNormFN.apply(x, self.weight, self.bias, self.epsilon)
File "/opt/conda/lib/python3.8/site-packages/apex/contrib/layer_norm/layer_norm.py", line 14, in forward
ymat, mu, rsigma = fast_layer_norm.ln_fwd(xmat, gamma, beta, epsilon)
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ENDING TIMING RUN AT 2022-05-11 03:17:39 PM
RESULT,bert,7023,17,,2022-05-11 03:17:22 PM
Just a side note: reproducibility only applies on the same hardware. You have a different configuration, e.g. GPU: 2x NVIDIA RTX A5000 vs. 4x A100. I don't think the A5000 is really supported by the code or has ever been verified by anyone.
I am not able to reproduce this error with the container used in v1.1. @elimkwan can you try running with CUDA_LAUNCH_BLOCKING=1 to have a better idea about where the error might be coming from? As a side note, I would use a smaller batch size, batch 64 might be too large for A5000 which only has 24 GB. I recommend starting with batch 16.
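One way to apply CUDA_LAUNCH_BLOCKING=1 from Python is to set it in the environment of the process that launches training, so that it is in place before CUDA is initialized. The sketch below uses only the standard library. `blocking_env` is a hypothetical helper, and the child process is a stand-in that just echoes the variable; in practice it would be the training command.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Copy an environment and force synchronous CUDA kernel launches."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Stand-in child process; replace the command with the real training launch.
out = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"],
    env=blocking_env(),
    capture_output=True,
    text=True,
    check=True,
).stdout.strip()
print(out)  # 1
```

With launches made synchronous, the traceback should point at the kernel that actually failed rather than at a later API call, at the cost of slower execution.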
I was trying the BERT implementation on an RTX 4090, and after following the suggestions here I'm able to get a run going, but the accuracy of the model is not improving :) Is it possible to train BERT on a single RTX 4090?
:::MLLOG {"namespace": "", "time_ms": 1684433934190, "event_type": "POINT_IN_TIME", "key": "data_file", "value": "/workspace/data_phase2/part_04253_of_04320.hdf", "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1569}}
:::MLLOG {"namespace": "", "time_ms": 1684436040878, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.34087124466896057, "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1722, "epoch_num": 150848}}
{'global_steps': 2344, 'eval_loss': 4.732451915740967, 'eval_mlm_accuracy': 0.34087124466896057}
:::MLLOG {"namespace": "", "time_ms": 1684438136054, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.34087124466896057, "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1722, "epoch_num": 301056}}
{'global_steps': 4688, 'eval_loss': 4.732451915740967, 'eval_mlm_accuracy': 0.34087124466896057}
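The two evaluation points above report exactly the same eval_accuracy, which is what "not improving" looks like in the log. A small sketch for spotting this automatically follows. It assumes each MLLOG record is a single line starting with `:::MLLOG ` followed by JSON, as in the excerpts above. The helper names are hypothetical, and the sample records are trimmed copies of the lines above.

```python
import json

PREFIX = ":::MLLOG "


def eval_accuracies(lines):
    """Collect the eval_accuracy values from MLLOG lines, in order."""
    values = []
    for line in lines:
        if not line.startswith(PREFIX):
            continue  # skip ordinary stdout lines
        record = json.loads(line[len(PREFIX):])
        if record.get("key") == "eval_accuracy":
            values.append(record["value"])
    return values


def is_stalled(values):
    """True when there are at least two evaluations and all are identical."""
    return len(values) >= 2 and len(set(values)) == 1


# Trimmed copies of the records above (most fields dropped for brevity).
log = [
    ':::MLLOG {"key": "eval_accuracy", "value": 0.34087124466896057, "metadata": {"epoch_num": 150848}}',
    "{'global_steps': 2344, 'eval_loss': 4.732451915740967, 'eval_mlm_accuracy': 0.34087124466896057}",
    ':::MLLOG {"key": "eval_accuracy", "value": 0.34087124466896057, "metadata": {"epoch_num": 301056}}',
]
values = eval_accuracies(log)
print(is_stalled(values))  # True
```

Bit-identical accuracy and loss across evaluations suggests the weights are not being updated at all, rather than that training is merely slow.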
This suggestion fixed the issue for us. We are seeing sensible output now
:::MLLOG {"namespace": "", "time_ms": 1684447292996, "event_type": "POINT_IN_TIME", "key": "data_file", "value": "/workspace/data_phase2/part_03578_of_04320.hdf", "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1569}}
:::MLLOG {"namespace": "", "time_ms": 1684449399584, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 0.3959695100784302, "metadata": {"file": "/workspace/bert/run_pretraining.py", "lineno": 1722, "epoch_num": 150656}}
{'global_steps': 2344, 'eval_loss': 4.141561508178711, 'eval_mlm_accuracy': 0.3959695100784302}
Hello @arjunsuresh, thank you for bringing that up. Would it be possible to know which specific image you are using when building the Docker image for the pytorch-preview implementation here: https://github.com/mlcommons/training_results_v2.1/blob/158189d4cbfbee366c10da1f0f086c85d8f15b5f/NVIDIA/benchmarks/bert/implementations/pytorch-preview/Dockerfile#L17
It looks like some of the information has been removed from the Dockerfile.
You're welcome @xihajun. Actually, I'm using this Dockerfile. I changed this assignment to false.
We tried to follow the Dell example to reproduce the BERT Training Benchmark on a server with 2 GPUs. We encountered an error when running the model encoder layer, and it is related to the fmhalib.fwd function: Expected dprops->major == 8 && dprops->minor == 0 to be true, but got false.
The error happens in the last line:
It seems to be related to the error mentioned here, but I am not entirely sure about how to apply their fix (unpad the qkv).
System Used
CPU
GPU
System:
Error Reproduction
For reproducing the error, the following settings were used. We created two config files (config_SUT.sh, config_SUT_common.sh) and ran the code interactively within a docker container.
Configs in config_SUT.sh
Configs in config_SUT_common.sh
After creating the docker image mlperf-nvidia:language_model, enter the docker container with the following command:
Running the program:
Error Log: