mlcommons / training_results_v0.7

This repository contains the results and code for the MLPerf™ Training v0.7 benchmark.
https://mlcommons.org/en/training-normal-07/
Apache License 2.0

BERT-large unpadded workload "NameError: name 'InitMHACUDAExtension' is not defined" #2

Open jackieong opened 3 years ago

jackieong commented 3 years ago

We are trying to run training for the BERT-large topology, unpadded. We set up an nvidia-docker container to run the training workload, but the unpadded run fails with an error. Here is an excerpt from the terminal output. The padded workload runs to completion; its terminal output is in the comment below.

"Namespace(allreduce_post_accumulation=True, allreduce_post_accumulation_fp16=True, bert_config_path='data/uncased_L-24_H-1024_A-16/bert_config.json', bert_model='bert-large-uncased', cache_eval_data=False, checkpoint_activations=False, dense_seq_output=True, disable_apex_softmax=False, disable_fuse_mask=False, disable_fuse_qkv=False, disable_fuse_scale=False, do_train=True, enable_fuse_dropout=True, enable_stream=False, eval_batch_size=128, eval_dir=None, eval_iter_samples=-1, eval_iter_start_samples=3000000, fp16=True, fused_gelu_bias=True, fused_mha=True, gradient_accumulation_steps=1, init_checkpoint='bert_large.pt', init_tf_checkpoint=None, input_dir='./data/hdf5/', keep_n_most_recent_checkpoints=20, learning_rate=0.0004, local_rank=-1, log_freq=1.0, loss_scale=0.0, max_predictions_per_seq=76, max_samples_termination=4500000.0, max_seq_length=512, max_steps=300.0, min_samples_to_start_checkpoints=3000000, n_gpu=1, num_epochs_to_generate_seeds_for=2, num_eval_examples=10000, num_samples_per_checkpoint=500000, opt_lamb_beta_1=0.9, opt_lamb_beta_2=0.999, output_dir='/results', pad=False, phase2=True, resume_from_checkpoint=False, seed=10483, skip_checkpoint=True, target_mlm_accuracy=0.712, train_batch_size=1, train_mlm_accuracy_window_size=0, unpad=True, use_env=False, warmup_proportion=0.0) :::MLLOG {"namespace": "", "time_ms": 1594948688327, "event_type": "POINT_IN_TIME", "key": "opt_base_learning_rate", "value": 0.0004, "metadata": {"file": "run_pretraining.py", "lineno": 524}} :::MLLOG {"namespace": "", "time_ms": 1594948688329, "event_type": "POINT_IN_TIME", "key": "opt_epsilon", "value": 1e-06, "metadata": {"file": "run_pretraining.py", "lineno": 529}} :::MLLOG {"namespace": "", "time_ms": 1594948688329, "event_type": "POINT_IN_TIME", "key": "opt_lamb_beta_1", "value": 0.9, "metadata": {"file": "run_pretraining.py", "lineno": 531}} :::MLLOG {"namespace": "", "time_ms": 1594948688329, "event_type": "POINT_IN_TIME", "key": "opt_lamb_beta_2", "value": 0.999, "metadata": {"file": "run_pretraining.py", "lineno": 532}} :::MLLOG {"namespace": "", "time_ms": 1594948688329, "event_type": "POINT_IN_TIME", "key": "opt_lamb_weight_decay_rate", "value": 0.01, "metadata": {"file": "run_pretraining.py", "lineno": 535}} :::MLLOG {"namespace": "", "time_ms": 1594948688330, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_warmup_steps", "value": 0, "metadata": {"file": ".../benchmarks/bert/implementations/pytorch/schedulers.py", "lineno": 85}} :::MLLOG {"namespace": "", "time_ms": 1594948688330, "event_type": "POINT_IN_TIME", "key": "opt_lamb_learning_rate_decay_poly_power", "value": 1.0, "metadata": {"file": ".../benchmarks/bert/implementations/pytorch/schedulers.py", "lineno": 86}} :::MLLOG {"namespace": "", "time_ms": 1594948688330, "event_type": "POINT_IN_TIME", "key": "start_warmup_step", "value": 0, "metadata": {"file": "run_pretraining.py", "lineno": 543}} Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
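For reference, the "Selected optimization level O2" banner and the defaults above are what apex's amp.initialize prints; a minimal sketch of an equivalent call (the model and optimizer below are placeholders, not the repository's actual objects):

import torch
from apex import amp

# Placeholder model/optimizer purely to show the call; opt_level="O2" with
# dynamic loss scaling matches the defaults reported in the log above.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=4.0e-4)
model, optimizer = amp.initialize(model, optimizer, opt_level="O2", loss_scale="dynamic")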

Traceback (most recent call last):
  File "run_pretraining.py", line 995, in <module>
    args, final_loss, train_time_raw = main()
  File "run_pretraining.py", line 712, in main
    InitMHACUDAExtension()
NameError: name 'InitMHACUDAExtension' is not defined
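A NameError like this is typical of an optional CUDA extension whose import is guarded by try/except: if the extension was not built into the image, the guarded definition never runs, and only the code path that calls the function trips over the missing name. Below is a hypothetical illustration of that pattern, not the repository's actual import block (the module and init call names are assumptions), which would also explain why the padded run completes while the unpadded run fails immediately:

# Hypothetical sketch of the failure mode, not code from run_pretraining.py.
try:
    import mhalib  # assumed name for a compiled fused-MHA CUDA extension

    def InitMHACUDAExtension():
        mhalib.init()  # assumed init entry point
except ImportError:
    pass  # if the extension build failed, the name is silently left undefined

def main(unpad=True):
    if unpad:
        InitMHACUDAExtension()  # only the unpadded path reaches this, hence the NameError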

jackieong commented 3 years ago

Terminal output for the padded workload that fully completes.

:::MLLOG {"namespace": "", "time_ms": 1595939600184, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "run_pretraining.py", "lineno": 671}}
device: cuda n_gpu: 1, distributed training: False, 16-bits training: True
:::MLLOG {"namespace": "", "time_ms": 1595939600341, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "bert", "metadata": {"file": "/workspace/bert/implementations/pytorch/mlperf_logger.py", "lineno": 68}}
:::MLLOG {"namespace": "", "time_ms": 1595939600342, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "NVIDIA", "metadata": {"file": "/workspace/bert/implementations/pytorch/mlperf_logger.py", "lineno": 73}}
:::MLLOG {"namespace": "", "time_ms": 1595939600342, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "closed", "metadata": {"file": "/workspace/bert/implementations/pytorch/mlperf_logger.py", "lineno": 77}}
:::MLLOG {"namespace": "", "time_ms": 1595939600342, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "onprem", "metadata": {"file": "/workspace/bert/implementations/pytorch/mlperf_logger.py", "lineno": 81}}
:::MLLOG {"namespace": "", "time_ms": 1595939600342, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "1xSUBMISSION_PLATFORM_PLACEHOLDER", "metadata": {"file": "/workspace/bert/implementations/pytorch/mlperf_logger.py", "lineno": 85}}

Torch distributed is available.

Torch distributed is not initialized.

:::MLLOG {"namespace": "", "time_ms": 1595939600343, "event_type": "POINT_IN_TIME", "key": "seed", "value": 10483, "metadata": {"file": "run_pretraining.py", "lineno": 690}} :::MLLOG {"namespace": "", "time_ms": 1595939600343, "event_type": "POINT_IN_TIME", "key": "global_batch_size", "value": 8, "metadata": {"file": "run_pretraining.py", "lineno": 692}} :::MLLOG {"namespace": "", "time_ms": 1595939600343, "event_type": "POINT_IN_TIME", "key": "opt_gradient_accumulation_steps", "value": 1, "metadata": {"file": "run_pretraining.py", "lineno": 694}} :::MLLOG {"namespace": "", "time_ms": 1595939600344, "event_type": "POINT_IN_TIME", "key": "max_predictions_per_seq", "value": 76, "metadata": {"file": "run_pretraining.py", "lineno": 696}} :::MLLOG {"namespace": "", "time_ms": 1595939600344, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_training_steps", "value": 300.0, "metadata": {"file": "run_pretraining.py", "lineno": 698}} :::MLLOG {"namespace": "", "time_ms": 1595939600344, "event_type": "POINT_IN_TIME", "key": "num_warmup_steps", "value": 0, "metadata": {"file": "run_pretraining.py", "lineno": 701}} parsed args:

Namespace(allreduce_post_accumulation=True, allreduce_post_accumulation_fp16=True, bert_config_path='data/uncased_L-24_H-1024_A-16/bert_config.json', bert_model='bert-large-uncased', cache_eval_data=False, checkpoint_activations=False, dense_seq_output=True, disable_apex_softmax=False, disable_fuse_mask=False, disable_fuse_qkv=False, disable_fuse_scale=False, do_train=True, enable_fuse_dropout=True, enable_stream=False, eval_batch_size=128, eval_dir=None, eval_iter_samples=-1, eval_iter_start_samples=3000000, fp16=True, fused_gelu_bias=True, fused_mha=True, gradient_accumulation_steps=1, init_checkpoint='bert_large.pt', init_tf_checkpoint=None, input_dir='./data/hdf5/', keep_n_most_recent_checkpoints=20, learning_rate=0.0004, local_rank=-1, log_freq=1.0, loss_scale=0.0, max_predictions_per_seq=76, max_samples_termination=4500000.0, max_seq_length=512, max_steps=300.0, min_samples_to_start_checkpoints=3000000, n_gpu=1, num_epochs_to_generate_seeds_for=2, num_eval_examples=10000, num_samples_per_checkpoint=500000, opt_lamb_beta_1=0.9, opt_lamb_beta_2=0.999, output_dir='/results', pad=True, phase2=True, resume_from_checkpoint=False, seed=10483, skip_checkpoint=True, target_mlm_accuracy=0.712, train_batch_size=8, train_mlm_accuracy_window_size=0, unpad=False, use_env=False, warmup_proportion=0.0)

:::MLLOG {"namespace": "", "time_ms": 1595939612139, "event_type": "POINT_IN_TIME", "key": "opt_base_learning_rate", "value": 0.0004, "metadata": {"file": "run_pretraining.py", "lineno": 524}} :::MLLOG {"namespace": "", "time_ms": 1595939612142, "event_type": "POINT_IN_TIME", "key": "opt_epsilon", "value": 1e-06, "metadata": {"file": "run_pretraining.py", "lineno": 529}} :::MLLOG {"namespace": "", "time_ms": 1595939612142, "event_type": "POINT_IN_TIME", "key": "opt_lamb_beta_1", "value": 0.9, "metadata": {"file": "run_pretraining.py", "lineno": 531}} :::MLLOG {"namespace": "", "time_ms": 1595939612143, "event_type": "POINT_IN_TIME", "key": "opt_lamb_beta_2", "value": 0.999, "metadata": {"file": "run_pretraining.py", "lineno": 532}} :::MLLOG {"namespace": "", "time_ms": 1595939612143, "event_type": "POINT_IN_TIME", "key": "opt_lamb_weight_decay_rate", "value": 0.01, "metadata": {"file": "run_pretraining.py", "lineno": 535}} :::MLLOG {"namespace": "", "time_ms": 1595939612143, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_warmup_steps", "value": 0, "metadata": {"file": "/workspace/bert/implementations/pytorch/schedulers.py", "lineno": 85}} :::MLLOG {"namespace": "", "time_ms": 1595939612144, "event_type": "POINT_IN_TIME", "key": "opt_lamb_learning_rate_decay_poly_power", "value": 1.0, "metadata": {"file": "/workspace/bert/implementations/pytorch/schedulers.py", "lineno": 86}} :::MLLOG {"namespace": "", "time_ms": 1595939612144, "event_type": "POINT_IN_TIME", "key": "start_warmup_step", "value": 0, "metadata": {"file": "run_pretraining.py", "lineno": 543}} Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic

:::MLLOG {"namespace": "", "time_ms": 1595939612188, "event_type": "INTERVAL_END", "key": "init_stop", "value": null, "metadata": {"file": "run_pretraining.py", "lineno": 750}} :::MLLOG {"namespace": "", "time_ms": 1595939612188, "event_type": "INTERVAL_START", "key": "run_start", "value": null, "metadata": {"file": "run_pretraining.py", "lineno": 751}} :::MLLOG {"namespace": "", "time_ms": 1595939612188, "event_type": "INTERVAL_START", "key": "epoch_start", "value": null, "metadata": {"file": "run_pretraining.py", "lineno": 760, "epoch_num": 1}} :::MLLOG {"namespace": "", "time_ms": 1595939612188, "event_type": "INTERVAL_START", "key": "block_start", "value": null, "metadata": {"file": "run_pretraining.py", "lineno": 764, "first_epoch_num": 1, "epoch_count": 1}} parsed args:

Namespace(allreduce_post_accumulation=True, allreduce_post_accumulation_fp16=True, bert_config_path='data/uncased_L-24_H-1024_A-16/bert_config.json', bert_model='bert-large-uncased', cache_eval_data=False, checkpoint_activations=False, dense_seq_output=True, disable_apex_softmax=False, disable_fuse_mask=False, disable_fuse_qkv=False, disable_fuse_scale=False, do_train=True, enable_fuse_dropout=True, enable_stream=False, eval_batch_size=128, eval_dir=None, eval_iter_samples=-1, eval_iter_start_samples=3000000, fp16=True, fused_gelu_bias=True, fused_mha=True, gradient_accumulation_steps=1, init_checkpoint='bert_large.pt', init_tf_checkpoint=None, input_dir='./data/hdf5/', keep_n_most_recent_checkpoints=20, learning_rate=0.0004, local_rank=-1, log_freq=1.0, loss_scale=0.0, max_predictions_per_seq=76, max_samples_termination=4500000.0, max_seq_length=512, max_steps=300.0, min_samples_to_start_checkpoints=3000000, n_gpu=1, num_epochs_to_generate_seeds_for=2, num_eval_examples=10000, num_samples_per_checkpoint=500000, opt_lamb_beta_1=0.9, opt_lamb_beta_2=0.999, output_dir='/results', pad=True, phase2=True, resume_from_checkpoint=False, resume_step=0, seed=10483, skip_checkpoint=True, target_mlm_accuracy=0.712, train_batch_size=8, train_mlm_accuracy_window_size=0, unpad=False, use_env=False, warmup_proportion=0.0)

epoch: 1 {'training_steps': 1, 'average_loss': 4.625, 'step_loss': 4.625, 'learning_rate': 0.0003986666666666667, 'seq/s': 0.4327840547969736, 'global_steps': 0, 'samples_trained': 0, 'skipped_steps': 1, 'timestamp': 1595939630.674186}

{'training_steps': 2, 'average_loss': 5.734375, 'step_loss': 5.734375, 'learning_rate': 0.0003986666666666667, 'seq/s': 33.1319008682275, 'global_steps': 0, 'samples_trained': 0, 'skipped_steps': 2, 'timestamp': 1595939630.9156468}

{'training_steps': 3, 'average_loss': 4.953125, 'step_loss': 4.953125, 'learning_rate': 0.0003986666666666667, 'seq/s': 33.377398279525394, 'global_steps': 0, 'samples_trained': 0, 'skipped_steps': 3, 'timestamp': 1595939631.1553311}

[... training continues and completes to the end]
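As an aside, the :::MLLOG records above are single-line JSON after a fixed prefix, so they are easy to pull out of a captured console log; a small sketch, assuming the output was saved to a file (the filename below is a placeholder):

import json

# Placeholder path; point this at the file the console output was captured to.
with open("bert_padded_console.log") as f:
    records = [json.loads(line.split(":::MLLOG", 1)[1])
               for line in f if ":::MLLOG" in line]

# Example: list every logged key with its value.
for rec in records:
    print(rec["key"], rec["value"])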

Below is the run command from our shell script; the flags at the end explicitly select whether to "pad" or "unpad" the BERT-large training.

#nvprof --print-gpu-trace --profile-from-start off --log-file bert.nvlog \
python -u run_pretraining.py \
  --train_batch_size=8 \
  --learning_rate=4.0e-4 \
  --opt_lamb_beta_1=0.9 \
  --opt_lamb_beta_2=0.999 \
  --warmup_proportion=0.0 \
  --max_steps=300 \
  --phase2 \
  --max_seq_length=512 \
  --max_predictions_per_seq=76 \
  --input_dir=./data/hdf5/ \
  --init_checkpoint=bert_large.pt \
  --do_train \
  --skip_checkpoint \
  --train_mlm_accuracy_window_size=0 \
  --target_mlm_accuracy=0.712 \
  --max_samples_termination=4500000 \
  --output_dir=/results \
  --fp16 \
  --fused_gelu_bias \
  --dense_seq_output \
  --fused_mha \
  --allreduce_post_accumulation \
  --allreduce_post_accumulation_fp16 \
  --gradient_accumulation_steps=1 \
  --log_freq=1 \
  --bert_config_path=data/uncased_L-24_H-1024_A-16/bert_config.json \
  --seed=10483 \
  --enable_fuse_dropout \
  --pad \
  --unpad #\
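Before launching a full --unpad run, a quick in-container check can confirm whether the symbol from the traceback actually resolves. The helper below is only a suggestion, not part of the repository; run it from the directory that contains run_pretraining.py:

import importlib
import sys

MODULE = "run_pretraining"        # the script named in the traceback
SYMBOL = "InitMHACUDAExtension"   # the name reported as undefined

try:
    module = importlib.import_module(MODULE)
except Exception as exc:  # a failed nested import (e.g. a missing extension) also lands here
    sys.exit(f"could not import {MODULE}: {exc}")

if hasattr(module, SYMBOL):
    print(f"{SYMBOL} is defined; an --unpad run should get past this call.")
else:
    sys.exit(f"{SYMBOL} is not defined; the fused-MHA extension was probably not built into the image.")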

NettrixTobin commented 2 years ago

Hi, have you solved this problem?

PhdShi commented 1 year ago

Hi, have you solved this problem?

Hi, have you solved this problem?