mlcommons / training_results_v0.5

This repository contains the results and code for the MLPerf™ Training v0.5 benchmark.
https://mlcommons.org/en/training-normal-05/
Apache License 2.0

[single_stage_detector]RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input. (nhwc_bn_fwd_train_cudnn_impl at /tmp/pip-req-build-pq90lolk/csrc/nhwc/batch_norm.cu:109) #5

blackdan0716 closed this issue 5 years ago

blackdan0716 commented 5 years ago

When I run the single_stage_detector benchmark, it fails with the message below:

:::MLPv0.5.0 ssd 1548297287.858427048 (/workspace/single_stage_detector/ssd300.py:69) num_defaults_per_cell: [4, 6, 6, 6, 4, 4]
Traceback (most recent call last):
  File "train.py", line 710, in <module>
    main()
  File "train.py", line 703, in main
    success = train300_mlperf_coco(args)
  File "train.py", line 513, in train300_mlperf_coco
    ssd300.module = torch.jit.trace(module_to_jit, example_input)
  File "/opt/conda/lib/python3.6/site-packages/torch/jit/__init__.py", line 565, in trace
    module._create_method_from_trace('forward', func, example_inputs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 475, in __call__
    result = self._slow_forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 465, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 475, in __call__
    result = self._slow_forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 465, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/workspace/single_stage_detector/ssd300.py", line 184, in forward
    layers = self.model(data)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 475, in __call__
    result = self._slow_forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 465, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/workspace/single_stage_detector/base_model.py", line 99, in forward
    layer1_activation = self.layer1(data)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 475, in __call__
    result = self._slow_forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 465, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 475, in __call__
    result = self._slow_forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 465, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/workspace/single_stage_detector/nhwc/batch_norm.py", line 73, in forward
    self.eps, self.fuse_relu, self.training, z)
  File "/workspace/single_stage_detector/nhwc/batch_norm.py", line 36, in forward
    y, save_mean, save_var, reserve = C.bn_fwd_nhwc_cudnn(x, s, b, rm, riv, mom, epsilon, fuse_relu)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input. (nhwc_bn_fwd_train_cudnn_impl at /tmp/pip-req-build-pq90lolk/csrc/nhwc/batch_norm.cu:109)

Is this failure caused by a missing dataset or model file? I prepared the dataset in /data/coco2017/, but I did not include any related model file. Thanks.

blackdan0716 commented 5 years ago

I used the wrong GPU (P100) for this benchmark. When I switched to a V100, the problem was solved.
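For anyone hitting the same error, a pre-flight check along these lines might catch the mismatch before training starts. This is only a sketch: the threshold of compute capability 7.0 is an assumption inferred from this thread alone (P100 is 6.0 and failed, V100 is 7.0 and worked), not a documented requirement of the benchmark, and the names `ASSUMED_MIN_CAPABILITY`, `capability_ok`, and `check_current_gpu` are hypothetical helpers, not part of the repository.

```python
from typing import Optional, Tuple

# ASSUMPTION: compute capability 7.0 (Volta, e.g. V100) is used as the minimum
# only because a P100 (6.0) failed and a V100 (7.0) worked in this thread.
ASSUMED_MIN_CAPABILITY: Tuple[int, int] = (7, 0)


def capability_ok(capability: Tuple[int, int],
                  minimum: Tuple[int, int] = ASSUMED_MIN_CAPABILITY) -> bool:
    """Return True if (major, minor) is at least the assumed minimum."""
    return tuple(capability) >= tuple(minimum)


def check_current_gpu() -> Optional[str]:
    """Return a warning string if GPU 0 looks too old, otherwise None."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return "PyTorch is not installed; cannot query the GPU."
    if not torch.cuda.is_available():
        return "No CUDA device is visible."
    name = torch.cuda.get_device_name(0)
    capability = torch.cuda.get_device_capability(0)  # (major, minor)
    if not capability_ok(capability):
        return ("GPU '%s' has compute capability %d.%d, below the assumed "
                "minimum %d.%d." % ((name,) + tuple(capability)
                                    + ASSUMED_MIN_CAPABILITY))
    return None
```

Calling `check_current_gpu()` at the top of a launch script and printing any returned message would flag a P100 up front, instead of surfacing later as `CUDNN_STATUS_NOT_SUPPORTED` inside the NHWC batch norm kernel.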