mlcommons / inference_results_v0.5

This repository contains the results and code for the MLPerf™ Inference v0.5 benchmark.
https://mlcommons.org/en/inference-datacenter-05/
Apache License 2.0

calibration error #38

Closed alanshao023 closed 4 years ago

alanshao023 commented 4 years ago

The command I ran in a container:

```
make calibrate RUN_ARGS="--benchmarks=resnet"
```

The output:

```
Traceback (most recent call last):
  File "code/main.py", line 327, in <module>
    main()
  File "code/main.py", line 286, in main
    config_files = find_config_files(benchmarks, scenarios)
  File "/work/code/common/__init__.py", line 123, in find_config_files
    system = get_system_id()
  File "/work/code/common/__init__.py", line 102, in get_system_id
    import pycuda.driver
  File "/usr/local/lib/python3.6/dist-packages/pycuda/driver.py", line 5, in <module>
    from pycuda._driver import *  # noqa
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Makefile:309: recipe for target 'calibrate' failed
make: *** [calibrate] Error 1
```

In this container, when I run `nvidia-smi`, the output is "command not found".
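For reference, a quick way to tell whether a container actually has GPU access: both the `nvidia-smi` binary and `libcuda.so.1` are injected by the NVIDIA container runtime rather than shipped in the CUDA image, so both checks below should fail in a container started without GPU support (a sketch):

```
# Run inside the container: both should succeed when GPU access is enabled.
nvidia-smi                        # driver utility injected by the NVIDIA runtime
ldconfig -p | grep libcuda.so.1   # the shared library pycuda fails to load above
```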

What I have finished so far:

1) Build Docker Image
2) Build Source Codes (in the Docker container I generated, and the same for the steps below)
3) Download and Preprocess Datasets
4) Download Benchmark Models

I'm using CentOS 7.8, driver version 440.64.00, CUDA 10.2, and four NVIDIA T4 GPUs.

I'm new to Docker; any suggestion is appreciated.

nvpohanh commented 4 years ago

How did you run the Docker container? Did you use `nvidia-docker run ...` or `docker run --gpus=all ...`?
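For context, the `nvidia-docker` wrapper predates Docker's native GPU support, while `--gpus` requires Docker 19.03 or newer with the NVIDIA Container Toolkit installed. A minimal sanity check, using a public CUDA base image as an example:

```
# Either form should print the nvidia-smi table if GPU passthrough is set up
# (the image tag is only an example):
nvidia-docker run --rm nvidia/cuda:10.2-base nvidia-smi
docker run --rm --gpus=all nvidia/cuda:10.2-base nvidia-smi
```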

alanshao023 commented 4 years ago

I followed the README description, as below:

```
docker run -dt -e NVIDIA_VISIBLE_DEVICES=ALL -w /work \
    --security-opt apparmor=unconfined --security-opt seccomp=unconfined \
    -v $HOME:/mnt$HOME \
    --name mlperf-inference-<username> mlperf-inference:<username>-latest
```

nvpohanh commented 4 years ago

Could you try adding the `--gpus=all` flag?
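That is, a sketch of the same `docker run` command from the README with GPU access enabled (`<username>` is a placeholder, as above):

```
docker run -dt --gpus=all -e NVIDIA_VISIBLE_DEVICES=ALL -w /work \
    --security-opt apparmor=unconfined --security-opt seccomp=unconfined \
    -v $HOME:/mnt$HOME \
    --name mlperf-inference-<username> mlperf-inference:<username>-latest
```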

alanshao023 commented 4 years ago

Sure. After I did that, in the new container:

1) I ran `nvidia-smi` and the output was correct;
2) I ran `ls /usr/local/cuda/include | grep cuda.h` and the output was `cuda.h`.

Does this mean nvidia-docker is working correctly?

nvpohanh commented 4 years ago

Yes, I think so. Could you try the `make calibrate ...` command again to see if it works? Thanks

alanshao023 commented 4 years ago

> Yes, I think so. Could you try the `make calibrate ...` command again to see if it works? Thanks

I repeated the previous steps in the new container:

1) Build Source Codes (in the Docker container I generated, and the same for the steps below)
2) Download and Preprocess Datasets
3) Download Benchmark Models

It worked. After these steps, I was able to run calibration, generate the TensorRT engines, and run the harness.
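For anyone hitting the same ImportError, the end-to-end flow inside a GPU-enabled container looks roughly like this. Only `make calibrate` is quoted verbatim in this thread; the other target names follow NVIDIA's MLPerf Makefile conventions and should be checked against the repo's README:

```
make build                                            # rebuild source in the new container
make calibrate RUN_ARGS="--benchmarks=resnet"         # the step that originally failed
make generate_engines RUN_ARGS="--benchmarks=resnet"  # build TensorRT engines
make run_harness RUN_ARGS="--benchmarks=resnet"       # run the benchmark harness
```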

Thank you, nvpohanh.