mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

docker run error for image_segmentation/pytorch test following the guide #689

Open gaowayne opened 7 months ago

gaowayne commented 7 months ago

The guide I am following is image_segmentation/pytorch.

When I try to run the container, I get the error below, which says the nvidia runtime does not exist. Could you please shed some light?

[stg@oq1 pytorch]$ sudo docker run --ipc=host -it --rm --runtime=nvidia -v /mnt/pytorch/mlperf/1/training/image_segmentation/pytorch/raw-data-dir:/raw_data -v /mnt/pytorch/mlperf/1/training/image_segmentation/pytorch/data:/data -v /mnt/pytorch/mlperf/1/training/image_segmentation/pytorch/results:/results unet3d:latest /bin/bash
[sudo] password for stg: 
docker: Error response from daemon: unknown or invalid runtime name: nvidia.
See 'docker run --help'.
[stg@oq1 pytorch]$ 
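
The "unknown or invalid runtime name: nvidia" message indicates that the Docker daemon has no runtime registered under that name. A minimal sketch for confirming this from the host is shown here, assuming that `docker info --format '{{json .Runtimes}}'` prints the registered runtimes as a JSON object keyed by name. The helper names `parse_runtimes` and `docker_runtimes` are hypothetical and are not part of the benchmark repository.

```python
import json
import subprocess


def parse_runtimes(info_json: str) -> set:
    """Return the runtime names found in the JSON printed by
    `docker info --format '{{json .Runtimes}}'`."""
    runtimes = json.loads(info_json) or {}
    return set(runtimes)


def docker_runtimes() -> set:
    """Ask the local Docker daemon which runtimes it has registered.

    Raises CalledProcessError if the docker CLI exits with an error,
    for example when the daemon is not reachable.
    """
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_runtimes(result.stdout)
```

Calling `"nvidia" in docker_runtimes()` on the host returns `False` in the situation above. In that case the NVIDIA container runtime most likely still needs to be installed and registered with Docker before `--runtime=nvidia` can work.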

I am using Fedora 37. I could not install CUDA container support because the install script does not support Fedora:

[stg@oq1 training]$ sudo sh install_cuda_docker.sh 
[sudo] password for stg: 
--2023-11-23 19:31:16--  https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 190 [application/octet-stream]
Saving to: ‘cuda-ubuntu2004.pin’

cuda-ubuntu2004.pin                                   100%[======================================================================================================================>]     190  --.-KB/s    in 0s      

2023-11-23 19:31:16 (10.6 MB/s) - ‘cuda-ubuntu2004.pin’ saved [190/190]

mv: cannot move 'cuda-ubuntu2004.pin' to '/etc/apt/preferences.d/cuda-repository-pin-600': No such file or directory
--2023-11-23 19:31:16--  https://developer.download.nvidia.com/compute/cuda/11.6.0/local_installers/cuda-repo-ubuntu2004-11-6-local_11.6.0-510.39.01-1_amd64.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.195.19.142
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.195.19.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2681112370 (2.5G) [application/x-deb]
Saving to: ‘cuda-repo-ubuntu2004-11-6-local_11.6.0-510.39.01-1_amd64.deb’

cuda-repo-ubuntu2004-11-6-local_11.6.0-510.39.01-1_am 100%[======================================================================================================================>]   2.50G  21.1MB/s    in 2m 36s  

2023-11-23 19:33:53 (16.4 MB/s) - ‘cuda-repo-ubuntu2004-11-6-local_11.6.0-510.39.01-1_amd64.deb’ saved [2681112370/2681112370]

sudo: dpkg: command not found
sudo: apt-key: command not found
sudo: apt-get: command not found
sudo: apt-get: command not found
sudo: apt-get: command not found
gpg: can't create '/usr/share/keyrings/docker-archive-keyring.gpg': No such file or directory
gpg: no valid OpenPGP data found.
gpg: dearmoring failed: No such file or directory
curl: (23) Failed writing body
install_cuda_docker.sh: line 15: dpkg: command not found
install_cuda_docker.sh: line 15: lsb_release: command not found
tee: /etc/apt/sources.list.d/docker.list: No such file or directory
sudo: apt: command not found
sudo: apt-get: command not found
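
The log above shows the script calling `dpkg`, `apt-key`, `apt-get`, and `apt`, none of which exist on Fedora. A minimal sketch for checking which package manager a host provides before running such a script is shown here. The candidate list and the function name `detect_package_manager` are illustrative assumptions, not something taken from `install_cuda_docker.sh`.

```python
import shutil

# Package manager executables to look for, in order of preference.
# This list is an assumption for illustration and is not exhaustive.
CANDIDATES = ("apt-get", "dnf", "yum", "zypper")


def detect_package_manager(which=shutil.which):
    """Return the first package manager found on PATH, or None.

    `which` is injectable so the lookup can be exercised without
    touching the real system.
    """
    for name in CANDIDATES:
        if which(name):
            return name
    return None
```

On a host where this returns something other than `apt-get`, an apt-based install script such as the one above cannot succeed as written, and the distribution's own packages would be needed instead.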

gaowayne commented 7 months ago

Guys, I installed nvidia docker on Fedora and can now start the container, but the next step fails with the error below. How can I fix this?

root@6ec7b9c99e06:/# ls
bin  boot  data  dev  etc  home  lib  lib64  media  mnt  opt  proc  raw_data  results  root  run  sbin  srv  sys  tmp  usr  var  workspace
root@6ec7b9c99e06:/# cd workspace/unet3d/
root@6ec7b9c99e06:/workspace/unet3d# python3 preprocess_dataset.py --data_dir /raw_data --results_dir /data
Preprocessing /raw_data
/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/opt/conda/lib/python3.8/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
Mean value: nan, std: nan, d: nan, h: nan, w: nan
Traceback (most recent call last):
  File "preprocess_dataset.py", line 147, in <module>
    verify_dataset(args.results_dir)
  File "preprocess_dataset.py", line 127, in verify_dataset
    assert len(source) == len(os.listdir(results_dir))
AssertionError
root@6ec7b9c99e06:/workspace/unet3d# 
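
The "Mean of empty slice" warnings and the `nan` statistics suggest that the script found no input cases under `/raw_data`, so nothing was written to `/data` and the final count check failed. A minimal sketch for inspecting the mounted directory is shown here, assuming the KiTS19 layout of `case_*` subdirectories that each contain `imaging.nii.gz`. The function name `find_cases` is hypothetical, and the sketch does not reproduce the logic of `preprocess_dataset.py`.

```python
from pathlib import Path


def find_cases(data_dir):
    """Return the sorted names of case_* subdirectories of data_dir
    that contain an imaging.nii.gz file.

    The case_*/imaging.nii.gz layout is an assumption based on the
    KiTS19 dataset, not something read from the preprocessing script.
    """
    root = Path(data_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_dir()
        and p.name.startswith("case_")
        and (p / "imaging.nii.gz").is_file()
    )
```

If `find_cases("/raw_data")` returns an empty list inside the container, the `-v` mount for `/raw_data` probably points at an empty directory or at the wrong level of the downloaded dataset tree.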

gaowayne commented 6 months ago

Guys, I reinstalled the host OS with Ubuntu 22.04 and I still see this error. Could you please shed some light?

dcg@oq1:/mnt/nvme1n1/mlperf/ubuntu/training/image_segmentation/pytorch$ sudo docker run --ipc=host -it --rm --runtime=nvidia -v /mnt/nvme1n1/mlperf/ubuntu/training/image_segmentation/pytorch/raw-data-dir:/raw_data -v /mnt/nvme1n1/mlperf/ubuntu/training/image_segmentation/pytorch/data:/data -v /mnt/nvme1n1/mlperf/ubuntu/training/image_segmentation/pytorch/results:/results unet3d:latest /bin/bash
root@7f2d8fc3d617:/workspace/unet3d# ls
Dockerfile  LICENCE  README.md  checksum.json  data_loading  evaluation_cases.txt  main.py  model  oldREADME.md  preprocess_dataset.py  requirements.txt  run_and_time.sh  runtime
root@7f2d8fc3d617:/workspace/unet3d# python3 preprocess_dataset.py --data_dir /raw_data --results_dir /data
Preprocessing /raw_data
/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/opt/conda/lib/python3.8/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
Mean value: nan, std: nan, d: nan, h: nan, w: nan
Traceback (most recent call last):
  File "preprocess_dataset.py", line 147, in <module>
    verify_dataset(args.results_dir)
  File "preprocess_dataset.py", line 127, in verify_dataset
    assert len(source) == len(os.listdir(results_dir))
AssertionError
root@7f2d8fc3d617:/workspace/unet3d#