tensorflow / profiler

A profiling and performance analysis tool for TensorFlow
Apache License 2.0

CUPTI_ERROR_INSUFFICIENT_PRIVILEGES in Docker #63

Open johnbensnyder opened 4 years ago

johnbensnyder commented 4 years ago

GPU profiling in Docker requires including the docker run option '--privileged=true'.

Topic is discussed in this issue:

https://github.com/tensorflow/tensorflow/issues/35860

Can Docker setup instructions be included on the profiler setup page?

https://www.tensorflow.org/guide/profiler

ckluk commented 4 years ago

Thanks for the suggestion. We will add the Docker setup instructions to the profiler guide as suggested.

-ck

d-miketa commented 4 years ago

Good to hear, @ckluk, thank you! If you're open to taking requests, I'd be very interested in a Docker setup in which:

a) GPU profiling works
b) the container is run as a normal user (so that all newly created files, e.g. logs and saved models, are owned by the user, not root)

but I can't get both to work at the same time. I have the following in (the host machine's) /etc/modprobe.d/nvidia-kernel-common.conf:

options nvidia "NVreg_RestrictProfilingToAdminUsers=0"

and I ran update-initramfs -u after adding it (and rebooted afterwards).

The Docker container is created by

docker run -it --gpus=all --rm --user "$(id -u):$(id -g)" dom/tensorflow:2.2.0-gpu

(plus some volume binds etc.). Unfortunately, this setup leads to CUPTI_ERROR_INSUFFICIENT_PRIVILEGES.
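Before looking at the container, it can help to confirm that the modprobe option actually took effect on the host. The following is a minimal sketch, assuming the driver exposes its parameters in /proc/driver/nvidia/params with a line of the form "RmProfilingAdminOnly: 0" (the check described on NVIDIA's ERR_NVGPUCTRPERM page). The function name profiling_restricted is hypothetical; it reads from stdin so it can be tried without a GPU.

```shell
#!/bin/sh
# Report whether the NVIDIA driver restricts GPU profiling to admin users.
# Reads driver parameters from stdin; the "RmProfilingAdminOnly: N" line
# format is an assumption about /proc/driver/nvidia/params.
profiling_restricted() {
    # Extract the value after the parameter name; empty if the line is absent.
    value=$(sed -n 's/^RmProfilingAdminOnly:[[:space:]]*//p' | head -n 1)
    case "$value" in
        0) echo "unrestricted" ;;
        1) echo "admin-only" ;;
        *) echo "unknown" ;;
    esac
}

# Usage on the host (the file exists only while the NVIDIA driver is loaded):
#   profiling_restricted < /proc/driver/nvidia/params
```

If this reports "admin-only" after a reboot, the option was probably not picked up by the driver, and the container settings are not yet the problem.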

ckluk commented 4 years ago

Hi Dom,

I don't think we can do much from the Profiler end, as the privilege requirement comes from CUPTI. In the future (probably around the TF 2.4 release), TF will use CUDA 11. My understanding is that this CUPTI privilege requirement should no longer apply with CUDA 11.

Thanks, -ck

d-miketa commented 4 years ago

Thanks @ckluk. I was hoping it's a matter of bad setup, but it's good to hear it'll at least eventually get resolved.

adampl commented 4 years ago

@d-miketa Instead of running the container with --privileged=true, try --cap-add=CAP_SYS_ADMIN

More info: https://developer.nvidia.com/nvidia-development-tools-solutions-err-nvgpuctrperm-cupti

d-miketa commented 4 years ago

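I ended up doing the following, some subset of which seems to have done the trick:

  • updating host machine to Ubuntu 20.04
  • adding options nvidia "NVreg_RestrictProfilingToAdminUsers=0" to /etc/modprobe.d/nvidia-kernel-common.conf and running update-initramfs -u
  • adding export CUDA_VERSION="10.1", export LD_LIBRARY_PATH="/usr/local/cuda-${CUDA_VERSION}/lib64:/usr/local/cuda-${CUDA_VERSION}/extras/CUPTI/lib64" and export LD_INCLUDE_PATH="/usr/local/cuda-${CUDA_VERSION}/include:/usr/local/cuda-${CUDA_VERSION}/extras/CUPTI/include" to the host machine's .zshrc
  • adding ENV LD_INCLUDE_PATH="/usr/local/cuda/include:/usr/local/cuda/extras/CUPTI/include:$LD_INCLUDE_PATH" to the Dockerfile
  • running the Docker container with --privileged

It's possible that --cap-add=CAP_SYS_ADMIN would work as well as --privileged, but I haven't tried.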

dhiren-hamal commented 3 years ago

I ended up doing the following, some subset of which seems to have done the trick:

  • updating host machine to Ubuntu 20.04
  • adding options nvidia "NVreg_RestrictProfilingToAdminUsers=0" to /etc/modprobe.d/nvidia-kernel-common.conf and running update-initramfs -u
  • adding export CUDA_VERSION="10.1", export LD_LIBRARY_PATH="/usr/local/cuda-${CUDA_VERSION}/lib64:/usr/local/cuda-${CUDA_VERSION}/extras/CUPTI/lib64" and export LD_INCLUDE_PATH="/usr/local/cuda-${CUDA_VERSION}/include:/usr/local/cuda-${CUDA_VERSION}/extras/CUPTI/include" to the host machine's .zshrc
  • adding ENV LD_INCLUDE_PATH="/usr/local/cuda/include:/usr/local/cuda/extras/CUPTI/include:$LD_INCLUDE_PATH" to the Dockerfile
  • running the Docker container with --privileged

It's possible that --cap-add=CAP_SYS_ADMIN would work as well as --privileged, but I haven't tried.
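Since it is unclear which of these steps mattered, one thing that can be checked in isolation is whether a CUPTI library is actually present in any of the directories placed on LD_LIBRARY_PATH. This is a rough sketch for sh or bash (not zsh, which splits words differently); the function name find_cupti_dir is hypothetical, and the libcupti.so* file name pattern is an assumption about how the CUDA toolkit names the library.

```shell
#!/bin/sh
# Print the first directory in a colon-separated path list that contains a
# libcupti shared library; return non-zero if none does.
find_cupti_dir() {
    path_list=$1
    old_ifs=$IFS
    IFS=:
    for dir in $path_list; do
        IFS=$old_ifs
        # Skip empty entries such as a leading or doubled colon.
        [ -n "$dir" ] || continue
        for lib in "$dir"/libcupti.so*; do
            if [ -e "$lib" ]; then
                echo "$dir"
                return 0
            fi
        done
    done
    IFS=$old_ifs
    return 1
}

# Usage inside the container or on the host:
#   find_cupti_dir "$LD_LIBRARY_PATH" || echo "no CUPTI library on LD_LIBRARY_PATH"
```

A missing library would show up as a different failure than CUPTI_ERROR_INSUFFICIENT_PRIVILEGES, so this only rules out one of the listed steps as the cause.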

Hi! How do I pass those parameters to the Docker container? I ran the following but got an error:

nvidia-docker run -d -it --name retina_net -v /home/readib/Experiments/:/ -p 8000:8888 -v /tmp/.X11-unix/:/tmp/.X11-unix -e DISPLAY=$DISPLAY retina_net:latest --cap-add=CAP_SYS_ADMIN /bin/bash

Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: exec: "--cap-add=CAP_SYS_ADMIN": executable file not found in $PATH: unknown

Thank you.

dhiren-hamal commented 3 years ago

The container runs when the option is placed before the image name:

nvidia-docker run '--privileged=true' -d -it --name retina_net -v /home/readib/Experiments/:/home -p 8000:8888 -v /tmp/.X11-unix/:/tmp/.X11-unix -e DISPLAY=$DISPLAY retina_net:latest /bin/bash
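The "executable file not found" error earlier in the thread follows from the general form docker run [OPTIONS] IMAGE [COMMAND] [ARG...]: anything after the image name is treated as the command to run inside the container, so an option placed there is executed rather than parsed. The sketch below only assembles and prints a command line with the options in the right position; build_run_cmd is a hypothetical helper, and nothing is actually run.

```shell
#!/bin/sh
# Assemble a `docker run` command line with every option placed before the
# image name. Prints the command instead of executing it.
# Usage: build_run_cmd IMAGE COMMAND [OPTION...]
build_run_cmd() {
    image=$1
    cmd=$2
    shift 2
    # "$*" joins the remaining arguments (the options) with spaces.
    echo "docker run $* $image $cmd"
}

# Example with the image name from this thread:
#   build_run_cmd retina_net:latest /bin/bash --privileged=true -d -it
```

The same ordering applies to --cap-add and to nvidia-docker run, which wraps docker run.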