src-d / coreos-nvidia

Yet another NVIDIA driver container for Container Linux (aka CoreOS)
GNU General Public License v3.0

nvidia-docker v2? #1

Open thomas-riccardi opened 6 years ago

thomas-riccardi commented 6 years ago

Hi, are there plans to use nvidia-docker v2 (now merged into master as the new official version)?

It is simpler to use: https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0

rporres commented 6 years ago

The links above are broken. I guess it's because the 2.0 branch was recently merged into master in https://github.com/NVIDIA/nvidia-docker/commit/fe1874942b896df074ca1b5b819bc6a2ca9e8151

thomas-riccardi commented 6 years ago

@rporres indeed, I updated my comment.

mcuadros commented 6 years ago

Does it require any changes? The current version was built for bare docker, not even nvidia-docker 1.0.

thomas-riccardi commented 6 years ago

Using nvidia-docker v2 would simplify the docker run part, since there would be no need to add:

--volumes-from nvidia-driver \
    --env PATH=$PATH:/opt/nvidia/bin/ \
    --env LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/nvidia/lib \
    $(for d in /dev/nvidia*; do echo -n "--device $d "; done) \

So the change required is in fact to install nvidia-docker v2 on CoreOS and remove the nvidia-driver container.
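For comparison, here is a minimal dry-run sketch of what the invocation could look like once the nvidia runtime is registered with Docker. The helper name gpu_run_v2 is hypothetical and only prints the command, and nvidia/cuda is used as an example image. Images that do not set NVIDIA_VISIBLE_DEVICES themselves would probably need it passed with --env.

```shell
# Hypothetical helper: prints the docker command instead of running it,
# so the result can be inspected on a machine without Docker or a GPU.
gpu_run_v2() {
    # --runtime=nvidia selects the runtime registered by nvidia-docker v2;
    # the image and its command are passed through unchanged.
    echo docker run --rm --runtime=nvidia "$@"
}

gpu_run_v2 nvidia/cuda nvidia-smi
# prints: docker run --rm --runtime=nvidia nvidia/cuda nvidia-smi
```

The --volumes-from, PATH, LD_LIBRARY_PATH and --device flags are gone because the runtime hook is expected to inject the driver files and devices into the container.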

trevex commented 6 years ago

I used the following steps to install nvidia-docker v2 (very hacky, though):

  1. install nvidia driver
  2. instead of the volume I simply copy the files to the host, e.g.
    /usr/bin/docker run --rm --volume /opt/nvidia/current:/output srcd/coreos-nvidia:${VERSION} cp -a /opt/nvidia/. /output/
  3. install libnvidia-container
  4. (build and) install nvidia-container-runtime
  5. create small bash scripts in /run/torcx/bin for nvidia-container{-runtime,-runtime-hook,-cli} so that docker can find them and the driver libraries are in LD_LIBRARY_PATH
  6. create /etc/docker/daemon.json and set default runtime to nvidia
  7. restart docker
  8. add the nvidia-docker bash scripts
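Steps 5 and 6 above could be sketched roughly as follows. This is an illustration under assumptions, not the scripts actually used in this thread: the function names are hypothetical, /opt/bin as the install directory of the real binaries is a guess, and /opt/nvidia/current/lib is only assumed to be where the copied driver libraries end up. The daemon.json keys (default-runtime, runtimes, path, runtimeArgs) are the ones Docker uses to register an additional runtime.

```shell
# Step 5 (sketch): generate a wrapper that extends LD_LIBRARY_PATH and then
# hands over to the real binary. All default paths are assumptions.
write_wrapper() {
    name="$1"                                # e.g. nvidia-container-runtime
    real_dir="${2:-/opt/bin}"                # hypothetical install dir of the real binary
    wrapper_dir="${3:-/run/torcx/bin}"       # directory named in step 5
    lib_dir="${4:-/opt/nvidia/current/lib}"  # assumed location of the driver libraries
    mkdir -p "$wrapper_dir"
    # Unquoted heredoc: $lib_dir and $real_dir expand now, escaped \$ stays literal.
    cat > "$wrapper_dir/$name" <<EOF
#!/bin/bash
export LD_LIBRARY_PATH="$lib_dir\${LD_LIBRARY_PATH:+:\$LD_LIBRARY_PATH}"
exec "$real_dir/$name" "\$@"
EOF
    chmod +x "$wrapper_dir/$name"
}

# Step 6 (sketch): register the nvidia runtime and make it the default.
write_daemon_json() {
    daemon_json="${1:-/etc/docker/daemon.json}"
    runtime_path="${2:-/run/torcx/bin/nvidia-container-runtime}"
    mkdir -p "$(dirname "$daemon_json")"
    cat > "$daemon_json" <<EOF
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "$runtime_path",
      "runtimeArgs": []
    }
  }
}
EOF
}
```

On a host this would be called once per binary (write_wrapper nvidia-container-runtime, and likewise for the -runtime-hook and -cli variants), followed by write_daemon_json and a docker restart as in step 7. Note that write_daemon_json overwrites any existing daemon.json rather than merging into it.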

There is only one issue currently: the nvidia-container-runtime somehow has a regression (even though it is built from the same commit as the installed runc) and fails to run containers with docker run --security-opt=no-new-privileges (https://github.com/coreos/bugs/issues/1796).

lsjostro commented 6 years ago

We have it working as well (nvidia-docker v2 + coreos + k8s device plugin). We will try to clean it up and hopefully be able to share it soonish.

lsjostro commented 6 years ago

We went for this instead: https://github.com/GoogleCloudPlatform/container-engine-accelerators/pull/54

thomas-riccardi commented 6 years ago

@lsjostro I would be interested in your earlier "nvidia-docker v2 + coreos" version, even if it is not cleaned up or production-ready: nvidia-docker v2 enables sharing GPUs between containers (at the cost of losing k8s scheduling), which device-driver solutions don't support (and won't for the foreseeable future).

In any case, https://github.com/GoogleCloudPlatform/container-engine-accelerators/pull/54 is useful too, thanks for that!
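To illustrate the GPU sharing mentioned above, here is a dry-run sketch that prints one docker command per container, each pointed at the same GPU index through NVIDIA_VISIBLE_DEVICES, the environment variable read by nvidia-container-runtime. The helper name share_gpu, the container names and the nvidia/cuda image are placeholders, not anything taken from a working setup in this thread.

```shell
# Hypothetical helper: prints the commands rather than running them.
# First argument is the GPU index, the remaining arguments are container names.
share_gpu() {
    gpu="$1"; shift
    for name in "$@"; do
        echo docker run -d --name "$name" --runtime=nvidia \
            --env NVIDIA_VISIBLE_DEVICES="$gpu" nvidia/cuda sleep infinity
    done
}

share_gpu 0 worker-a worker-b
```

Both printed commands reference GPU 0, so the two containers would see the same device. Nothing here arbitrates memory or compute between them; that is left to the workloads.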