opensciencegrid / osgvo-tensorflow-gpu

OSGVO's TensorFlow image, GPU flavor

Singularity --nv support? #9

Open khurtado opened 4 years ago

khurtado commented 4 years ago

Hi,

ND recently bought a GPU cluster, and we are looking forward to using it both locally and sharing it with the OSG.

It seems this image should be able to handle submissions via, e.g., OSG Connect, which end up using the Singularity wrapper that the OSG Factory pilots run.

Is it possible to use it locally with bare singularity too? The idea is that users could submit workflows either locally or through the grid, using the same kind of containers.

I tried using singularity exec --nv to see if that would replace the library-linking magic the Singularity wrapper in the factory does, but that didn't work with this image. It did work with a TensorFlow image that the NOVA experiment maintains (we are not part of NOVA and don't know what their maintenance plan is, though).

Here are the outputs for comparison. Any suggestions? Any chance of making this image compatible with --nv?

OSG Tensorflow output

 singularity exec --nv /cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:latest python -c "import tensorflow as tf; sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))"
WARNING: container does not have /.singularity.d/actions/exec, calling python directly
2020-03-18 14:14:08.446525: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-03-18 14:14:08.456150: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUresult(803)
2020-03-18 14:14:08.456567: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: qa-v100-011.crc.nd.edu
2020-03-18 14:14:08.456707: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: qa-v100-011.crc.nd.edu
2020-03-18 14:14:08.456879: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 440.33.1
2020-03-18 14:14:08.457068: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 440.36.0
2020-03-18 14:14:08.457195: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:308] kernel version 440.36.0 does not match DSO version 440.33.1 -- cannot find working devices in this configuration
Device mapping: no known devices.
2020-03-18 14:14:08.459881: I tensorflow/core/common_runtime/direct_session.cc:288] Device mapping:

NOVA output:

singularity exec --nv /cvmfs/singularity.opensciencegrid.org/novaexperiment/el7-tensorflow-gpu:latest python -c "import tensorflow as tf; sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))"
2020-03-18 14:19:10.301568: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-03-18 14:19:10.455478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:06:00.0
totalMemory: 31.75GiB freeMemory: 31.44GiB
2020-03-18 14:19:10.599308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:2f:00.0
totalMemory: 31.75GiB freeMemory: 31.44GiB
2020-03-18 14:19:10.751820: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:86:00.0
totalMemory: 31.75GiB freeMemory: 31.44GiB
2020-03-18 14:19:10.889939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:d8:00.0
totalMemory: 31.75GiB freeMemory: 31.44GiB
2020-03-18 14:19:10.899795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2020-03-18 14:19:12.568067: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-18 14:19:12.568200: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1 2 3
2020-03-18 14:19:12.568310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y Y Y
2020-03-18 14:19:12.568405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N Y Y
2020-03-18 14:19:12.568498: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2:   Y Y N Y
2020-03-18 14:19:12.568587: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3:   Y Y Y N
2020-03-18 14:19:12.568912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30503 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0)
2020-03-18 14:19:12.570045: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30503 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:2f:00.0, compute capability: 7.0)
2020-03-18 14:19:12.570814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 30503 MB memory) -> physical GPU (device: 2, name: Tesla V100-PCIE-32GB, pci bus id: 0000:86:00.0, compute capability: 7.0)
2020-03-18 14:19:12.571148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 30503 MB memory) -> physical GPU (device: 3, name: Tesla V100-PCIE-32GB, pci bus id: 0000:d8:00.0, compute capability: 7.0)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:2f:00.0, compute capability: 7.0
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: Tesla V100-PCIE-32GB, pci bus id: 0000:86:00.0, compute capability: 7.0
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: Tesla V100-PCIE-32GB, pci bus id: 0000:d8:00.0, compute capability: 7.0
2020-03-18 14:19:12.574249: I tensorflow/core/common_runtime/direct_session.cc:307] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:2f:00.0, compute capability: 7.0
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: Tesla V100-PCIE-32GB, pci bus id: 0000:86:00.0, compute capability: 7.0
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: Tesla V100-PCIE-32GB, pci bus id: 0000:d8:00.0, compute capability: 7.0
efajardo commented 4 years ago

Well the error is pretty clear:

2020-03-18 14:14:08.456879: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 440.33.1
2020-03-18 14:14:08.457068: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 440.36.0
2020-03-18 14:14:08.457195: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:308] kernel version 440.36.0 does not match DSO version 440.33.1 -- cannot find working devices in this configuration
Device mapping: no known devices.

In theory this case should work, because 440.36.0 is newer than 440.33.1 and backward compatibility should take care of it. @rynge should it be enough to rebuild the container? Or should we move to a newer CUDA?
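
One quick way to see the two numbers being compared, i.e. the user-space driver library baked into the image vs. the kernel module on the host (just a sketch; the library path inside this Ubuntu-based image is an assumption):

singularity exec /cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:latest \
    ls -l /usr/lib/x86_64-linux-gnu/libcuda.so*    # DSO version shipped in the image
cat /proc/driver/nvidia/version                    # kernel driver version on the host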

rynge commented 4 years ago

Good question. I will try rebuilding the image.

khurtado commented 4 years ago

Thanks! Unrelated question: does this or another container supported by OSG include PyTorch?

khurtado commented 4 years ago

@rynge @efajardo : Just to update on this. It seems rebuilding the image did the trick. This is working now. Thanks!

On a side note, I have a question if you don't mind: I was looking at the TF image maintained by NOVA. https://github.com/CN-Healthborn/el7-tensorflow-gpu/blob/master/Dockerfile

It seems similar to this one, but:

1) It uses a CentOS 7 base supported by NVIDIA that I don't think existed in the past. This is nice because you can then easily use software available in OASIS (gfal, xrootd, etc.).
2) They don't seem to install the cuda-drivers package. I'm not sure whether this means they rely on the --nv feature instead to locate the basic CUDA libraries on the host and bind them into the container, so that they are available inside and match the host's kernel GPU driver.

Is there any technical reason not to use the CentOS 7 base, or the --nv and/or --rocm features, in the Singularity wrapper used by the factories? E.g., do sites or HPC facilities still run Singularity versions that lack these features? I'm just asking because it seems that would mean less long-term maintenance work for the image. Maybe I'm misunderstanding something, though, so feel free to correct me on anything I might have wrong :)
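
If it helps, one quick way to check that --nv is binding the host driver libraries into the container (which is what would make an image without cuda-drivers work) is something like the following. This is only a sketch: /.singularity.d/libs is where I believe recent Singularity versions place the libraries bound in by --nv.

singularity exec --nv \
    /cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:latest \
    ls /.singularity.d/libs
# libcuda.so*, libnvidia-ml.so*, etc. should appear here and match the host driver,
# so the image itself would not need the cuda-drivers package installed.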

Output now:

singularity exec --nv /cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:latest python -c "import tensorflow as tf; sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))"

WARNING: container does not have /.singularity.d/actions/exec, calling python directly
2020-03-20 10:44:19.405009: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-03-20 10:44:19.688234: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:06:00.0
totalMemory: 31.75GiB freeMemory: 31.44GiB
2020-03-20 10:44:19.840247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:2f:00.0
totalMemory: 31.75GiB freeMemory: 31.44GiB
2020-03-20 10:44:20.004394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 2 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:86:00.0
totalMemory: 31.75GiB freeMemory: 31.44GiB
2020-03-20 10:44:20.143910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 3 with properties:
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:d8:00.0
totalMemory: 31.75GiB freeMemory: 31.44GiB
2020-03-20 10:44:20.153507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2, 3
2020-03-20 10:44:43.618985: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-20 10:44:43.619839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1 2 3
2020-03-20 10:44:43.619948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y Y Y
2020-03-20 10:44:43.620040: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N Y Y
2020-03-20 10:44:43.620117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2:   Y Y N Y
2020-03-20 10:44:43.620207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3:   Y Y Y N
2020-03-20 10:44:43.620757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30501 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0)
2020-03-20 10:44:43.904673: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30501 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:2f:00.0, compute capability: 7.0)
2020-03-20 10:44:44.188372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 30501 MB memory) -> physical GPU (device: 2, name: Tesla V100-PCIE-32GB, pci bus id: 0000:86:00.0, compute capability: 7.0)
2020-03-20 10:44:44.468798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 30501 MB memory) -> physical GPU (device: 3, name: Tesla V100-PCIE-32GB, pci bus id: 0000:d8:00.0, compute capability: 7.0)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:2f:00.0, compute capability: 7.0
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: Tesla V100-PCIE-32GB, pci bus id: 0000:86:00.0, compute capability: 7.0
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: Tesla V100-PCIE-32GB, pci bus id: 0000:d8:00.0, compute capability: 7.0
2020-03-20 10:44:44.759556: I tensorflow/core/common_runtime/direct_session.cc:288] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:06:00.0, compute capability: 7.0
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: Tesla V100-PCIE-32GB, pci bus id: 0000:2f:00.0, compute capability: 7.0
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: Tesla V100-PCIE-32GB, pci bus id: 0000:86:00.0, compute capability: 7.0
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: Tesla V100-PCIE-32GB, pci bus id: 0000:d8:00.0, compute capability: 7.0
rynge commented 4 years ago

The Ubuntu vs. CentOS part is just that Ubuntu was what was available when we created the image. Most of our users are probably building their own images anyway.

At least OSG VO is using --nv to invoke the containers now.

Singularity versions still vary widely across OSG, but most sites are at least running 3.4 or 3.5.
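
Roughly, the wrapper ends up doing something along these lines (a simplified sketch, not the actual osg-flock wrapper code; the variable names are made up):

GPU_ARGS=""
if [ -e /dev/nvidia0 ]; then
    GPU_ARGS="--nv"    # only add GPU support when the host actually exposes NVIDIA devices
fi
singularity exec $GPU_ARGS "$SINGULARITY_IMAGE" "$JOB_WRAPPER"

The important part is that --nv makes Singularity bind the host's driver libraries into the container, so the image no longer has to carry a matching driver itself.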

khurtado commented 4 years ago

@rynge Thank you for the explanation! And yes, I can see --nv in the OSG wrapper now.

@efajardo Is there any technical reason why the CMS Singularity user job wrapper is not using --nv yet? Or is it just that this task hasn't made it onto the priority list (e.g., because of low demand so far)?

efajardo commented 4 years ago

Well, it just hasn't made it yet. I can make a pull request to adapt that.

khurtado commented 4 years ago

A PR sounds good. Thank you both for the prompt responses!

khurtado commented 4 years ago

Hi again. @efajardo I wanted to open an issue against https://github.com/opensciencegrid/osg-flock/blob/master/job-wrappers but couldn't find a way to do so, so I'm following up in this thread instead (please let me know if I should submit this somewhere else).

Even though using this container with singularity --nv works manually, I'm having problems running jobs through OSG Connect with this image (on the new ND GPU cluster that was just added to the factories).

When I try to run: https://github.com/OSGConnect/tutorial-tensorflow-matmul

I'm still seeing CUDA errors, although this time it seems to be complaining about the format of the driver version:

2020-03-20 17:08:25.157507: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
libcuda reported version is: Invalid argument: expected %d.%d, %d.%d.%d, or %d.%d.%d.%d form for driver version; got "1"
2020-03-20 17:08:25.157886: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 440.36.0
2020-03-20 17:08:25.159414: I tensorflow/core/common_runtime/direct_session.cc:288] Device mapping:
Device mapping: no known devices.

Full output is here:

OSG Singularity wrapper: LD_LIBRARY_PATH is set to /cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.45/el7-x86_64/lib64:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.45/el7-x86_64/lib:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.45/el7-x86_64/usr/lib64:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.45/el7-x86_64/usr/lib:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.45/el7-x86_64/usr/lib64/dcap:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.45/el7-x86_64/usr/lib64/lcgdm:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.45/el7-x86_64/lib64:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.45/el7-x86_64/lib:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.45/el7-x86_64/usr/lib64:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.45/el7-x86_64/usr/lib:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.45/el7-x86_64/usr/lib64/dcap:/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.4/3.4.45/el7-x86_64/usr/lib64/lcgdm::/opt/condor/condor-8.8.7-x86_64_RedHat7-stripped/lib/condor outside Singularity. This will not be propagated to inside the container instance.
WARNING: group: unknown groupid 1003
WARNING: container does not have /.singularity.d/actions/exec, calling /srv/.osgvo-user-job-wrapper.sh directly
Hostname:qa-rtx6k-033.crc.nd.edu
Operative system:
Ubuntu 16.04.6 LTS \n \l

/bin/nvidia-smi

2020-03-20 17:08:25.136491: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-03-20 17:08:25.157507: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2020-03-20 17:08:25.157799: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: qa-rtx6k-033.crc.nd.edu
2020-03-20 17:08:25.157807: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: qa-rtx6k-033.crc.nd.edu
2020-03-20 17:08:25.157847: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: Invalid argument: expected %d.%d, %d.%d.%d, or %d.%d.%d.%d form for driver version; got "1"
2020-03-20 17:08:25.157886: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 440.36.0
2020-03-20 17:08:25.159414: I tensorflow/core/common_runtime/direct_session.cc:288] Device mapping:

2020-03-20 17:08:25.163227: I tensorflow/core/common_runtime/direct_session.cc:288] Device mapping:

2020-03-20 17:08:25.164483: I tensorflow/core/common_runtime/placer.cc:935] MatrixInverse: (MatrixInverse)/job:localhost/replica:0/task:0/device:CPU:0
2020-03-20 17:08:25.164499: I tensorflow/core/common_runtime/placer.cc:935] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:CPU:0
2020-03-20 17:08:25.164505: I tensorflow/core/common_runtime/placer.cc:935] Const: (Const)/job:localhost/replica:0/task:0/device:CPU:0
Device mapping: no known devices.
Device mapping: no known devices.
MatrixInverse: (MatrixInverse): /job:localhost/replica:0/task:0/device:CPU:0
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:CPU:0
Const: (Const): /job:localhost/replica:0/task:0/device:CPU:0
result of matrix multiplication
===============================
[[ 1.0000000e+00  0.0000000e+00]
 [-4.7683716e-07  1.0000002e+00]]
===============================
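
For what it's worth, a few checks from inside the job could help narrow this down, since the error is CUDA_ERROR_NO_DEVICE and the driver version string looks mangled (just a sketch; the environment variable and device paths are assumptions about the job environment):

echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"               # did the batch system assign a GPU?
ls -l /dev/nvidia*                                              # are the device nodes visible inside the container?
nvidia-smi --query-gpu=driver_version --format=csv,noheader     # what driver version does the host report?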