microsoft / planetary-computer-containers

Container definitions for the Planetary Computer
MIT License

NVIDIA GPU direct storage #51

Open weiji14 opened 2 years ago

weiji14 commented 2 years ago

Hi there,

I was wondering whether it's possible to enable NVIDIA GPUDirect Storage (GDS) on Microsoft Planetary Computer. This could enable reading Zarr files from cloud storage directly into GPU memory, and we'd be excited to have a demo use case running (xref https://github.com/xarray-contrib/xbatcher/issues/87).

Packages that need to be installed:

References:

Might need to check if the Azure cluster supports GPU direct storage first, but if it does, I can open up PRs to add these into the PyTorch and/or TensorFlow containers :smile:

TomAugspurger commented 2 years ago

> Might need to check if the Azure cluster supports GPU direct storage first

Yeah, any way to easily verify this? Maybe @quasiben has an idea what hardware / networking combination might work?

weiji14 commented 2 years ago

There's this script at https://github.com/rapidsai/kvikio/blob/29c52f76035002d91f301895250c0ff14f18f50a/python/benchmarks/single-node-io.py to check for GDS compatibility. You might need to install a few other packages to fix ImportErrors, but the gist is:

wget https://raw.githubusercontent.com/rapidsai/kvikio/29c52f76035002d91f301895250c0ff14f18f50a/python/benchmarks/single-node-io.py
python single-node-io.py

These are the results I got on the Microsoft Planetary Computer PyTorch container (copied from https://github.com/xarray-contrib/xbatcher/issues/87#issuecomment-1242803180):

----------------------------------
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   WARNING - KvikIO compat mode   
      libcufile.so not used       
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
GPU               | Unknown (install pynvml)
GPU Memory Total  | Unknown (install pynvml)
BAR1 Memory Total | Unknown (install pynvml)
GDS driver        | N/A (Compatibility Mode)
GDS config.json   | /etc/cufile.json
----------------------------------
nbytes            | 10485760 bytes (10.00 MiB)
4K aligned        | True
pre-reg-buf       | True
diretory          | /tmp/tmp9a8nd5kz
nthreads          | 1
nruns             | 1
==================================
cufile read       |   4.28 GiB/s
cufile write      |  92.59 MiB/s
posix read        |   1.23 GiB/s
posix write       |   1.24 GiB/s
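
To compare the numbers above in a single unit, here is a minimal sketch that parses the report and converts each throughput to MiB/s. The `parse_report` helper and the `UNIT_TO_MIB` table are hypothetical names made up for illustration, not part of KvikIO. Since the report says compatibility mode was active, the "cufile" rows here come from KvikIO's POSIX fallback rather than from GDS itself.

```python
import re

# Excerpt of the benchmark report quoted above (throughput rows and GDS status only).
REPORT = """\
GDS driver        | N/A (Compatibility Mode)
cufile read       |   4.28 GiB/s
cufile write      |  92.59 MiB/s
posix read        |   1.23 GiB/s
posix write       |   1.24 GiB/s
"""

# Conversion factors to MiB/s (hypothetical helper table for this sketch).
UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_report(text):
    """Return (compat_mode, {label: throughput in MiB/s})."""
    compat_mode = "Compatibility Mode" in text
    rates = {}
    for line in text.splitlines():
        # Match rows shaped like "<label> | <number> <unit>/s".
        match = re.match(r"\s*(.+?)\s*\|\s*([\d.]+)\s*(KiB|MiB|GiB)/s\s*$", line)
        if match:
            label, value, unit = match.groups()
            rates[label] = float(value) * UNIT_TO_MIB[unit]
    return compat_mode, rates


compat_mode, rates = parse_report(REPORT)
print("compat mode:", compat_mode)
for label, mib_per_s in rates.items():
    print(f"{label:<13} {mib_per_s:10.2f} MiB/s")
```

With `nruns` set to 1 and only 10 MiB transferred, these figures are a single sample, so the gap between the cufile and posix write rows should be read as indicative at best.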

I don't have sudo permissions, but if you have time, maybe try sudo apt install nvidia-gds on the staging container and see if NVIDIA GPU Direct Storage is supported?
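
Without sudo, a few GDS prerequisites can still be probed from user space. The sketch below uses only the Python standard library. It assumes that the nvidia-fs kernel module exposes a `/proc/driver/nvidia-fs` directory when loaded, and it reuses the `/etc/cufile.json` path from the report above; `gds_prerequisites` is a hypothetical helper name. A negative result is not conclusive, since libcufile may be installed outside the default linker search path.

```python
import ctypes.util
import os


def gds_prerequisites(cufile_config="/etc/cufile.json",
                      nvidia_fs_proc="/proc/driver/nvidia-fs"):
    """Probe a few user-space hints that GDS might be usable.

    Both default paths are assumptions for this sketch; adjust them
    if the container lays things out differently.
    """
    return {
        # Looks for libcufile on the default linker search path only.
        "libcufile found": ctypes.util.find_library("cufile") is not None,
        # The config file path reported by the KvikIO benchmark.
        "cufile config present": os.path.isfile(cufile_config),
        # Assumed location of the nvidia-fs kernel module's proc entries.
        "nvidia-fs module loaded": os.path.isdir(nvidia_fs_proc),
    }


for check, ok in gds_prerequisites().items():
    print(f"{check:<24} | {'yes' if ok else 'no'}")
```

If all three checks come back "no", KvikIO falling back to compatibility mode is the expected behaviour, which matches the warning banner in the report.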

quasiben commented 2 years ago

Unfortunately, I don't think GDS is supported on cloud infra (even with mounted NVMe), but the GDS team is working on it. @cnewburn, can you comment with additional thoughts?

quasiben commented 2 years ago

I spoke with the GDS team and they are working on addressing this issue. We expect this to be available in the next CUDA release.