Open weiji14 opened 2 years ago
Yeah, any way to easily verify this? Maybe @quasiben has an idea what hardware / networking combination might work?
There's this script https://github.com/rapidsai/kvikio/blob/29c52f76035002d91f301895250c0ff14f18f50a/python/benchmarks/single-node-io.py to check for GDS compatibility. You might need to install a few other packages to fix ImportErrors, but the gist is (using the raw file URL, since the blob URL returns the HTML page rather than the script):
wget https://raw.githubusercontent.com/rapidsai/kvikio/29c52f76035002d91f301895250c0ff14f18f50a/python/benchmarks/single-node-io.py
python single-node-io.py
These are the results I got on the Microsoft Planetary Computer PyTorch container (copied from https://github.com/xarray-contrib/xbatcher/issues/87#issuecomment-1242803180):
----------------------------------
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING - KvikIO compat mode
libcufile.so not used
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
GPU | Unknown (install pynvml)
GPU Memory Total | Unknown (install pynvml)
BAR1 Memory Total | Unknown (install pynvml)
GDS driver | N/A (Compatibility Mode)
GDS config.json | /etc/cufile.json
----------------------------------
nbytes | 10485760 bytes (10.00 MiB)
4K aligned | True
pre-reg-buf | True
diretory | /tmp/tmp9a8nd5kz
nthreads | 1
nruns | 1
==================================
cufile read | 4.28 GiB/s
cufile write | 92.59 MiB/s
posix read | 1.23 GiB/s
posix write | 1.24 GiB/s
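To compare the cufile and posix rows more easily, here is a rough sketch that parses the throughput lines and prints the cufile/posix ratio for each operation. The helper name `parse_throughput` is made up for illustration and is not part of KvikIO. Note that the run above used nruns of 1 on a 10 MiB file, so each number is a single sample and the ratios should be read loosely.

```python
import re

# Conversion factors to MiB/s for the units the benchmark prints.
UNITS = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_throughput(report):
    """Hypothetical helper: map 'name | X GiB/s' rows to throughput in MiB/s."""
    rows = {}
    for line in report.splitlines():
        m = re.match(r"\s*(.+?)\s*\|\s*([\d.]+)\s*(KiB|MiB|GiB)/s\s*$", line)
        if m:
            rows[m.group(1)] = float(m.group(2)) * UNITS[m.group(3)]
    return rows


# Throughput rows copied from the benchmark output above.
report = """cufile read | 4.28 GiB/s
cufile write | 92.59 MiB/s
posix read | 1.23 GiB/s
posix write | 1.24 GiB/s"""

rates = parse_throughput(report)
for op in ("read", "write"):
    ratio = rates[f"cufile {op}"] / rates[f"posix {op}"]
    print(f"{op}: cufile/posix = {ratio:.2f}x")
```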
I don't have sudo permissions, but if you have time, maybe try sudo apt install nvidia-gds on the staging container and see if NVIDIA GPU Direct Storage is supported?
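As a quick sanity check after installing, something like the sketch below could show whether libcufile is at least loadable and whether the config file is present. It only probes for the library and /etc/cufile.json (the path reported in the benchmark output above); it does not prove that GDS actually works on the underlying hardware, so the benchmark script is still the real test. The candidate library names and the function names are assumptions for illustration.

```python
import ctypes
import os

# Candidate sonames are an assumption; the exact name depends on how CUDA/GDS is packaged.
CANDIDATE_LIBS = ("libcufile.so", "libcufile.so.0")
# Config path as reported by the benchmark output ("GDS config.json").
CUFILE_CONFIG = "/etc/cufile.json"


def find_cufile(candidates=CANDIDATE_LIBS):
    """Return the first libcufile name the dynamic loader can open, else None."""
    for name in candidates:
        try:
            ctypes.CDLL(name)
            return name
        except OSError:
            continue
    return None


def gds_probe():
    """Collect two weak hints about GDS availability (library and config file)."""
    return {
        "libcufile": find_cufile(),
        "config_present": os.path.isfile(CUFILE_CONFIG),
    }


if __name__ == "__main__":
    print(gds_probe())
```

If `libcufile` comes back as None, KvikIO would presumably stay in compatibility mode, as in the output above.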
Unfortunately, I don't think GDS is supported on cloud infra (even with mounted NVMe), but the GDS team is working on it. @cnewburn can you comment with additional thoughts?
I spoke with the GDS team and they are working on addressing this issue. We expect this to be available in the next CUDA release.
Hi there,
I was wondering if it's possible to enable NVIDIA GPU Direct Storage on Microsoft Planetary Computer? This could enable reading Zarr files directly into GPU memory from cloud storage, and we'd be excited to have a demo use-case running (xref https://github.com/xarray-contrib/xbatcher/issues/87).
Packages that need to be installed:
References:
Might need to check if the Azure cluster supports GPU Direct Storage first, but if it does, I can open up PRs to add these into the PyTorch and/or TensorFlow containers :smile: