rapidsai / deployment

RAPIDS Deployment Documentation
https://docs.rapids.ai/deployment/stable/

Second pass at NGC + VertexAI Integration #191

Open betatim opened 1 year ago

betatim commented 1 year ago

Reduce the number of clicks to launch on VertexAI from NGC.

betatim commented 1 year ago

I clicked my way through from NGC to a Vertex notebook. The custom kernel takes minutes to appear. At first it wasn't clear to me why there was no RAPIDS kernel, or that after the notebook UI had loaded I had to wait for an additional kernel to appear. I think this is a bit odd, and most people won't expect it.

Once I had the custom RAPIDS kernel, opening a notebook that used it was quite quick (cf. the "takes 8min to spin up" comment above). However, when I tried to execute import cudf in the notebook, I got an error message:

/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/utils/gpu_utils.py:62: UserWarning: Failed to dlopen libcuda.so
  warnings.warn(str(e))

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
/opt/conda/envs/rapids/lib/python3.10/site-packages/cupy/__init__.py in <module>
     17 try:
---> 18     from cupy import _core  # NOQA
     19 except ImportError as exc:

/opt/conda/envs/rapids/lib/python3.10/site-packages/cupy/_core/__init__.py in <module>
      2 
----> 3 from cupy._core import core  # NOQA
      4 from cupy._core import fusion  # NOQA

cupy/_core/core.pyx in init cupy._core.core()

/opt/conda/envs/rapids/lib/python3.10/site-packages/cupy/cuda/__init__.py in <module>
      7 from cupy._environment import get_hipcc_path  # NOQA
----> 8 from cupy.cuda import compiler  # NOQA
      9 from cupy.cuda import device  # NOQA

/opt/conda/envs/rapids/lib/python3.10/site-packages/cupy/cuda/compiler.py in <module>
     13 from cupy.cuda import device
---> 14 from cupy.cuda import function
     15 from cupy.cuda import get_rocm_path

cupy/cuda/function.pyx in init cupy.cuda.function()

cupy/_core/_carray.pyx in init cupy._core._carray()

cupy/_core/internal.pyx in init cupy._core.internal()

cupy/cuda/memory.pyx in init cupy.cuda.memory()

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

The above exception was the direct cause of the following exception:

ImportError                               Traceback (most recent call last)
/tmp/ipykernel_7/619004098.py in <module>
----> 1 import cudf

/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/__init__.py in <module>
      5 validate_setup()
      6 
----> 7 import cupy
      8 from numba import config as numba_config, cuda
      9 

/opt/conda/envs/rapids/lib/python3.10/site-packages/cupy/__init__.py in <module>
     18     from cupy import _core  # NOQA
     19 except ImportError as exc:
---> 20     raise ImportError(f'''
     21 ================================================================
     22 {_environment._diagnose_import_error()}

ImportError: 
================================================================
Failed to import CuPy.

If you installed CuPy via wheels (cupy-cudaXXX or cupy-rocm-X-X), make sure that the package matches with the version of CUDA or ROCm installed.

On Linux, you may need to set LD_LIBRARY_PATH environment variable depending on how you installed CUDA/ROCm.
On Windows, try setting CUDA_PATH environment variable.

Check the Installation Guide for details:
  https://docs.cupy.dev/en/latest/install.html

Original error:
  ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
================================================================

The shared library exists at /usr/local/cuda-11.2/compat/libcuda.so.1, but LD_LIBRARY_PATH is /usr/local/nvidia/lib:/usr/local/nvidia/lib64, which does not include that directory.

The Docker image used was nvcr.io/nvidia/rapidsai/rapidsai:cuda11.2-runtime-centos7-py3.10
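
For anyone trying to reproduce this, here is a rough sketch of the check described above, runnable from a notebook cell. It lists the LD_LIBRARY_PATH entries, reports which of them contain libcuda.so.1, and tries to open the library through the loader. The helper names (split_search_path, find_library, can_dlopen, report) are made up for illustration and are not part of RAPIDS or CuPy. The sketch only diagnoses; whether adding the compat directory to the search path is the right fix is not established in this thread.

```python
import ctypes
import os
from pathlib import Path

LIB_NAME = "libcuda.so.1"


def split_search_path(value):
    """Split an LD_LIBRARY_PATH-style string into its non-empty entries."""
    return [entry for entry in value.split(":") if entry]


def find_library(directories, name=LIB_NAME):
    """Return the entries of `directories` that contain a file called `name`."""
    return [d for d in directories if (Path(d) / name).exists()]


def can_dlopen(name=LIB_NAME):
    """Report whether the dynamic loader can open `name` in this process."""
    try:
        ctypes.CDLL(name)
    except OSError:
        return False
    return True


def report(extra_dirs=()):
    """Print where LIB_NAME is visible; `extra_dirs` are additional places to look."""
    searched = split_search_path(os.environ.get("LD_LIBRARY_PATH", ""))
    print("LD_LIBRARY_PATH entries:", searched)
    print("entries containing", LIB_NAME, ":", find_library(searched))
    print("extra dirs containing", LIB_NAME, ":", find_library(extra_dirs))
    print("loader can open", LIB_NAME, ":", can_dlopen())
```

Calling report(["/usr/local/cuda-11.2/compat"]) in the affected kernel should show whether the library is only present in the compat directory, as observed above.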

betatim commented 1 year ago

Weirdly enough, when I look at the "managed notebooks" tab in the Google Cloud UI it tells me that the notebook doesn't have a GPU.

jacobtomlinson commented 1 year ago

Thanks @betatim!

I ran through myself and found similar things. Here are the steps I followed:

Then I ran through a second time (partly to refresh my memory and write the list above) and the process felt a little different.
