Second pass at NGC + VertexAI Integration

betatim commented 1 year ago

Reduce the number of clicks to launch on VertexAI from NGC.

- RAPIDS Kernel takes a while to spin up (see Activity Log in VertexAI)
- RAPIDS Kernel is not the default (import cudf fails in the default kernel; users need to select the RAPIDS kernel once it is available)
- Notebook takes about 8 minutes to spin up. Would be good to get this down.

betatim commented 1 year ago

Clicked my way through from NGC to a Vertex notebook. The custom kernel takes minutes to appear. At first it wasn't clear to me why there was no RAPIDS kernel, or that after the notebook UI loaded I had to wait for an additional kernel to appear. I think this is a bit weird and most people won't expect it.

Once I had the custom RAPIDS kernel, opening a notebook that used it was quite quick (cf. "takes 8min to spin up" comment above). However, when I tried to execute import cudf in the notebook I got an error message:

/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/utils/gpu_utils.py:62: UserWarning: Failed to dlopen libcuda.so
  warnings.warn(str(e))

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
/opt/conda/envs/rapids/lib/python3.10/site-packages/cupy/__init__.py in <module>
     17 try:
---> 18     from cupy import _core  # NOQA
     19 except ImportError as exc:

/opt/conda/envs/rapids/lib/python3.10/site-packages/cupy/_core/__init__.py in <module>
      2 
----> 3 from cupy._core import core  # NOQA
      4 from cupy._core import fusion  # NOQA

cupy/_core/core.pyx in init cupy._core.core()

/opt/conda/envs/rapids/lib/python3.10/site-packages/cupy/cuda/__init__.py in <module>
      7 from cupy._environment import get_hipcc_path  # NOQA
----> 8 from cupy.cuda import compiler  # NOQA
      9 from cupy.cuda import device  # NOQA

/opt/conda/envs/rapids/lib/python3.10/site-packages/cupy/cuda/compiler.py in <module>
     13 from cupy.cuda import device
---> 14 from cupy.cuda import function
     15 from cupy.cuda import get_rocm_path

cupy/cuda/function.pyx in init cupy.cuda.function()

cupy/_core/_carray.pyx in init cupy._core._carray()

cupy/_core/internal.pyx in init cupy._core.internal()

cupy/cuda/memory.pyx in init cupy.cuda.memory()

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

The above exception was the direct cause of the following exception:

ImportError                               Traceback (most recent call last)
/tmp/ipykernel_7/619004098.py in <module>
----> 1 import cudf

/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/__init__.py in <module>
      5 validate_setup()
      6 
----> 7 import cupy
      8 from numba import config as numba_config, cuda
      9 

/opt/conda/envs/rapids/lib/python3.10/site-packages/cupy/__init__.py in <module>
     18     from cupy import _core  # NOQA
     19 except ImportError as exc:
---> 20     raise ImportError(f'''
     21 ================================================================
     22 {_environment._diagnose_import_error()}

ImportError: 
================================================================
Failed to import CuPy.

If you installed CuPy via wheels (cupy-cudaXXX or cupy-rocm-X-X), make sure that the package matches with the version of CUDA or ROCm installed.

On Linux, you may need to set LD_LIBRARY_PATH environment variable depending on how you installed CUDA/ROCm.
On Windows, try setting CUDA_PATH environment variable.

Check the Installation Guide for details:
  https://docs.cupy.dev/en/latest/install.html

Original error:
  ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
================================================================

The shared library exists at /usr/local/cuda-11.2/compat/libcuda.so.1, but LD_LIBRARY_PATH is /usr/local/nvidia/lib:/usr/local/nvidia/lib64, which does not include that directory.

The docker image that was used is nvcr.io/nvidia/rapidsai/rapidsai:cuda11.2-runtime-centos7-py3.10
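
A rough way to reproduce the check above from inside the notebook is sketched below. It is only a diagnostic sketch, not part of the image or of RAPIDS: the helper names dirs_containing and can_dlopen are made up for illustration, and the compat path is simply the one reported above, so it may differ in other images.

```python
import ctypes
import os


def dirs_containing(libname, path_value):
    """Return the entries of a colon-separated search path that contain libname."""
    hits = []
    for entry in path_value.split(":"):
        if entry and os.path.isfile(os.path.join(entry, libname)):
            hits.append(entry)
    return hits


def can_dlopen(libname):
    """Return True if the dynamic loader can open libname, False otherwise."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    lib = "libcuda.so.1"
    ld_path = os.environ.get("LD_LIBRARY_PATH", "")
    print("LD_LIBRARY_PATH:", ld_path or "(unset)")
    print("LD_LIBRARY_PATH dirs containing", lib, ":", dirs_containing(lib, ld_path))
    # Assumption: compat directory as reported in this comment; adjust per image.
    compat = "/usr/local/cuda-11.2/compat"
    print("present in compat dir:", os.path.isfile(os.path.join(compat, lib)))
    print("dlopen succeeds:", can_dlopen(lib))
```

If the library shows up in the compat directory but not in any LD_LIBRARY_PATH entry, that matches the failure above. Note that changing os.environ["LD_LIBRARY_PATH"] inside an already running kernel generally does not change where the loader looks, so the variable would need to be set before the kernel process starts.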

betatim commented 1 year ago

Weirdly enough, when I look at the "managed notebooks" tab in the Google Cloud UI it tells me that the notebook doesn't have a GPU.

jacobtomlinson commented 1 year ago

Thanks @betatim!

I ran through myself and found similar things. Here are the steps I followed:

Logged into https://ngc.nvidia.com
Went to "Catalog" and searched "vertex"
Clicked the first result called "Vertex AI Workbench - Quick Deploy"
- I flailed about a bit here as there is no clear CTA on this page.
- After reading the page in full I realised I needed to click the "Entities" tab
Clicked on "RAPIDS" in the entities tab.
Clicked the "Deploy to VertexAI" CTA in the top right (and "Deploy" again on the popup).
Landed on a page asking me to follow the tutorial on the right of the page (tutorial didn't open)
- Instructions said refresh the page if tutorial doesn't open which I did and then the tutorial opened.
Clicked "Next" in the tutorial.
The main page populated with the create form (I stopped reading the tutorial at this point)
- I added my name to the autogenerated notebook name so folks know who owns it (maybe unnecessary because it shows who created it on the list of notebook servers)
- I set the permissions to single user
- I dropped down the "advanced" section to see what was in there
- I noticed "install GPU drivers" was unchecked so I checked it
- I changed the GPU to a T4
I clicked "create" which showed a loading bar for a while and then a link to open the notebook
Jupyter Lab opened and asked me to authenticate again with Google Cloud. This failed with a permissions error.
I reloaded the page and clicked authenticate again and it worked this time.
I landed in Jupyter Lab and also noticed that there wasn't a RAPIDS kernel and was confused about what to do.
I opened a terminal and ran nvidia-smi and saw the T4
I navigated back to the main page of Jupyter Lab and noticed a sidebar saying "Loading kernel from nvcr.io/nvidia/rapidsai/rapidsai:cuda11.2-runtime-centos7-py3.10"
Once this was finished the RAPIDS kernel appeared and I was able to import cudf.
I headed back to GCloud and deleted the notebook.
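
The nvidia-smi check from the terminal step above can also be done from a notebook cell, which is handy when the Cloud UI and the instance disagree about whether a GPU is attached. This is a minimal sketch: visible_gpus is a hypothetical helper name, and it assumes nvidia-smi is on the PATH of the process running the cell, which may not hold inside a container kernel.

```python
import shutil
import subprocess


def visible_gpus(executable="nvidia-smi"):
    """Return GPU names reported by nvidia-smi, or None if it is missing or fails."""
    path = shutil.which(executable)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (subprocess.SubprocessError, OSError):
        # Covers non-zero exit codes, timeouts, and failures to launch.
        return None
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("nvidia-smi not available or failed (driver may be missing)")
    else:
        print("GPUs visible:", gpus)
```

A None result would point at the driver or PATH rather than at RAPIDS itself, for example if the "install GPU drivers" box was left unchecked.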

Then I ran through a second time (partly to refresh my memory and write the list above) and the process felt a little different.

Logged into https://ngc.nvidia.com
Went to "Catalog" and searched "vertex"
Clicked the first result called "Vertex AI Workbench - Quick Deploy"
Clicked the "Entities" tab and then on "RAPIDS"
Clicked the "Deploy to VertexAI" CTA in the top right (and "Deploy" again on the popup).
I landed on a page showing existing notebook servers (I didn't see this before, perhaps related to the auth permissions thing that magically fixed itself?)
The tutorial list opened on the right but didn't open the tutorial automatically. No amount of refreshing got the list to show the right thing.
I noticed some radio buttons on the main window with "Select an existing managed notebook" selected
I changed it to "Create a new managed notebook" which got me back to the form that the tutorial got me to the first time
- I set the permissions to single user
- This time I noticed the advanced section had a red exclamation next to it so I opened it again
- The red exclamation was next to the unchecked "install GPU drivers" box so I checked it again
- I changed the GPU to a T4
I clicked "create" which showed a loading bar for a while and then a link to open the notebook
Jupyter Lab opened and asked me to authenticate again with Google Cloud.
This time I looked at the "Activity log" on the sidebar and saw that it said "Loading kernel from nvcr.io/nvidia/rapidsai/rapidsai:cuda11.2-runtime-centos7-py3.10" so I just waited
After ~10 mins the RAPIDS kernel popped up on the list and I was able to use RAPIDS

I headed back to GCloud and deleted the notebook.

rapidsai / deployment

Second pass at NGC + VertexAI Integration #191