pangeo-data / pangeo-cloud-federation

Deployment automation for Pangeo JupyterHubs on AWS, Google, and Azure
https://pangeo.io/cloud.html

GPUs on AWS Hubs #490

Open scottyhq opened 4 years ago

scottyhq commented 4 years ago

@jhamman, @yuvipanda, and I set up a test GPU nodegroup (p2.xlarge, g3s.xlarge instances) on staging.aws-uswest2.pangeo.io last week. It was less straightforward than we hoped, but it seems to be working now! Thanks to the useful comments here https://github.com/pangeo-data/pangeo-stacks/pull/83 and here https://github.com/pangeo-data/pangeo-cloud-federation/issues/425. Note this currently gives the user notebook access to a single GPU.

The key hang-up in deploying this was that, in order to mount libcuda.so from the AWS EKS GPU AMI (https://docs.aws.amazon.com/eks/latest/userguide/gpu-ami.html) so that the jupyter-user pod can use CUDA, the Docker image or pod definition has to set the environment variable NVIDIA_DRIVER_CAPABILITIES='compute,utility'. Thanks @bgroenks96 for pointing this out in https://github.com/pangeo-data/pangeo-stacks/pull/83#issuecomment-562959760. This can't be done in a repo2docker start script, but it can be done with the following kubespawner setting:

            {
                'display_name': 'Pangeo ML Env (experimental)',
                'kubespawner_override': {
                    'mem_limit': '60G',
                    'mem_guarantee': '25G',
                    'image': 'scottyhq/jupyter-cupy:latest',
                    'environment': {'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility'},
                    'tolerations': [{'key': 'nvidia.com/gpu','operator': 'Equal','value': 'present','effect': 'NoSchedule'}],
                    'extra_resource_limits': {"nvidia.com/gpu": "1"}
                }
            },
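For anyone reproducing this, here is a minimal sketch of where an entry like that sits in the hub configuration. This is an illustrative jupyterhub_config.py fragment, not the actual deployment config (which lives in the per-hub Helm values); the default profile name and image are assumptions:

    # Illustrative only: the default profile below is an assumption, not the
    # real aws-uswest2 config. The GPU entry is the one posted above.
    c.JupyterHub.spawner_class = 'kubespawner.KubeSpawner'
    c.KubeSpawner.profile_list = [
        {
            'display_name': 'Pangeo Std Env',   # hypothetical default profile
            'default': True,
            'kubespawner_override': {'image': 'pangeo/pangeo-notebook:latest'},
        },
        # ... the 'Pangeo ML Env (experimental)' entry from above goes here ...
    ]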

nvidia-smi output

Mon Dec 23 06:06:44 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   41C    P0    38W / 150W |    106MiB /  7618MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

pinging @jsadler2 and @rsignell-usgs to kick the tires

rsignell-usgs commented 4 years ago

@scottyhq, I'd like to test out cuspatial on this GPU nodegroup, and it seems my PR to build a new container with cuSpatial worked, as I see it at https://hub.docker.com/r/scottyhq/jupyter-cupy.

When I selected "Pangeo ML Env (experimental)" on https://staging.aws-uswest2.pangeo.io, however, my environment didn't contain cuspatial, so it seems either it's not using the new container or the new container doesn't contain cuspatial.

Or am I selecting the wrong spawner option? I tried grepping https://github.com/pangeo-data/pangeo-cloud-federation/tree/staging/deployments for "Pangeo ML Env" and didn't find it, so quite possibly I'm just not understanding the workflow here.

scottyhq commented 4 years ago

@rsignell-usgs - my guess is you launched before the new 'latest' image was fully pushed to Docker Hub and therefore pulled the old image without cuspatial.

This is a drawback of using the 'latest' tag. Try again. When you select the Pangeo ML image you should at some point see 2019-12-31 21:17:52+00:00 [Normal] Pulling image "scottyhq/jupyter-cupy:latest" as the notebook is launching. I just tried this and successfully ran import cuspatial.
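As a side note, one way to sidestep the stale-'latest' problem (an untested suggestion on my part, not something we've changed on the hub) would be to pin a specific tag or force a fresh pull in the profile override, e.g.:

    # Hypothetical tweak to the profile entry above; the pinned tag is made up.
    'kubespawner_override': {
        'image': 'scottyhq/jupyter-cupy:2019.12.31',  # pin a tag instead of 'latest'
        'image_pull_policy': 'Always',                # or keep 'latest' but always re-pull
        # ... other overrides (memory, tolerations, GPU limit) unchanged ...
    },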

rsignell-usgs commented 4 years ago

@scottyhq , yes, working now!

jsadler2 commented 4 years ago

@scottyhq I was able to start up the Pangeo ML Env (experimental) image on staging.aws-uswest2.pangeo.io and run a TensorFlow tutorial model successfully! So that's awesome. Thanks for setting this up! One thing I couldn't do is run PyTorch. When I tried import torch I got a module-not-found error. Do you know why that would be?
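For anyone else kicking the tires, a quick way to confirm TensorFlow actually sees the GPU (assuming a TensorFlow 2.x build in the image) is something like:

    # Assumes TensorFlow 2.x with GPU support is installed in the image
    import tensorflow as tf
    print(tf.config.experimental.list_physical_devices('GPU'))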

scottyhq commented 4 years ago

@jsadler2 - I omitted pytorch in the test image. Is there a separate gpu-enabled pytorch out there, or does installing pytorch-cpu from conda-forge pick up a GPU if one exists? I was just skimming over https://github.com/conda-forge/pytorch-cpu-feedstock/issues/7 but didn't see an immediate answer.

Once https://github.com/pangeo-data/pangeo-stacks/pull/114 is merged, you can customize the image to your liking.

jsadler2 commented 4 years ago

@scottyhq - I haven't worked with PyTorch before, but on the Anaconda documentation (https://docs.anaconda.com/anaconda/user-guide/tasks/gpu-packages/) it says:

PyTorch detects GPU availability at run-time, so the user does not need to install a different package for GPU support.

... so no need for another package.
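So presumably, once pytorch is in the image, a check like this would confirm the GPU is picked up (assuming whatever build gets installed is actually CUDA-enabled):

    # Assumes a CUDA-enabled pytorch build ends up in the image
    import torch
    print(torch.cuda.is_available())      # True if the driver is visible to pytorch
    print(torch.cuda.get_device_name(0))  # e.g. 'Tesla M60' on the g3s.xlarge nodes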

jsadler2 commented 4 years ago

@scottyhq - not sure what you mean by

Once pangeo-data/pangeo-stacks#114 is merged, you can customize the image to your liking.

Did you actually mean https://github.com/pangeo-data/pangeo-stacks/pull/83? And for customization, I assume I will just change the requirements.txt file and then create a PR. Is that right? I just want to make sure I'm understanding the workflow here.

samuel-co commented 4 years ago

@scottyhq Do you maintain any codebase for the actual AWS infrastructure you have deployed to run GPUs on Pangeo? If not, could you provide some info about the GPU nodes you're running (AMI and k8s tags)? Thanks!

scottyhq commented 4 years ago

@samuel-co - check out this blog post @jsadler2 posted today! It contains links to all the relevant AWS GPU node setup: https://medium.com/pangeo/deep-learning-with-gpus-on-pangeo-9466e25bfd74