scottyhq opened this issue 4 years ago
@scottyhq, I'd like to test out cuspatial on this GPU nodegroup, and it seems my PR to build a new container with cuSpatial worked, as I see it at https://hub.docker.com/r/scottyhq/jupyter-cupy. When I selected "Pangeo ML Env (experimental)" on https://staging.aws-uswest2.pangeo.io, however, my environment didn't contain cuspatial, so it seems either the hub isn't using the new container, or the new container doesn't contain cuspatial.

Or am I selecting the wrong spawner option? I tried grepping https://github.com/pangeo-data/pangeo-cloud-federation/tree/staging/deployments for "Pangeo ML Env" and didn't find it, so quite possibly I'm just not understanding the workflow here.
@rsignell-usgs - my guess is you launched before the new 'latest' image was fully pushed to Docker Hub, and therefore pulled the old image without cuspatial. This is a known problem with using the 'latest' tag. Try again: when you select the Pangeo ML image you should at some point see

2019-12-31 21:17:52+00:00 [Normal] Pulling image "scottyhq/jupyter-cupy:latest"

as the notebook is launching. I just tried this and successfully ran `import cuspatial`.
@scottyhq, yes, working now!
@scottyhq I was able to start up the "Pangeo ML Env (experimental)" image on staging.aws-uswest2.pangeo.io and run a TensorFlow tutorial model successfully! So that's awesome. Thanks for setting this up!

One thing I couldn't do is run PyTorch. When I tried `import torch` it said module not found. Do you know why that would be?
@jsadler2 - I omitted pytorch from the test image. Is there a separate GPU-enabled pytorch out there, or does installing pytorch-cpu from conda-forge pick up a GPU if one exists? I was just skimming https://github.com/conda-forge/pytorch-cpu-feedstock/issues/7 but didn't see an immediate answer.

Once https://github.com/pangeo-data/pangeo-stacks/pull/114 is merged, you can customize the image to your liking.
@scottyhq - I haven't worked with PyTorch before, but the Anaconda documentation (https://docs.anaconda.com/anaconda/user-guide/tasks/gpu-packages/) says:

> PyTorch detects GPU availability at run-time, so the user does not need to install a different package for GPU support.

... so no need for another package.
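As a quick sanity check for the run-time detection described above, something like this should work in the notebook (a sketch; `torch.cuda.is_available()` is PyTorch's standard run-time GPU probe, and the `try/except` guards against the case where pytorch isn't in the image at all):

```python
# Probe PyTorch's run-time GPU detection. Falls back gracefully
# whether the GPU, or even PyTorch itself, is missing from the image.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"  # torch not installed, as in the current test image

print(f"PyTorch would use device: {device}")
```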
@scottyhq - not sure what you mean by

> Once pangeo-data/pangeo-stacks#114 is merged, you can customize the image to your liking.

Did you actually mean https://github.com/pangeo-data/pangeo-stacks/pull/83? For customization, I assume I will just change the requirements.txt file and then create a PR. Is that right? I just want to make sure I am understanding this correctly.
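If that is the right workflow, the change could be as small as adding one line (hypothetical file contents; the actual file path and naming in the pangeo-stacks repo may differ):

```
# requirements.txt -- append the package to install into the image
torch
```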
@scottyhq Do you maintain any codebase for the actual AWS infrastructure you have deployed to run GPUs on Pangeo? Or if not, could you provide some info about the GPU nodes you're running (AMI and k8s tags)? Thanks!
@samuel-co - check out this blog post @jsadler2 published today! It contains links to all the relevant AWS GPU node setup: https://medium.com/pangeo/deep-learning-with-gpus-on-pangeo-9466e25bfd74
@jhamman, @yuvipanda, and I set up a test GPU nodegroup (p2.xlarge and g3s.xlarge instances) on staging.aws-uswest2.pangeo.io last week. It was less straightforward than we hoped, but it seems to be working now! Thanks to the useful comments here https://github.com/pangeo-data/pangeo-stacks/pull/83 and here https://github.com/pangeo-data/pangeo-cloud-federation/issues/425. Note this currently gives the user notebook access to a single GPU.
The key hang-up on deploying this was that in order to mount libcuda.so from the AWS EKS GPU AMI (https://docs.aws.amazon.com/eks/latest/userguide/gpu-ami.html) so that the jupyter-user pod can use CUDA, you have to set an environment variable in the Docker image or pod definition (NVIDIA_DRIVER_CAPABILITIES: 'compute,utility'). Thanks @bgroenks96 for pointing this out in https://github.com/pangeo-data/pangeo-stacks/pull/83#issuecomment-562959760. This can't be done in a repo2docker start script, but it can be done with a kubespawner setting.

nvidia-smi output (collapsed in the original)

Pinging @jsadler2 and @rsignell-usgs to kick the tires.
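The exact kubespawner snippet didn't survive in this thread, but it was presumably along these lines (a sketch, assuming KubeSpawner's `environment` config option in `jupyterhub_config.py`, not the literal setting from the deployment):

```python
# jupyterhub_config.py (sketch -- the deployment's exact snippet was
# elided above). KubeSpawner's `environment` option injects environment
# variables into each spawned user pod; NVIDIA_DRIVER_CAPABILITIES must
# be set there so the EKS GPU AMI mounts libcuda.so into the container.
c.KubeSpawner.environment = {
    "NVIDIA_DRIVER_CAPABILITIES": "compute,utility",
}
```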