pangeo-data / jupyter-earth

Jupyter meets the Earth: combining research use cases in geosciences with technical developments within the Jupyter and Pangeo ecosystems.
https://jupytearth.org
Creative Commons Zero v1.0 Universal

Set up GPU instances on the jupytearth hub #77

Open fperez opened 3 years ago

fperez commented 3 years ago

@consideRatio - I'm not sure if you've looked yet into the details of setting up GPU instances for our hub. We have some workloads that are starting to need GPUs (cc @espg), and it would be great to have a couple of options to play with.

Would you mind taking a look at the options? We can document here our process/choices... Thanks!

consideRatio commented 3 years ago

@fperez will do!

To avoid adding more complexity than needed, it would be good to do some guesswork about what kinds of machines with GPUs could make sense for @espg's relevant workflows. That can be a starting point to generalize from.

  1. Should normal user servers and/or dask worker servers have GPU?
  2. Should nodes have one or more GPUs?
  3. Is it important to have a lot of CPU/RAM or similar on the node with GPU(s)?

fperez commented 3 years ago

  1. We don't need GPUs on all generic nodes - that gets expensive. Just an option with a GPU in the starting menu, which we'll choose only when actually needed. We can also skip worrying about GPUs on Dask workers until we're certain we have both the single-node GPU story and the Dask story well-oiled for our workloads.
  2. It's fine to do just one GPU for now.
  3. I'll let @espg respond here - I'm not sure if he has any preference/idea of the needed mix of CPU and RAM capabilities compared to GPU ones.

espg commented 3 years ago

@consideRatio @fperez sorry for the delay on this; I wanted to do some empirical profiling before getting back with an answer, but I'm getting gummed up with some of the libraries, so I'll try to scope out the conceptual layout of the pipeline and what I think would be a good start.

The GPUs are just convolving kernels on images, so GPU size (in RAM) is effectively matched to the image size. For now, to start, images are small (~15 MB), but there are quite a lot of them -- in general tens of thousands per 'day', with days running into the hundreds. The convolutions are effectively instant on all the GPUs I've used before, with the largest latency coming from (a) loading the images from the internet to disk, (b) loading them from disk to memory, and (c) moving them to and from the GPU to run the convolution.
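For concreteness, the per-image GPU step is roughly the following (a minimal Python/cupy sketch; the kernel and image contents are placeholders, and a 2048x2048 float32 array is used only to roughly match the ~15 MB figure above):

    import cupy as cp
    from cupyx.scipy import ndimage

    image = cp.random.random((2048, 2048), dtype=cp.float32)  # ~16 MB on the GPU
    kernel = cp.ones((5, 5), dtype=cp.float32) / 25.0          # placeholder box filter

    # The convolution itself is fast; most of the wall time goes to getting the
    # image onto and off the GPU, as noted above.
    result = ndimage.convolve(image, kernel, mode="nearest")
    cp.cuda.Stream.null.synchronize()  # wait for the kernel to finish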

I think a good place to start would be with the G4dn single-GPU VM instances. I don't have a good idea of how many CPUs are needed to saturate the GPU; we want lots of workers hitting a single GPU... but I'm not sure if 'a lot' is 16 workers or 64. For non-GPU RAM, I expect scaling to follow the number of workers, so any of the G4 instances will work.

Is it possible to set up more than one instance type, if they're all in the same 'family' of instance, to check on the scaling? I think the sweet spot is either on the low end around 8 or 16 cores, or the high end with 64 cores. To know the scaling, though, I'd have to pre-load an image set into a directory and see how many workers it takes to 'saturate' the GPU.
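One rough way to run that saturation test, assuming a directory of pre-loaded images saved as .npy files (the path, kernel, and worker counts below are placeholders): sweep the number of concurrent workers over the same batch and see where throughput stops improving.

    import glob
    import time
    from concurrent.futures import ThreadPoolExecutor

    import cupy as cp
    import numpy as np
    from cupyx.scipy import ndimage

    files = sorted(glob.glob("/tmp/preloaded-images/*.npy"))  # hypothetical pre-loaded set
    kernel = cp.ones((5, 5), dtype=cp.float32) / 25.0

    def process(path):
        img = cp.asarray(np.load(path))        # disk -> host -> GPU
        out = ndimage.convolve(img, kernel)    # the GPU work
        return float(cp.asnumpy(out).mean())   # GPU -> host, forces completion

    for n_workers in (8, 16, 32, 64):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            list(pool.map(process, files))
        elapsed = time.perf_counter() - start
        print(f"{n_workers:>3} workers: {len(files) / elapsed:.1f} images/s")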

The software side of this is pretty basic -- just cupy and dependencies (a CUDA-friendly version of numba for JIT'ing is probably in there too).

consideRatio commented 3 years ago

Thanks for the input on this!

I've made some progress on this, but we need to wait for AWS to increase the allowed quota so that we can use GPUs. I'm exposing 4 CPU / 16 GB and 16 CPU / 64 GB nodes with a single T4 Tensor Core GPU attached.

Note that if dask is to be involved in using this GPU, a LocalCluster should be used, because:

  1. We don't have dask worker nodes with GPUs available for use
  2. It is not possible* to share GPUs between k8s Pods

*Well, it is, but it's a hack that I think would be unreliable, and I would advise against us using it.

Also, when the GPU is used by multiple workers in a local dask cluster, they may need self-imposed restrictions so they don't crash each other by hogging too much GPU memory and running out of it.
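As a sketch (not tested on the hub), a single-node setup along those lines could look roughly like this; the worker counts and memory numbers are placeholders sized for a 16 CPU / 64 GB node with one 16 GB T4:

    import cupy as cp
    from dask.distributed import Client, LocalCluster

    # One machine, several worker processes, all sharing the node's single GPU.
    cluster = LocalCluster(n_workers=4, threads_per_worker=4, memory_limit="12GB")  # host RAM per worker
    client = Client(cluster)

    def cap_gpu_memory():
        # Self-imposed cap so no single worker can hog the whole 16 GB T4.
        cp.get_default_memory_pool().set_limit(size=3 * 1024**3)  # ~3 GB per worker process

    client.run(cap_gpu_memory)  # apply the cap inside every worker process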


The case opened with AWS can be tracked here: https://console.aws.amazon.com/support/home#/case/?displayId=8818400181

espg commented 3 years ago

@consideRatio ok, thanks for getting this rolling! I can see the new instance types on the hub, but they aren't able to start yet. I'm assuming that's what the case reference you posted is about... Perhaps a basic question, but do you know how to find my IAM username/password for the hub so I can check out that link? I can see the account number and what looks like a possible username if I run echo $AWS_ROLE_ARN inside the hub, but I can't quite figure out how to translate that into a sign-in, and using a separate AWS account won't let me see the issue either...

fperez commented 3 years ago

Shane, I think I need to change things in your AWS privileges for you to see those messages... I get a bit lost there, so we can do this later. Happy to give you access - it's just that each time I do it, it takes me some time navigating the AWS console...

But the message from AWS just says "we got your request, looking into it, will let you know".

consideRatio commented 3 years ago

@espg there are three accounts involved:

  1. Your JupyterHub account on the hub (mirrors GitHub)
  2. The common AWS S3 storage credentials that everyone has access to when working in a Jupyter server
  3. AWS admin access, where I can for example monitor the cloud resources used, etc.

Nobody besides me should need direct AWS access, since I've already done the work of setting up things like the k8s cluster and an s3 bucket - unless there is a need to manage some resources like that directly. Let me know if you want access from a learning perspective or similar, though, and I can grant it.


Btw @espg, I've already prepared by installing cupy and cudatoolkit in our image.

https://github.com/pangeo-data/jupyter-earth/blob/b77bbb9c20433057900f0421dc84a609197b98ef/hub.jupytearth.org-image/Dockerfile#L217-L218

consideRatio commented 2 years ago

@espg I think this is resolved on my end - you should be able to start a server with a GPU attached. Have you tried it out to see if it works for you?
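A quick sanity check from a notebook on one of the GPU server options could be something like this (assuming the cupy installed in our image):

    import cupy as cp

    # Should report one device (the T4) and run a trivial kernel on it.
    print(cp.cuda.runtime.getDeviceCount())   # expect 1
    print(cp.arange(10).sum())                # forces an actual kernel launch on the GPU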

consideRatio commented 2 years ago

@espg reported it did not work even though it worked for me - this was likely because a configuration I made had been reset. I think I know why it was reset; I've now made sure it's configured correctly again, and in the future I will make sure to avoid having it reset.

Technical notes

      # the tolerations I manually add via
      # kubectl edit daemonset -n kube-system nvidia-device-plugin-daemonset
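      # This toleration lets the NVIDIA device plugin pods schedule onto the GPU user
      # nodes, which are tainted with hub.jupyter.org_dedicated=user:NoSchedule, so
      # the GPU gets advertised to Kubernetes on those nodes.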
      - effect: NoSchedule
        key: hub.jupyter.org_dedicated
        operator: Equal
        value: user