operate-first / support

This repo should serve as a central source for users to raise issues/questions/requests for Operate First.
GNU General Public License v3.0

Can I provision a namespace on one of the GPU enabled clusters to deploy a VM? #539

Closed MichaelClifford closed 2 years ago

MichaelClifford commented 2 years ago

Description

I am working on a demo for summit, and one of the components we need is a self-driving car simulator. We currently have a VM set up on Smaug in the train-model namespace. However, due to the graphics requirements of the simulator, it does not work in that cluster. In the meantime (and to confirm the need for GPU hardware), we used an AWS VM. Now we would like to have this simulator running in an Operate First managed cluster.

Due to the GPU requirements, I believe this would involve deploying to one of the OSC clusters.

What is the best way to get a GPU-enabled VM on an Operate First managed cluster? Thanks!

MichaelClifford commented 2 years ago

cc @HumairAK @redmikhail

HumairAK commented 2 years ago

also @durandom

redmikhail commented 2 years ago

@MichaelClifford The OSC clusters currently don't have KubeVirt installed, and since they are supposed to be a template for the OSC platform, we may want to avoid adding platform components (especially ones as complex as KubeVirt) that are not directly needed by the OSC community. Considering that KubeVirt was primarily targeted at bare-metal installations, EC2 VMs in AWS may also not be the best-performing target for the platform (we would be running virtualization on top of virtualization). A more important issue, however, is that we currently use the NVIDIA GPU Operator to allow access to GPU resources from notebook containers, and it is not clear how the KubeVirt GPU Device Plugin, which is required for GPU passthrough in VMs, will interact with the GPU Operator. Considering the potential complexity and untested nature of this setup, I would advise against trying it on the existing OSC clusters and would instead consider spinning up a separate cluster where we can test this in isolation. We can approach the OSC community and see if they are willing to allow the use of additional AWS resources. I believe this test could also be beneficial for OSC.
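
For context, KubeVirt expresses a GPU passthrough request on the VM spec under `spec.template.spec.domain.devices.gpus`. Below is a minimal Python sketch that builds such a manifest as a dictionary. The VM name, namespace, and GPU resource name are hypothetical placeholders, and disks and volumes are omitted, so it is not deployable as-is. It also says nothing about how the device plugin would coexist with the GPU Operator, which is the untested part.

```python
import json


def gpu_vm_manifest(name, namespace, gpu_resource, memory="8Gi"):
    """Build a KubeVirt VirtualMachine manifest (as a dict) that requests one GPU.

    gpu_resource is the extended resource name advertised on the node by a
    GPU device plugin; the value used below is a placeholder, not a real one.
    """
    return {
        "apiVersion": "kubevirt.io/v1",
        "kind": "VirtualMachine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Keep the VM stopped until it is started explicitly.
            "running": False,
            "template": {
                "spec": {
                    "domain": {
                        "devices": {
                            # GPU passthrough request; disks are omitted here.
                            "gpus": [{"name": "gpu0", "deviceName": gpu_resource}],
                        },
                        "resources": {"requests": {"memory": memory}},
                    },
                },
            },
        },
    }


# Hypothetical names, for illustration only.
manifest = gpu_vm_manifest("car-simulator", "simulator-demo", "nvidia.com/EXAMPLE_GPU")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and reviewed before anyone applies it to a test cluster, once disks and volumes have been added.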

MichaelClifford commented 2 years ago

Closing for now. The OS-C subproject members decided this would not be an appropriate use case for either of the current OS-C clusters.

We can reopen this when other GPU-enabled clusters are available as part of the OPF cloud.