premAI-io / prem-operator

📡 Deploy AI models and apps to Kubernetes without developing a hernia

Redesign CI #9

Open · richiejp opened this issue 2 months ago

richiejp commented 2 months ago

There are a number of issues to break out here.

They may all share the same solution, or they may not; this needs to be investigated.

richiejp commented 2 months ago

Using GitHub's GPU beta runners, the cost per minute is $0.07 and our e2e tests used to take 40 minutes, so that is $2.80 per job, or $4.20 per hour. These machines are also based on Tesla T4s, which are quite old now, although that is perhaps a good thing. Other CI providers have similar or greater costs. For comparison, an A16 machine on Vultr is $0.50 per hour, although we may have to manage creating and destroying instances ourselves.

That 40-minute figure is also for CPU-only runs; once we bring all the GPU dependencies into the mix, a full run could take longer.
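
For reference, a back-of-the-envelope version of the comparison above, using only the rates quoted in this thread (they will drift over time):

```go
package main

import "fmt"

func main() {
	const (
		githubPerMinute = 0.07 // GitHub GPU beta runner, $/minute
		e2eMinutes      = 40.0 // rough e2e test duration, CPU-only
		vultrA16PerHour = 0.50 // Vultr A16 instance, $/hour
	)

	perJob := githubPerMinute * e2eMinutes // cost of one e2e run on GitHub
	perHour := githubPerMinute * 60        // hourly rate on GitHub

	fmt.Printf("GitHub GPU runner: $%.2f per %.0f-minute job, $%.2f per hour\n", perJob, e2eMinutes, perHour)
	fmt.Printf("Vultr A16:         $%.2f per hour (plus managing instance lifecycle)\n", vultrA16PerHour)
}
```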

richiejp commented 2 months ago

A V100 on DataCrunch (on demand) is $0.88 per hour.

richiejp commented 2 months ago

One option would be to take a QEMU VM snapshot of a running k3s cluster with the NVIDIA operator installed, then load that snapshot at the start of each test run and do the install etc. (see the sketch below).

Problems:

Pros:
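
A minimal sketch of the restore step, assuming the VM was started with `-monitor unix:/tmp/k3s-vm.mon,server,nowait` and that a `savevm` snapshot named `k3s-nvidia-ready` already exists; both the socket path and the snapshot name are placeholders, not anything we have set up yet:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
	"time"
)

func main() {
	// Connect to the QEMU human monitor socket of the test VM.
	conn, err := net.DialTimeout("unix", "/tmp/k3s-vm.mon", 5*time.Second)
	if err != nil {
		log.Fatalf("connect to QEMU monitor: %v", err)
	}
	defer conn.Close()

	// Ask QEMU to restore the snapshot that has k3s + the NVIDIA operator ready.
	if _, err := fmt.Fprintf(conn, "loadvm k3s-nvidia-ready\n"); err != nil {
		log.Fatalf("send loadvm: %v", err)
	}

	// Echo whatever the monitor prints so failures (e.g. missing snapshot)
	// show up in the CI logs; stop reading after a short deadline.
	conn.SetReadDeadline(time.Now().Add(30 * time.Second))
	scanner := bufio.NewScanner(conn)
	for scanner.Scan() {
		log.Println(scanner.Text())
	}
}
```

After the restore, the CI job would install the operator build under test into the already-running cluster and run the e2e suite against it.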

richiejp commented 2 months ago

It seems there is only limited overlap between GPUs that support virtualization/passthrough and GPUs supported by the operator; it is pretty much limited to data center GPUs with passive cooling.
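
As a quick local check on the passthrough side of that overlap, listing the host's IOMMU groups shows whether a GPU can even be handed to a VM cleanly (a GPU that shares a group with other devices is awkward to pass through). A small sketch that just walks the standard Linux sysfs layout; illustrative only, not part of the operator:

```go
package main

import (
	"fmt"
	"log"
	"path/filepath"
)

func main() {
	// Each entry is /sys/kernel/iommu_groups/<group>/devices/<pci-address>.
	devs, err := filepath.Glob("/sys/kernel/iommu_groups/*/devices/*")
	if err != nil {
		log.Fatal(err)
	}
	if len(devs) == 0 {
		fmt.Println("no IOMMU groups found: the IOMMU is probably disabled, so PCI passthrough will not work")
		return
	}
	for _, dev := range devs {
		group := filepath.Base(filepath.Dir(filepath.Dir(dev)))
		fmt.Printf("IOMMU group %s: %s\n", group, filepath.Base(dev))
	}
}
```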