heroku / roadmap

This is the public roadmap for Salesforce Heroku services.

GPU Dyno Types on Heroku #202

Open elimchaysengSF opened 1 year ago

elimchaysengSF commented 1 year ago

By leveraging technologies like GPUs, Heroku can give customers faster and more efficient AI computing, all wrapped in the Heroku DX that delivers the "Heroku magic" experience while building and deploying LLM applications and other GPU use cases still to be determined. Heroku's original business value (letting customers focus on code, not infrastructure) could evolve to a new level with GPU dynos.

Let us know how you'd use a GPU Dyno type, and any specifics about the instances you'd like to see brought onto the Heroku platform.

joe-tyler commented 1 year ago

We use ffmpeg for video manipulation, and it can take advantage of the processing power of a GPU. Without GPUs available on Heroku, we will have to move this processing to AWS or Google Cloud.
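
For context, a rough sketch of the kind of GPU-accelerated transcode this would unlock, assuming the dyno ships an ffmpeg build compiled with CUDA/NVENC support; the file names are placeholders:

```python
import subprocess

# Hypothetical GPU transcode on a GPU dyno: decode with CUDA and encode
# with NVENC instead of the much slower CPU-only x264 path.
# Requires an ffmpeg build with NVENC/CUDA enabled.
subprocess.run(
    [
        "ffmpeg",
        "-hwaccel", "cuda",      # GPU-accelerated decode
        "-i", "input.mp4",
        "-c:v", "h264_nvenc",    # GPU-accelerated H.264 encode
        "output.mp4",
    ],
    check=True,
)
```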

batwood001 commented 1 year ago

My company urgently needs this. We are running GPUs on other clouds and will be moving all of our infrastructure off of Heroku entirely unless we can access GPU instances.

elimchaysengSF commented 1 year ago

@batwood001 - very interested in your use case and how you see it fitting into your current Heroku setup. Feel free to email me at elimchayseng@heroku.com to chat more, or reply here.

cagejsn-importal commented 2 months ago

Hi, commenting here to help the Heroku team understand a little about what a small, agile dev team needs in 2024.

We are a startup running on Heroku -- for the most part, I love the service. It does a great job of simplifying things so we can spend our time building product.

We encountered a use case where we need to run inference with a small LLM (perhaps there's a separate conversation about whether that was the right architectural decision, but for now just roll with me).

We already had a Python runtime dyno that runs 'jobs' and sends results back to our other main dyno (TypeScript) via BullMQ. To pull this off, we use the Heroku Data for Redis add-on.
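
To make the architecture concrete, here is a minimal sketch of the Python side of such a setup, assuming BullMQ's official Python client; the queue name and the run_inference helper are hypothetical stand-ins:

```python
import asyncio
import os
import signal

from bullmq import Worker  # pip install bullmq

def run_inference(prompt: str) -> str:
    # Placeholder for the llama-cpp-python call shown further down.
    return f"echo: {prompt}"

async def process(job, job_token):
    # The return value is stored on the job, where the TypeScript
    # dyno picks it up as the job's result.
    return run_inference(job.data["prompt"])

async def main():
    # REDIS_URL is set automatically by the Heroku Data for Redis add-on.
    worker = Worker("llm-jobs", process, {"connection": os.environ["REDIS_URL"]})

    # Keep the worker alive until the dyno receives SIGTERM on shutdown.
    stop = asyncio.Event()
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, stop.set)
    await stop.wait()
    await worker.close()

asyncio.run(main())
```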

Since we needed to do LLM inference on some sensitive customer data (which cannot be sent to OpenAI), we decided to add llama-cpp-python and grab a smallish model from Hugging Face. Voilà! Everything works... it's just SO SLOW (running on CPU only, without BLAS acceleration).
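
For illustration, a minimal sketch of that kind of llama-cpp-python setup, assuming a GGUF model file already downloaded from Hugging Face; the model path, prompt, and parameter values are placeholders:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="./models/small-model.Q4_K_M.gguf",  # any small GGUF model
    n_ctx=2048,
    n_gpu_layers=0,  # CPU-only today; on a GPU dyno with a CUDA-enabled
                     # build, -1 would offload every layer to the GPU
)

out = llm.create_completion(
    "Summarize this (sensitive) customer record: ...",
    max_tokens=64,
)
print(out["choices"][0]["text"])
```

With n_gpu_layers=0 this is exactly the slow CPU-only path described above; the same code would get dramatically faster once layers can be offloaded to a GPU.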

We are going to be moving our infra to AWS to make this particular use case faster; we took it as a sign that we were outgrowing Heroku.

Other fun details

jimkring commented 2 months ago

We have some simple, infrequent inference tasks that could work well on a small model, like classifying content as spam. It would be great to run those tasks against an OpenAI-compatible endpoint "locally" (at a standard Heroku hostname, or on a port accessible on localhost). Heroku could set OPENAI_API_KEY and OPENAI_BASE_URL automatically for my app and charge based on model and token use. This would save me a lot of headache, improve security, and remove the dependence on additional services/accounts.
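
From the app's side, that could look something like the sketch below, assuming the standard openai Python SDK and the hypothetical auto-set config vars described above; the model name is illustrative:

```python
import os

from openai import OpenAI  # pip install openai

# Hypothetical: Heroku would inject both config vars automatically
# when the GPU-backed inference endpoint is attached to the app.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ["OPENAI_BASE_URL"],  # e.g. a dyno-local endpoint
)

resp = client.chat.completions.create(
    model="small-classifier",  # illustrative model name
    messages=[{"role": "user", "content": "Is this comment spam? Answer yes or no: ..."}],
)
print(resp.choices[0].message.content)
```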