supabase / edge-runtime

A server based on Deno runtime, capable of running JavaScript, TypeScript, and WASM services.

feat: onnx runtime shared sessions #430

Closed kallebysantos closed 3 weeks ago

kallebysantos commented 3 weeks ago

This PR is an adapted part of the #368 work, which was closed due to a change in the proposal.

What kind of change does this PR introduce?

Feature, Enhancement

What is the current behavior?

Model sessions are eagerly evaluated and do not survive across worker life-cycles.

What is the new behavior?


This PR introduces shared-session logic along with other ort improvements, such as GPU support and optimizations.

πŸ‘· It's also a foundation for an owned onnx runtime that can be integrated directly with the huggingface/transformers.js library, allowing better inference without the need to couple models like gte-small into the edge-runtime image.

Tester docker image:

You can get a docker image of this PR from docker hub:

# default runtime
docker pull kallebysantos/edge-runtime:latest

# gpu with cuda provider
docker pull kallebysantos/edge-runtime:latest-cuda
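
To try it out, the image can be run like the upstream one. A minimal sketch (the port and service paths below are illustrative, following the upstream README pattern):

# serve functions from a local directory with the tester image
docker run -it --rm -p 9000:9000 \
  -v ./examples:/usr/services \
  kallebysantos/edge-runtime:latest start --main-service /usr/services/main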

Session lifecycle:

This PR introduces a lazy map of ort::Sessions, which means that sessions are loaded once and then shared between worker cycles.
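
In practice, from a worker's point of view, constructing a session for the same model in different workers now reuses the already-loaded ort::Session instead of reading the model from disk again. A minimal sketch, assuming the existing Supabase.ai.Session API and the bundled gte-small model:

// the first worker to construct this session pays the model-loading cost
const session = new Supabase.ai.Session('gte-small');

// later calls, even from other worker life-cycles, reuse the shared ort::Session
const embedding = await session.run('hello world', {
  mean_pool: true,
  normalize: true,
});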

Cleaning up sessions: Each ort::Session is attached to an Arc smart pointer and will only be dropped when no consumer is attached to it; but for that to happen, users must explicitly call the EdgeRuntime.ai.tryCleanupUnusedSession() method.

NOTE: This method is only available for the main worker

// cleanup unused sessions every 30s
setInterval(async () => {
  const { activeUserWorkersCount } = await EdgeRuntime.getRuntimeMetrics();
  if (activeUserWorkersCount > 0) {
    return;
  }
  try {
    const cleanupCount = await EdgeRuntime.ai.tryCleanupUnusedSession();
  if (cleanupCount === 0) {
      return;
    }
    console.log('EdgeRuntime.ai.tryCleanupUnusedSession', cleanupCount);
  } catch (e) {
    console.error(e.toString());
  }
}, 30 * 1000);

GPU Support:

The GPU support allows session inference on specialized hardware and is backed by CUDA. There is no configuration required from the end user; just call the Session for gte-small as usual. But in order to enable GPU inference, the Dockerfile now has two main build stages (which must be specified during docker build):

edge-runtime (CPU only): This stage builds the default edge-runtime, where ort::Sessions are loaded using the CPU.

docker build --target "edge-runtime" .

Resulting image is ~150 MB

edge-runtime-cuda (GPU/CPU): This stage builds the default edge-runtime on an nvidia/cuda machine, which allows loading using GPU or CPU (as fallback).

docker build --target "edge-runtime-cuda" .

Resulting image is ~2.20 GB

Each stage needs to install the appropriate onnx-runtime. To that end, install_onnx.sh has been updated with a 4th parameter, the --gpu flag, which downloads a CUDA version from the official microsoft/onnxruntime repository.
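
For illustration, an invocation could look like the following (the positional arguments below are placeholders for the script's existing version, platform, and output parameters; only the --gpu flag is new in this PR):

# download the CUDA-enabled onnxruntime instead of the CPU build
./scripts/install_onnx.sh 1.17.0 linux-x64 ./onnxruntime --gpu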

Using GPU image:

In order to use the GPU image, the docker-compose file must include the following properties for the functions service:

services:
  functions:
    # Built as described before
    image: supabase/edge-runtime:latest-cuda
    # or directly by compose
    build:
      context: .
      dockerfile: Dockerfile
      target: edge-runtime-cuda

    # Required to use the GPU inside the container
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1 # Change here if more devices are installed
              capabilities: [gpu]

IMPORTANT NOTE: The target infrastructure must be prepared with the NVIDIA Container Toolkit to allow GPU support inside Docker.
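
For example, on a Debian/Ubuntu host with the NVIDIA package repository already configured, the toolkit can be set up like this (standard NVIDIA Container Toolkit steps, not specific to this PR):

# install the toolkit and register the nvidia runtime with docker
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker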


Final considerations:

As I described before, this is adapted work from #368, where we split out only the core features that improve ort support for edge-runtime.

Finally, thanks to @nyannyacha, who helped me a lot πŸ™

nyannyacha commented 3 weeks ago

πŸ‘· It's also a foundation for an owned onnx runtime that can be integrated directly with the https://github.com/huggingface/transformers.js/pull/947 library, allowing better inference without the need to couple models like gte-small into the edge-runtime image.

If so, are you also preparing another PR for after this one is merged? Overall, the PR looks good. 😁

nyannyacha commented 3 weeks ago

Anyway, I'll be testing this locally with k6 soon. If there are any issues, I'll let you know. πŸ˜‹