mlcommons / modelbench

Run safety benchmarks against AI models and view detailed reports showing how well they performed.
https://mlcommons.org/ai-safety/
Apache License 2.0

Investigate proxy for clients to provision a model #458

Open rogthefrog opened 1 week ago

rogthefrog commented 1 week ago

Currently, clients need to handle resource creation themselves (e.g. "give me this Hugging Face resource so I can run my test"). This means they have to know the implementation details of the resource they need, and we can't easily manage resource allocation, queuing, turning resources off when not needed, etc.

Investigate a proxy that would take care of those things for clients: they would hit the proxy through a unified interface, and the proxy would handle the details.

wpietri commented 1 week ago

Thanks for looking into this! These are the approaches I think have been put forth so far.

  1. Something that is purely a resource management API. (A rough sketch of this follows the list.)
    • You call it to say, "I need X"; it spins one up and then makes sure it gets shut down in a timely fashion.
    • It collects resource usage data we can pull into something like Prometheus/Grafana.
    • Maybe it also has primitive quota management.
  2. A straightforward HTTP proxy that handles all API requests.
    • By default all requests get sent through untouched, so minimal code changes.
    • For specific vendors, we notice activities related to resources, spinning them up and down as needed.
    • Also for specific vendors, we manage resource sharing here. If multiple jobs are running, this shares the limited quota and/or number of GCP GPU machines in a fair fashion.
    • Exports much richer data to Prometheus/Grafana.
  3. A full LLM API wrapper.
    • Clients write to this API directly.
    • It abstracts the differences between model providers, so the client doesn't care which vendor is running their requested model.
    • Again, stats are collected.
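
Concretely, option 1 might look something like the sketch below. Everything here is illustrative (the `ResourceManager`/`Lease` names and the resource identifiers are made up); it only shows the acquire/release surface and the raw usage numbers a Prometheus exporter could read, with refcounting and queuing deliberately left out.

```python
# Illustrative sketch only -- names and behavior are assumptions, not a design.
import threading
import time
from dataclasses import dataclass, field


@dataclass
class Lease:
    resource: str  # e.g. "hf-endpoint/some-model" (hypothetical identifier)
    acquired_at: float = field(default_factory=time.monotonic)


class ResourceManager:
    """Option 1: clients say "I need X"; the manager spins it up, tracks usage,
    and makes sure it gets shut down in a timely fashion."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active: dict[str, Lease] = {}
        self.seconds_used: dict[str, float] = {}  # raw data an exporter could publish

    def acquire(self, resource: str) -> Lease:
        with self._lock:
            lease = self._active.get(resource)
            if lease is None:
                # A real implementation would call the vendor API here to
                # provision the endpoint/VM before returning.
                lease = Lease(resource)
                self._active[resource] = lease
            return lease

    def release(self, lease: Lease) -> None:
        with self._lock:
            active = self._active.pop(lease.resource, None)
            if active is not None:
                elapsed = time.monotonic() - active.acquired_at
                self.seconds_used[lease.resource] = (
                    self.seconds_used.get(lease.resource, 0.0) + elapsed
                )
                # ...and the vendor teardown call would go here.
```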

For me the main considerations are:

Glad to chat further, and I look forward to seeing what you come up with.

rogthefrog commented 1 week ago

Thanks for the summary! That covers what I was thinking as well.

A few thoughts:

wpietri commented 1 week ago

  • How often do users want to run a given model on more than one vendor?

I think we have four use cases, in rough priority order.

  1. Our team running official benchmarks and related practice/development tests
  2. SUT owners running practice tests against their own SUTs
    • via modellab and our servers
    • offline, via their own servers
  3. Workstream participants (and us) doing exploratory work with our tools
  4. Third parties who are trying things out, thinking about registering with us and becoming SUT owners, or wanting to validate our benchmarks

For the first case, I think switching will be rare. For the second, I think it'll be custom SUTs with the connection details controlled by configuration. For the third, I expect some swapping, e.g., moving around as model availability shifts, or comparing behavior for the same model across multiple servers. And the last doesn't matter, except to the extent we need to build with offline use cases in mind.

  • Given the high cost of running models and vendor availability limitations (e.g. our own vLLM instance isn't guaranteed to run if the region is out of GPU capacity), it feels important to be able to spin resources up only when needed and down as soon as we're done, so investing in instrumenting usage metrics feels useful.

Yes, absolutely. The evaluator set they're talking about could plausibly touch $1000/day, so we're going to have to be careful here. And since we're dealing with multiple services, it will be even easier for something to keep running somewhere.
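
To make "down as soon as we're done" concrete, here's a purely illustrative idle sweep; `shutdown_endpoint` and `IDLE_LIMIT` are placeholders rather than real hooks.

```python
# Sketch of an idle sweep; shutdown_endpoint() and IDLE_LIMIT are placeholders.
import time
from typing import Callable

IDLE_LIMIT = 10 * 60  # seconds of inactivity before teardown (assumed value)


def sweep_idle(last_used: dict[str, float], shutdown_endpoint: Callable[[str], None]) -> None:
    """last_used maps resource name -> monotonic timestamp of its last request."""
    now = time.monotonic()
    for resource, ts in list(last_used.items()):
        if now - ts > IDLE_LIMIT:
            shutdown_endpoint(resource)  # vendor-specific teardown goes here
            del last_used[resource]
```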

  • Letting users provide their own credentials so they can use their own vendor accounts feels worth it. Using MLC credentials for MLC runs is fine, but it won't just be MLC people forever.

Yeah, I think that will be especially important for the vendor case. We're going to want to give them an interface where they can configure SUTs at will, including endpoints and auth keys, plus something automated that tracks when those break so they can keep the connections working.
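
For illustration only, the owner-facing piece could be as small as an endpoint plus the name of an environment variable holding their key; the field names below are invented.

```python
# Hypothetical shape for owner-supplied SUT entries; field names are made up.
import os

SUTS = {
    "acme-model-v2": {
        "endpoint": "https://inference.example.com/v1/chat/completions",
        "api_key_env": "ACME_API_KEY",  # owner's own credential, read from their environment
    },
}


def auth_header(sut_name: str) -> dict[str, str]:
    entry = SUTS[sut_name]
    return {"Authorization": f"Bearer {os.environ[entry['api_key_env']]}"}
```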

  • A proxy exposes us to IP liability, as a user's data goes through us, which means we can intercept it, log it, etc. It also incurs runtime costs and complexity to manage its own scaling, fault tolerance, access keys, security, etc.

I was thinking of the proxy as only being for internal use, so I don't think the IP/security concerns apply much. To me an advantage of a proxy approach is that we can let SUTs be relatively vendor-specific, working normally for runs we don't control. But when we are doing runs, we specify a proxy to which we can gradually add accounting and quota management.
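
Sketching that shape (the URLs and the `MLC_PROXY_BASE` variable are assumptions, and the path just follows the common OpenAI-style convention): the SUT code stays vendor-specific, and only runs we control point it at the proxy.

```python
# Sketch: the same client code, with the base URL swapped only for runs we control.
import os

import requests

VENDOR_BASE = "https://api.together.xyz"       # normal, vendor-specific path
PROXY_BASE = os.environ.get("MLC_PROXY_BASE")  # hypothetical override, set only for our runs


def chat(payload: dict, api_key: str) -> dict:
    base = PROXY_BASE or VENDOR_BASE
    resp = requests.post(
        f"{base}/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()
```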

  • Quota and scheduling management: do the major vendors have APIs we can query to get information about current availability and usage?

I don't know if there are APIs; I expect we'd instead look at individual responses. Together, like some other services I've seen, gives back quota information in response headers. But we'd also want to look for HTTP responses that indicate being over quota and throttle back connections to the provider.
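
Something like this could cover both; the exact header names vary by provider, and the ones below are common conventions rather than anything confirmed.

```python
# Sketch of per-response quota handling; header names vary by provider.
import time

import requests


def post_with_backoff(url: str, headers: dict, payload: dict, max_tries: int = 5) -> requests.Response:
    for attempt in range(max_tries):
        resp = requests.post(url, headers=headers, json=payload, timeout=120)
        if resp.status_code != 429:
            # Many providers report remaining quota in headers such as
            # "x-ratelimit-remaining"; record it when present.
            remaining = resp.headers.get("x-ratelimit-remaining")
            if remaining is not None:
                print(f"quota remaining: {remaining}")
            return resp
        # Over quota: honor Retry-After if given, otherwise back off exponentially.
        delay = float(resp.headers.get("retry-after", 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_tries} attempts")
```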

  • Quota and scheduling management (2): in addition to sending data to Grafana, it may be worth exposing usage metrics via an endpoint that a human or a program can query. A user could do their own quota management until we provide that: query the usage endpoint to see what's running, and then decide whether to request a resource now or try again later.

Interesting idea! The Together quotas reset very frequently, something like every second. But for longer-lived queries, that could be great. That said, I think just throttling requests could get us most of the benefit without any client work, so I'd want to add that feature later once we see demonstrated need.
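
The throttling piece could be as small as a token bucket shared by the workers hitting a given provider; the one-request-per-second figure here is just an assumed limit.

```python
# Minimal token-bucket throttle; the 1 request/second rate is an assumption.
import threading
import time


class Throttle:
    def __init__(self, rate_per_sec: float, burst: int = 1) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def wait(self) -> None:
        """Block until a request is allowed under the configured rate."""
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(0.05)


# Usage: throttle = Throttle(rate_per_sec=1.0); call throttle.wait() before each provider request.
```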

wpietri commented 1 week ago

Ah, and here's a just-announced look at Together's rate limits: https://docs.together.ai/docs/rate-limits

Our rate limits are currently measured in requests per second (RPS) and tokens per second (TPS) for each model type. If you exceed any of the rate limits you will get a 429 error.