mlcommons / modelbench

Run safety benchmarks against AI models and view detailed reports showing how well they performed.
https://mlcommons.org/ai-safety/
Apache License 2.0

Investigate proxy for clients to provision a model #458

Open rogthefrog opened 1 week ago

rogthefrog commented 1 week ago

Currently, clients need to handle resource creation themselves (e.g. "give me this Hugging Face resource so I can run my test"). This means they have to know the implementation details of the resource they need, and we can't easily manage resource allocation, queuing, turning resources off when not needed, etc.

Investigate a proxy that would take care of those things for clients: they would hit the proxy through a unified interface, and the proxy would handle the details.

wpietri commented 1 week ago

Thanks for looking into this! These are the approaches I think have been put forth so far.

  1. Something that is purely a resource management API. (A rough sketch of this follows the list.)
    • You call it to say, "I need X"; it spins one up and then makes sure it gets shut down in a timely fashion.
    • It collects resource usage data we can pull into something like Prometheus/Grafana.
    • Maybe it also has primitive quota management.
  2. A straightforward HTTP proxy that handles all API requests.
    • By default all requests get sent through untouched, so minimal code changes.
    • For specific vendors, we notice activities related to resources, spinning them up and down as needed.
    • Also for specific vendors, we manage resource sharing here. If multiple jobs are running, this shares the limited quota and/or number of GCP GPU machines in a fair fashion.
    • Exports much richer data to Prometheus/Grafana.
  3. A full LLM API wrapper.
    • Clients write to this API directly.
    • It abstracts the differences between model providers, so the client doesn't care which vendor is running their requested model.
    • Again, stats are collected.
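
Concretely, option 1 might look something like the sketch below. Everything here is illustrative (the `ResourceManager`/`Lease` names and the resource identifiers are made up); it only shows the acquire/release surface and the raw usage numbers a Prometheus exporter could read, with refcounting and queuing deliberately left out.

```python
# Illustrative sketch only -- names and behavior are assumptions, not a design.
import threading
import time
from dataclasses import dataclass, field


@dataclass
class Lease:
    resource: str  # e.g. "hf-endpoint/some-model" (hypothetical identifier)
    acquired_at: float = field(default_factory=time.monotonic)


class ResourceManager:
    """Option 1: clients say "I need X"; the manager spins it up, tracks usage,
    and makes sure it gets shut down in a timely fashion."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active: dict[str, Lease] = {}
        self.seconds_used: dict[str, float] = {}  # raw data an exporter could publish

    def acquire(self, resource: str) -> Lease:
        with self._lock:
            lease = self._active.get(resource)
            if lease is None:
                # A real implementation would call the vendor API here to
                # provision the endpoint/VM before returning.
                lease = Lease(resource)
                self._active[resource] = lease
            return lease

    def release(self, lease: Lease) -> None:
        with self._lock:
            active = self._active.pop(lease.resource, None)
            if active is not None:
                elapsed = time.monotonic() - active.acquired_at
                self.seconds_used[lease.resource] = (
                    self.seconds_used.get(lease.resource, 0.0) + elapsed
                )
                # ...and the vendor teardown call would go here.
```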

For me the main considerations are:

Glad to chat further, and I look forward to seeing what you come up with.

rogthefrog commented 1 week ago

Thanks for the summary! That covers what I was thinking as well.

A few thoughts:

wpietri commented 1 week ago

  • How often do users want to run a given model on more than one vendor?

I think we have four use cases, in rough priority order.

  1. Our team running official benchmarks and related practice/development tests
  2. SUT owners running practice tests against their own SUTs
    • via modellab and our servers
    • offline, via their own servers
  3. Workstream participants (and us) doing exploratory work with our tools
  4. Third parties who are trying things out, thinking about registering with us and becoming SUT owners, or wanting to validate our benchmarks

For the first case, I think switching will be rare. For the second, I think it'll be custom SUTs with the connection details controlled by configuration. For the third, I expect some swapping, e.g., moving around as model availability shifts, or comparing behavior for the same model across multiple servers. And the last doesn't matter, except to the extent we need to build with offline use cases in mind.

  • Given the high cost of running models and vendor availability limitations (e.g. our own vLLM instance isn't guaranteed to run if the region is out of GPU capacity), it feels important to be able to spin resources up only when needed and down as soon as we're done, so investing in instrumenting usage metrics feels useful.

Yes, absolutely. The evaluator set they're talking about could plausibly touch $1000/day, so we're going to have to be careful here. And since we're dealing with multiple services, it will be even easier for something to keep running somewhere.
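
To make "down as soon as we're done" concrete, here's a purely illustrative idle sweep; `shutdown_endpoint` and `IDLE_LIMIT` are placeholders rather than real hooks.

```python
# Sketch of an idle sweep; shutdown_endpoint() and IDLE_LIMIT are placeholders.
import time
from typing import Callable

IDLE_LIMIT = 10 * 60  # seconds of inactivity before teardown (assumed value)


def sweep_idle(last_used: dict[str, float], shutdown_endpoint: Callable[[str], None]) -> None:
    """last_used maps resource name -> monotonic timestamp of its last request."""
    now = time.monotonic()
    for resource, ts in list(last_used.items()):
        if now - ts > IDLE_LIMIT:
            shutdown_endpoint(resource)  # vendor-specific teardown goes here
            del last_used[resource]
```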

  • Letting users provide their own credentials so they can use their own vendor accounts feels worth it. Using MLC credentials for MLC runs is fine, but it won't just be MLC people forever.

Yeah, I think that will be especially important for the vendor case. We're going to want to give them an interface where they can configure SUTs at will, including endpoints and auth keys, plus something automated that tracks when those break so they can keep the connections working.
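
For illustration only, the owner-facing piece could be as small as an endpoint plus the name of an environment variable holding their key; the field names below are invented.

```python
# Hypothetical shape for owner-supplied SUT entries; field names are made up.
import os

SUTS = {
    "acme-model-v2": {
        "endpoint": "https://inference.example.com/v1/chat/completions",
        "api_key_env": "ACME_API_KEY",  # owner's own credential, read from their environment
    },
}


def auth_header(sut_name: str) -> dict[str, str]:
    entry = SUTS[sut_name]
    return {"Authorization": f"Bearer {os.environ[entry['api_key_env']]}"}
```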

  • A proxy exposes us to IP liability, as a user's data goes through us, which means we can intercept it, log it, etc. It also incurs runtime costs and complexity to manage its own scaling, fault tolerance, access keys, security, etc.

I was thinking of the proxy as only being for internal use, so I don't think the IP/security concerns apply much. To me an advantage of a proxy approach is that we can let SUTs be relatively vendor-specific, working normally for runs we don't control. But when we are doing runs, we specify a proxy to which we can gradually add accounting and quota management.
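
Sketching that shape (the URLs and the `MLC_PROXY_BASE` variable are assumptions, and the path just follows the common OpenAI-style convention): the SUT code stays vendor-specific, and only runs we control point it at the proxy.

```python
# Sketch: the same client code, with the base URL swapped only for runs we control.
import os

import requests

VENDOR_BASE = "https://api.together.xyz"       # normal, vendor-specific path
PROXY_BASE = os.environ.get("MLC_PROXY_BASE")  # hypothetical override, set only for our runs


def chat(payload: dict, api_key: str) -> dict:
    base = PROXY_BASE or VENDOR_BASE
    resp = requests.post(
        f"{base}/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()
```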

  • Quota and scheduling management: do the major vendors have APIs we can query to get information about current availability and usage?

I don't know if there are APIs; I expect we'd instead look at individual responses. Together, like some other services I've seen, gives back quota information in response headers. But we'd also want to look for HTTP responses that indicate being over quota and throttle back connections to the provider.
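
Something like this could cover both; the exact header names vary by provider, and the ones below are common conventions rather than anything confirmed.

```python
# Sketch of per-response quota handling; header names vary by provider.
import time

import requests


def post_with_backoff(url: str, headers: dict, payload: dict, max_tries: int = 5) -> requests.Response:
    for attempt in range(max_tries):
        resp = requests.post(url, headers=headers, json=payload, timeout=120)
        if resp.status_code != 429:
            # Many providers report remaining quota in headers such as
            # "x-ratelimit-remaining"; record it when present.
            remaining = resp.headers.get("x-ratelimit-remaining")
            if remaining is not None:
                print(f"quota remaining: {remaining}")
            return resp
        # Over quota: honor Retry-After if given, otherwise back off exponentially.
        delay = float(resp.headers.get("retry-after", 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_tries} attempts")
```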

  • Quota and scheduling management (2): in addition to sending data to Grafana, it may be worth exposing usage metrics via an endpoint that a human or a program can query. A user could do their own quota management until we provide that: query the usage endpoint to see what's running, and then decide whether to request a resource now or try again later.

Interesting idea! The Together quotas reset very frequently, something like every second. But for longer-lived queries, that could be great. That said, I think just throttling requests could get us most of the benefit without any client work, so I'd want to add that feature later once we see demonstrated need.
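
The throttling piece could be as small as a token bucket shared by the workers hitting a given provider; the one-request-per-second figure here is just an assumed limit.

```python
# Minimal token-bucket throttle; the 1 request/second rate is an assumption.
import threading
import time


class Throttle:
    def __init__(self, rate_per_sec: float, burst: int = 1) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def wait(self) -> None:
        """Block until a request is allowed under the configured rate."""
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(0.05)


# Usage: throttle = Throttle(rate_per_sec=1.0); call throttle.wait() before each provider request.
```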

wpietri commented 1 week ago

Ah, and here's a just-announced look at Together's rate limits: https://docs.together.ai/docs/rate-limits

Our rate limits are currently measured in requests per second (RPS) and tokens per second (TPS) for each model type. If you exceed any of the rate limits you will get a 429 error.