rogthefrog opened this issue 1 week ago
Thanks for looking into this! These are the approaches I think have been put forth so far.
For me the main considerations are:
Glad to chat further, and I look forward to seeing what you come up with.
Thanks for the summary! That covers what I was thinking as well.
A few thoughts:
- How often do users want to run a given model on more than one vendor? Intuitively I expect that to be uncommon, and that people are more likely to say "run x on huggingface" (i.e. x and huggingface "go together" even if x is available elsewhere). So exposing a vendor-agnostic abstraction to end users doesn't seem necessary. We could maintain a model-vendor list so a user can just say "give me a resource to run X"; we'd provision the right resources and return the connection info, and the end user wouldn't need to specify both the model AND the runner. But even that seems like a marginal feature to me.
- Given the high cost of running models and vendor availability limitations (e.g. our own vLLM instance isn't guaranteed to run if the region is out of GPU capacity), it feels important to be able to spin resources up only when needed and down as soon as we're done, so investing in instrumenting usage metrics feels useful.
- Letting users provide their own credentials so they can use their own vendor accounts feels worth it. Using MLC credentials for MLC runs is fine, but it won't just be MLC people forever.
- I don't know if proxying the API calls to the model transparently brings a lot of value. It exposes us to IP liability as a user's data goes through us, which means we can intercept it, log it, etc. It also incurs runtime costs and complexity to manage its own scaling, fault tolerance, access keys, security, etc. And if some unscrupulous person DDoSes huggingface through the proxy, now we look like the bad guys. On the other hand, it would allow us to queue up requests and manage backend/vendor resource utilization. But it can't commit to being 100% available when vendors aren't; we can throttle or buffer requests and handle transient issues on the vendor side, but there will be times when something fails hard and we return an error to the client. This means the client needs to be able to handle errors from our proxy. Since they also need to handle errors if they hit the vendor directly, and we can't entirely shield them from error handling, I don't think the benefit outweighs the cost/complexity/liability of a full pass-through proxy.
- Quota and scheduling management: do the major vendors have APIs we can query to get information about current availability and usage?
- Quota and scheduling management (2): in addition to sending data to Grafana, it may be worth exposing usage metrics via an endpoint that a human or a program can query. A user could do their own quota management until we provide that: query the usage endpoint to see what's running, and then decide whether to request a resource now or try again later.
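
To make that last point concrete, here's a rough sketch of what DIY quota management could look like from the client side. The endpoint URLs and response shape are hypothetical (nothing with these names exists yet); the idea is just "check what's running before asking for more":

```python
import requests

# Hypothetical endpoints on our side; names and payloads are illustrative only.
USAGE_URL = "https://provisioner.example.org/v1/usage"
PROVISION_URL = "https://provisioner.example.org/v1/resources"

def request_resource_if_idle(model: str, vendor: str, max_active: int = 2):
    """Ask the usage endpoint what's running; only provision if we're under our own cap."""
    usage = requests.get(USAGE_URL, timeout=10).json()
    active = [r for r in usage.get("resources", []) if r.get("status") == "running"]
    if len(active) >= max_active:
        return None  # caller decides whether to wait and try again later
    resp = requests.post(PROVISION_URL, json={"model": model, "vendor": vendor}, timeout=30)
    resp.raise_for_status()
    return resp.json()  # expected to contain the connection info
```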
> Thanks for the summary! That covers what I was thinking as well.
> A few thoughts:
> - How often do users want to run a given model on more than one vendor?
I think we have four use cases, in rough priority order.
For the first case, I think switching will be rare. For the second, I think it'll be custom SUTs with the connection details controlled by configuration. For the third, I expect some swapping, e.g. moving around as model availability shifts, or comparing behavior for the same model across multiple servers. And the last doesn't matter, except to the extent we need to build with offline use cases in mind.
> - Given the high cost of running models and vendor availability limitations (e.g. our own vLLM instance isn't guaranteed to run if the region is out of GPU capacity), it feels important to be able to spin resources up only when needed and down as soon as we're done, so investing in instrumenting usage metrics feels useful.
Yes, absolutely. The evaluator set they're talking about could plausibly touch $1000/day, so we're going to have to be careful here. And since we're dealing with multiple services, it will be even easier for something to keep running somewhere.
> - Letting users provide their own credentials so they can use their own vendor accounts feels worth it. Using MLC credentials for MLC runs is fine, but it won't just be MLC people forever.
Yeah, I think that will be especially important for the vendor case. We're going to want to give them an interface where they can configure SUTs at will, including endpoints and auth keys. And we'll want something automated that tracks when those break, so the connections keep working.
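
A rough sketch of what one of those vendor-configured SUT entries plus the automated breakage check might look like; all names here are illustrative, not an existing interface:

```python
import os
from dataclasses import dataclass

import requests

@dataclass
class SutConfig:
    """One vendor-configured SUT: where to send requests and which secret to use."""
    name: str
    endpoint: str        # the vendor's own inference endpoint
    api_key_env: str     # env var holding the vendor's key, so we never store it ourselves
    health_path: str = "/v1/models"  # hypothetical cheap probe; whatever the endpoint supports

def check_sut(cfg: SutConfig) -> bool:
    """Return True if the SUT still answers using the vendor's credentials."""
    key = os.environ.get(cfg.api_key_env)
    if not key:
        return False  # key was never set or has been removed
    try:
        r = requests.get(cfg.endpoint.rstrip("/") + cfg.health_path,
                         headers={"Authorization": f"Bearer {key}"},
                         timeout=10)
        return r.status_code == 200
    except requests.RequestException:
        return False

# A scheduled job could run check_sut over every configured SUT and
# notify the vendor contact as soon as one starts failing.
```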
> It exposes us to IP liability as a user's data goes through us, which means we can intercept it, log it, etc. It also incurs runtime costs and complexity to manage its own scaling, fault tolerance, access keys, security, etc.
I was thinking of the proxy as only being for internal use, so I don't think the IP/security concerns apply much. To me an advantage of a proxy approach is that we can let SUTs be relatively vendor-specific, working normally for runs we don't control. But when we are doing runs, we specify a proxy to which we can gradually add accounting and quota management.
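
For example, if the SUT talks to an OpenAI-compatible endpoint (Together's API is one), swapping in the proxy for runs we control could be nothing more than a base-URL override. `MLC_PROXY_URL` here is a hypothetical setting, not something that exists today:

```python
import os

from openai import OpenAI  # assumes the SUT talks to an OpenAI-compatible endpoint

# Runs we don't control go straight to the vendor; runs we do control set
# MLC_PROXY_URL (hypothetical) and everything flows through our proxy instead.
base_url = os.environ.get("MLC_PROXY_URL", "https://api.together.xyz/v1")

client = OpenAI(base_url=base_url, api_key=os.environ["TOGETHER_API_KEY"])
# The proxy can start as a plain pass-through and grow accounting and
# quota management behind this same interface over time.
```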
> - Quota and scheduling management: do the major vendors have APIs we can query to get information about current availability and usage?
I don't know if there are APIs; I expect we'd instead look at individual responses. Together, like some other services I've seen, gives back quota information in response headers. But we'd also want to look for HTTP responses that indicate being over quota and throttle back connections to the provider.
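
Something like this is what I had in mind for watching individual responses; the quota header name is illustrative (vendors differ), and Retry-After isn't guaranteed to be present:

```python
import time

import requests

def post_with_backoff(url: str, payload: dict, headers: dict, max_tries: int = 5):
    """Send one request, slowing down when the vendor says we're over quota."""
    delay = 1.0
    resp = None
    for _ in range(max_tries):
        resp = requests.post(url, json=payload, headers=headers, timeout=60)
        if resp.status_code == 429:
            # Honor Retry-After when the vendor sends it, otherwise back off exponentially.
            wait = float(resp.headers.get("Retry-After", delay))
            time.sleep(wait)
            delay = min(delay * 2, 60.0)
            continue
        # Illustrative header name: each vendor needs its own mapping.
        remaining = resp.headers.get("x-ratelimit-remaining")
        if remaining is not None and float(remaining) <= 0:
            time.sleep(delay)  # throttle before the next call instead of waiting for a 429
        return resp
    return resp
```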
> - Quota and scheduling management (2): in addition to sending data to Grafana, it may be worth exposing usage metrics via an endpoint that a human or a program can query. A user could do their own quota management until we provide that: query the usage endpoint to see what's running, and then decide whether to request a resource now or try again later.
Interesting idea! The Together quotas reset very frequently, something like every second. But for longer-lived queries, that could be great. That said, I think just throttling requests could get us most of the benefit without any client work, so I'd want to add that feature later once we see demonstrated need.
Ah, and here's a just-announced look at Together's rate limits: https://docs.together.ai/docs/rate-limits
> Our rate limits are currently measured in requests per second (RPS) and tokens per second (TPS) for each model type. If you exceed any of the rate limits you will get a 429 error.
Currently clients need to handle resource creation themselves (e.g. "give me this huggingface resource so I can run my test"). This means they have to know the implementation details of the resource they need, and we can't easily manage resource allocation, queuing, turning resources off when not needed, etc.
Investigate a proxy that would handle those details for clients, so they can just hit the proxy with a unified interface and let it take care of the rest.
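
As a sketch of what that unified interface might look like from the client's side (the service URL, route, payload, and model name are all hypothetical):

```python
import requests

PROXY_URL = "https://model-proxy.example.org"  # hypothetical service

def run_inference(model: str, prompt: str) -> str:
    """Client-side view of the unified interface: name the model, let the proxy
    pick the vendor, provision the resource if needed, and route the call."""
    resp = requests.post(f"{PROXY_URL}/v1/run",
                         json={"model": model, "prompt": prompt},
                         timeout=120)
    resp.raise_for_status()  # clients still have to handle hard failures, per the discussion above
    return resp.json()["completion"]

# e.g. run_inference("mistral-7b-instruct", "Hello!") -- no vendor details needed.
```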