Closed: redhog closed this issue 1 week ago
In particular, docetl should be tested with the lowest tier of OpenAI account on a machine with a good network connection, to make sure nothing funny happens when OpenAI does decide to apply the rate limit :)
My approach to solving this would be to add an `llm_map` method on the DSLRunner that takes an array of dictionaries, each being the arguments to the litellm `complete()` function, and runs them all, while applying global limits on concurrent calls, total number of calls per second, or total number of tokens per second, as set in the yaml config.
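As a rough illustration of the idea (the `llm_map` name, the `max_concurrent_llm_calls` config key, and the semaphore-based cap are all hypothetical here, not docetl's actual API), the method could look something like:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class DSLRunner:
    """Sketch only: illustrates a global concurrency cap for LLM calls."""

    def __init__(self, config):
        # Hypothetical yaml key for the global concurrency limit.
        self.max_concurrent = config.get("max_concurrent_llm_calls", 4)
        self._semaphore = threading.BoundedSemaphore(self.max_concurrent)

    def llm_map(self, call_args_list, complete_fn):
        """Run one litellm-style completion per argument dict, capping
        the number of in-flight calls globally with a semaphore."""

        def one_call(kwargs):
            with self._semaphore:  # global cap shared by all operations
                return complete_fn(**kwargs)

        with ThreadPoolExecutor(max_workers=self.max_concurrent) as pool:
            return list(pool.map(one_call, call_args_list))
```

Calls-per-second and tokens-per-second limits would need a rate limiter on top of this; the semaphore only bounds concurrency.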
In the past, I’ve successfully used OpenLimit, a tool for managing rate limits across multiple threads and processes for OpenAI-compatible APIs. It offers support for various models, both synchronous and asynchronous, and easily integrates with Redis for distributed requests.
I’m happy to help implement this as a global token rate limiter. Let me know if you’d like assistance in integrating it into the system!
OpenLimit looks interesting, but because of LiteLLM, we support LLMs that are not necessarily OpenAI-compatible (e.g., Gemini). I like @redhog's idea of allowing the user to set their rate limits in the config. We can probably set these rate limits as a global variable to avoid passing them around all the classes and functions...it's not ideal to have global variables, but this may be a special case where it's OK.
I had a look at https://pyratelimiter.readthedocs.io/en/latest/ and it looks promising. It's not specific to OpenAI (or even REST APIs); however, it doesn't work across processes (at least not unless you throw Redis into the mix...).
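For the single-process case, the underlying mechanism is essentially a token bucket. A minimal hand-rolled sketch (this is not pyrate-limiter's actual API, just an illustration of what such a limiter does; `cost` would be 1 for a calls-per-second limit, or the token count of a request for a tokens-per-second limit):

```python
import threading
import time


class TokenBucket:
    """Minimal in-process token bucket. Like pyrate-limiter, this only
    coordinates threads within one process, not across processes."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec      # tokens replenished per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self, cost=1):
        """Block until `cost` tokens are available, then consume them."""
        while True:
            with self.lock:
                now = time.monotonic()
                # Refill based on elapsed time, capped at capacity.
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= cost:
                    self.tokens -= cost
                    return
                wait = (cost - self.tokens) / self.rate
            time.sleep(wait)  # sleep outside the lock, then retry
```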
I'd like to propose sending the DSLRunner around, at least to all ops and parsers. That way, DSLRunner.config is available and global config variables can be extracted. That also adds the possibility to have a DSLRunner.llm object that encapsulates all the functions that are currently in operations.utils as methods...
I'm open to passing the DSLRunner around, but I worry about how big this refactor may be, given the Optimizer class as well...
One similar idea (that doesn't involve passing around the DSLRunner) is to have an execution engine class that takes in a config, and has syntax check and run_operation methods (and any other methods that may make sense). Then both the DSLRunner and Optimizer can have execution objects. Since the execution object has the config with the rate limits, the operations can see the rate limits, and then the llm function call can thus see the rate limits.
We can also have an llm class, and each operation can have an instance of this llm object...if that frees us from having to pass around the llm-specific parameters. But maybe this is overkill or could be done in a separate PR.
What do you think?
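A rough sketch of the proposed split (all class, method, and config key names here are hypothetical, just to make the shape concrete):

```python
class ExecutionEngine:
    """Holds the config (including rate limits) plus the methods shared
    by DSLRunner and Optimizer, so neither needs to be passed around."""

    def __init__(self, config):
        self.config = config
        # Rate limits live here, visible to every operation the engine runs.
        self.rate_limits = config.get("rate_limits", {})

    def syntax_check(self, operation):
        raise NotImplementedError  # validate an operation's spec

    def run_operation(self, operation, data):
        raise NotImplementedError  # execute one operation, honoring rate_limits


class DSLRunner:
    """Owns an engine; adds what's unique to running pipelines,
    such as writing out intermediate steps."""

    def __init__(self, config):
        self.engine = ExecutionEngine(config)


class Optimizer:
    """Also owns an engine; adds sampling and per-operation optimization."""

    def __init__(self, config):
        self.engine = ExecutionEngine(config)
```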
What would the DSLRunner class have, in addition to what the execution object would have? I have to admit I haven't looked too much at the optimizer code to know what it does internally. Doesn't it just translate yaml to new yaml?
The optimizer class also handles sampling data for optimization, which is different per operator.
I'm imagining the execution object just has syntax check and run_operation methods (which are common to both DSLRunner and Optimizer). It doesn't need to have much: just the ability to see global information about rate limits.
The DSLRunner uniquely supports writing out intermediate steps. The Optimizer supports sampling, running the optimizers per operation (sometimes running upstream operations to have enough sample data to optimize a downstream operator), and creating a new optimized config.
Hm, so maybe we do need that split then :S
So, I started implementing this here: https://github.com/ucbepic/docetl/pull/64
The PR is now ready for merge :)
Awesome, I will take a look later today or tomorrow 🚀
Ok!! We are in!
Even with my changes, throttling of LLM calls is still not ideal. Ideally, one could specify a global rate of tokens per minute, or at least calls per minute, and have it enforced.
On the other hand, when timeouts and RateLimit errors happen in e.g. a resolve operation, the other calls still continue even if a single call is cancelled. But then the output contains an empty LLM response, and validation fails. This is clearly suboptimal economically: if the pipeline is going to fail in this situation, it should fail early, not burn through more API calls first!
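One way to fail early is to stop the batch at the first exception and cancel the not-yet-started calls. A sketch using `concurrent.futures` (the `RateLimitError` class and `run_all_or_fail_fast` name are made up for illustration; litellm raises its own exception types):

```python
from concurrent.futures import FIRST_EXCEPTION, ThreadPoolExecutor, wait


class RateLimitError(Exception):
    """Stand-in for the provider's rate-limit exception."""


def run_all_or_fail_fast(fn, args_list, max_workers=4):
    """Sketch: abort the batch at the first failure instead of letting
    the remaining calls run (and cost money) once the pipeline is doomed."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fn, a) for a in args_list]
        done, not_done = wait(futures, return_when=FIRST_EXCEPTION)
        for f in not_done:
            f.cancel()  # only cancels calls that haven't started yet
        for f in done:
            exc = f.exception()
            if exc is not None:
                raise exc  # surface the first failure to the caller
        return [f.result() for f in futures]
```

Calls already in flight still run to completion (threads can't be interrupted), but nothing new is dispatched after the first error, so the failure surfaces without an empty LLM response silently entering the output.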