ucbepic / docetl

A system for agentic LLM-powered data processing and ETL
https://docetl.org
MIT License

Question: Inefficient map #86

Open redhog opened 2 weeks ago

redhog commented 2 weeks ago

So... I have a large dataset I want to run a map operation over, where the relevant key for each item is small. For example, the key might contain a single line of text, and the map might use a prompt to determine whether this text is likely to be the name of a person.

The cost and time of running many LLM calls ends up pretty high, while it should theoretically be possible to batch the values, say 100 at a time, and use a slightly more complex prompt that outputs a list of true/false values instead of a single boolean.

I can't see a way to do this in docetl currently, but it shouldn't be too hard to implement as a pair of operations: one that batches items into groups of N, and one that takes such a group (which might also have other keys holding lists of the same length), merges in the values of those extra keys, and flattens the whole thing back into a list of individual items.
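
For illustration, a minimal sketch of that batch/unbatch pair in plain Python (function names like batch_items / unbatch_items are hypothetical, not existing docetl operations):

    def batch_items(items: list[dict], batch_size: int) -> list[dict]:
        # Group N items into one record whose keys hold parallel lists of values.
        batches = []
        for i in range(0, len(items), batch_size):
            group = items[i:i + batch_size]
            keys = group[0].keys()
            batches.append({k: [item[k] for item in group] for k in keys})
        return batches

    def unbatch_items(batches: list[dict]) -> list[dict]:
        # Flatten batched records back into one dict per original item by
        # zipping the parallel lists (e.g. original text + per-item LLM output).
        items = []
        for batch in batches:
            keys = list(batch.keys())
            for values in zip(*(batch[k] for k in keys)):
                items.append(dict(zip(keys, values)))
        return items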

This could potentially be used by an optimization rule to make map operations faster when the above case is detected.

shreyashankar commented 2 weeks ago

Agreed, this is what I was trying to get at with #7 but was not clear enough.

I wonder if a cleaner solution is to implement a batchmap operation, which takes in a prompt template and batch size, and does the batching & flattening you mentioned. Two operations may be excessive...
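
As a rough illustration of what such an operation could accept (the batchmap type, the batch_size key, and the items loop variable are all hypothetical, not part of docetl today), the batch prompt would see a whole group of items and be asked for one answer per item:

    # Hypothetical batchmap config: one prompt over a batch of items,
    # returning a list with one boolean per input item.
    batchmap_config = {
        "name": "is_person_name",
        "type": "batchmap",   # hypothetical operation type
        "batch_size": 100,
        "prompt": (
            "For each line below, answer true if it is likely a person's name.\n"
            "{% for item in items %}{{ loop.index }}. {{ item.text }}\n{% endfor %}\n"
            "Return a JSON list of booleans, one per line, in order."
        ),
        "output": {"schema": {"is_person_name": "list[bool]"}},
    }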

I really like the idea of automatically determining this in the optimizer (e.g., ideal batch size)

redhog commented 2 weeks ago

A batchmap is probably the way to go, yes. Unfortunately, the same problem does come up in filter and cluster... so what about them?

shreyashankar commented 2 weeks ago

Good point. Maybe there's a way to set batch to true (or a size) in each of the three operations, and treat the prompt as a batch prompt if batching is detected? No need to introduce new operators then.

redhog commented 2 weeks ago

Dug a bit more, and what I'd like to propose is to replace

        with ThreadPoolExecutor(max_workers=self.max_batch_size) as executor:
            futures = [executor.submit(_process_map_item, item) for item in input_data]

in each operation with a single call to an APIWrapper.call_llm_with_batching() that takes a list of dicts with the same arguments call_llm takes now.

That way all the batching code can be generalized inside the APIWrapper, and the operations just need to pass on the batch size parameter...
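
Concretely, the replacement could look something like this (call_llm_with_batching is the proposed method, and render_prompt stands in for whatever each operation already does to build call_llm arguments):

    # Hypothetical replacement for the per-item executor loop: one list of
    # call_llm-style argument dicts, handed to the API wrapper in one call.
    calls = [{"messages": render_prompt(item)} for item in input_data]
    results = api_wrapper.call_llm_with_batching(
        calls,
        batch_size=self.config.get("batch_size", 100),
    )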

How does that sound? Would you be able to do this? I'm a bit unsure how to handle the interaction between this and gleaning (and validation / parsing for that sake)...

shreyashankar commented 2 weeks ago

The reason we didn't bake validation into call_llm is that validation happens at the operation output level, not the LLM call level (e.g., reduce and resolve orchestrate multiple LLM calls).

I like having a batch_call_llm method in the API wrapper, but we may want different batching & parsing logic for each operation (e.g., for filter, we might instruct the LLM to return the IDs or numbers of the documents that pass the filter; for map, we might instruct the LLM to give an output per document).
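
For example (purely illustrative parsing helpers, not an existing interface), the two operations could interpret the same batched response differently:

    def parse_batched_filter_response(batch: list[dict], passing_ids: list[int]) -> list[dict]:
        # Filter: the LLM returns the 1-based indices of the documents that pass.
        keep = set(passing_ids)
        return [doc for i, doc in enumerate(batch, start=1) if i in keep]

    def parse_batched_map_response(batch: list[dict], outputs: list[dict]) -> list[dict]:
        # Map: the LLM returns one output object per document, in order.
        return [{**doc, **out} for doc, out in zip(batch, outputs)]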

For validation + gleaning: I'll refactor the gleaning function to operate directly on the outputs. Then, if gleaning is enabled, I'll send document + call_llm or batch_call_llm output pairs (with the gleaning config) to the gleaning function, for each document. Similarly, I'll create a validation + retry function that operates directly on outputs, per output/document instead of per batch.
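
A rough sketch of that shape (the function names and signature are placeholders for the refactor described above, not existing code):

    from typing import Callable, Optional

    def postprocess_outputs(
        documents: list[dict],
        outputs: list[dict],
        glean: Optional[Callable[[dict, dict], dict]] = None,
        validate_and_retry: Optional[Callable[[dict, dict], dict]] = None,
    ) -> list[dict]:
        # Per-document post-processing, independent of whether the outputs
        # came from call_llm or batch_call_llm: glean first (if enabled),
        # then validate/retry on each document + output pair.
        results = []
        for doc, out in zip(documents, outputs):
            if glean is not None:
                out = glean(doc, out)
            if validate_and_retry is not None:
                out = validate_and_retry(doc, out)
            results.append(out)
        return results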

I'll create a PR in the next couple of days with this proposal.