ucbepic / docetl

A system for agentic LLM-powered data processing and ETL
https://docetl.org
MIT License

Cache parsed data to optimize loading and sampling #33

Open shreyashankar opened 1 week ago

shreyashankar commented 1 week ago

See PR #32 (thanks @ahmedtadde for the suggestion!)

Currently, we apply parsing tools each time we load or sample data. We could cache the output of self._apply_parsing_tools() for the entire dataset and use this cached data for both loading and sampling operations.

Benefits:

- Parsing tools run once per dataset instead of on every load or sample call.
- Loading and sampling share the same parsed output.

Considerations:

- Invalidating the cache when the underlying data or parsing config changes.
- Memory footprint of the cached output (or persisting it to disk between runs).
- Whether caching should be optional for the user.
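
A rough sketch of the idea, assuming the existing load() and _apply_parsing_tools() methods; the fingerprint helper and module-level cache here are purely illustrative:

import hashlib
import json

_parsed_cache = {}

def _fingerprint(records):
    # Stable content hash of the raw records; any cheap, stable key would do.
    return hashlib.sha256(json.dumps(records, sort_keys=True, default=str).encode()).hexdigest()

def get_parsed(dataset):
    # Parse once per distinct raw dataset; both loading and sampling can read from this.
    key = _fingerprint(dataset.load())
    if key not in _parsed_cache:
        _parsed_cache[key] = dataset._apply_parsing_tools()
    return _parsed_cache[key]

sample() would then draw from get_parsed(dataset) instead of re-running the parsing tools.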

shreyashankar commented 1 week ago

Check out this file for how we apply parsing.

ahmedtadde commented 1 week ago

Hey @shreyashankar, you can assign this to me. No promises on the turnaround, though; feel free to apply a work-stealing scheduling strategy here if you or someone else decides to take this on before I get a draft PR out.

I sketched out what I believe is a working prototype. But I am also itching to set up a full benchmarking script for profiling (using scalene) to compare the current implementation with a prospective caching implementation. More generally, such a setup would help keep an eye on performance regressions as the Dataset class evolves, since it is a central piece of the machinery.
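
A minimal sketch of such a harness (the file name, dataset path, and import location are placeholders), run under scalene with python -m scalene benchmark_dataset.py:

# benchmark_dataset.py (hypothetical): compare repeated load()/sample() calls on the
# current Dataset against a caching variant. Profile with: python -m scalene benchmark_dataset.py
import time

from docetl.dataset import Dataset  # adjust to the actual module path

def bench(dataset, iters=50, sample_size=10):
    # Time repeated load + sample calls; with caching, parsing should only run once.
    start = time.perf_counter()
    for _ in range(iters):
        dataset.load()
        dataset.sample(sample_size)
    return time.perf_counter() - start

if __name__ == "__main__":
    current = Dataset("file", "data.json")   # existing implementation
    cached = Dataset("file", "data.json")    # swap in the caching prototype here
    print(f"current: {bench(current):.3f}s")
    print(f"cached:  {bench(cached):.3f}s")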

The prototype is below, but there is at least one thing I would change before submitting it: make caching entirely optional for the client/user. Right now the class assumes caching is always wanted.

import os
from typing import List, Dict, Union, Optional, Any, Callable
from concurrent.futures import ThreadPoolExecutor
import json
import csv
from io import StringIO
import random
from collections import OrderedDict
from functools import wraps

# PARSING_TOOLS and ParsingTool are assumed to be provided by docetl; adjust these
# import paths to wherever they actually live in the package.
from docetl.parsing_tools import PARSING_TOOLS
from docetl.schemas import ParsingTool

class DatasetInternalCache:
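    """Size-bounded LRU cache: evicts least-recently-used entries when either the
    item count or the (roughly estimated) total size exceeds its limits."""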
    def __init__(self, max_size: int, max_items: int):
        self.max_size = max_size
        self.max_items = max_items
        self.cache = OrderedDict()
        self.size = 0

    def __setitem__(self, key, value):
        if key in self.cache:
            self.size -= self._get_size(self.cache[key])
            del self.cache[key]
        elif len(self.cache) >= self.max_items:
            _, v = self.cache.popitem(last=False)
            self.size -= self._get_size(v)

        self.cache[key] = value
        self.size += self._get_size(value)

        while self.size > self.max_size:
            _, v = self.cache.popitem(last=False)
            self.size -= self._get_size(v)

    def __getitem__(self, key):
        value = self.cache.pop(key)
        self.cache[key] = value
        return value

    def __contains__(self, key):
        return key in self.cache

    def _get_size(self, item):
        return sum(len(str(i)) for i in item)

def dataset_internal_cache(func):
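    # Memoize a Dataset method in its bounded LRU cache, keyed by method name and arguments.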
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        key = (func.__name__,) + args + tuple(sorted(kwargs.items()))
        if key not in self._cache:
            self._cache[key] = func(self, *args, **kwargs)
        return self._cache[key]
    return wrapper

class Dataset:
    def __init__(
        self,
        type: str,
        path_or_data: Union[str, List[Dict]],
        source: str = "local",
        parsing: Optional[List[Dict[str, str]]] = None,
        user_defined_parsing_tool_map: Optional[Dict[str, ParsingTool]] = None,
        cache_size: int = 1024 * 1024 * 1024,  # 1 GiB
        cache_item_size: int = 1024 * 1024,  # 1 MiB
    ):
        self.type = self._validate_type(type)
        self.source = self._validate_source(source)
        self.path_or_data = self._validate_path_or_data(path_or_data)
        self.parsing = self._validate_parsing(parsing or [])
        self.user_defined_parsing_tool_map = user_defined_parsing_tool_map or {}
        self._cached_data = None
        self.cache_size = cache_size
        self.cache_item_size = cache_item_size
        self.max_cache_items = max(1, self.cache_size // self.cache_item_size)
        self._cache = DatasetInternalCache(self.cache_size, self.max_cache_items)

    def _validate_type(self, type: str) -> str:
        if type not in ["file", "memory"]:
            raise ValueError("Type must be 'file' or 'memory'")
        return type

    def _validate_source(self, source: str) -> str:
        if source != "local":
            raise ValueError("Source must be 'local'")
        return source

    def _validate_path_or_data(
        self, path_or_data: Union[str, List[Dict]]
    ) -> Union[str, List[Dict]]:
        if self.type == "file":
            if not isinstance(path_or_data, str):
                raise ValueError("For type 'file', path_or_data must be a string")
            if not path_or_data.lower().endswith((".json", ".csv")):
                raise ValueError("Path must end with .json or .csv")
        elif not isinstance(path_or_data, list):
            raise ValueError(
                "For type 'memory', path_or_data must be a list of dictionaries"
            )
        return path_or_data

    def _validate_parsing(self, parsing_tools: List[Dict[str, str]]) -> List[Dict[str, str]]:
        for tool in parsing_tools:
            if not all(key in tool for key in ("input_key", "function", "output_key")):
                raise ValueError(
                    "Each parsing tool must have 'input_key', 'function', and 'output_key' keys"
                )
            if not all(isinstance(tool[key], str) for key in ("input_key", "function", "output_key")):
                raise ValueError(
                    "'input_key', 'function', and 'output_key' in parsing tools must be strings"
                )
            if "function_kwargs" in tool and not isinstance(tool["function_kwargs"], dict):
                raise ValueError("'function_kwargs', if present, must be a dictionary")
        return parsing_tools

    def load(self) -> List[Dict]:
        # Cache the raw loaded data on the instance; an lru_cache-decorated method would
        # share one slot across all Dataset instances and keep them alive.
        if self._cached_data is not None:
            return self._cached_data
        if self.type == "memory":
            self._cached_data = self.path_or_data
        else:
            ext = os.path.splitext(self.path_or_data.lower())[1]
            loader = self._json_loader if ext == ".json" else self._csv_loader
            with open(self.path_or_data, "r") as f:
                self._cached_data = loader(f)
        return self._cached_data

    @staticmethod
    def _json_loader(file) -> List[Dict]:
        return json.load(file)

    @staticmethod
    def _csv_loader(file) -> List[Dict]:
        csv_data = StringIO(file.read())
        return list(csv.DictReader(csv_data))

    def _hash_data(self, data: List[Dict]) -> int:
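        # Order-insensitive content hash of the raw records, used as the memoization key for parsing.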
        return hash(tuple(sorted((k, str(v)) for item in data for k, v in item.items())))

    @dataset_internal_cache
    def _apply_parsing_tools(self, data_hash: int) -> List[Dict]:
        data = self.load()
        for tool in self.parsing:
            input_key = tool["input_key"]
            output_key = tool["output_key"]
            function_kwargs = tool.get("function_kwargs", {})

            if tool["function"] in PARSING_TOOLS:
                func = PARSING_TOOLS[tool["function"]]
            elif tool["function"] in self.user_defined_parsing_tool_map:
                func = eval(self.user_defined_parsing_tool_map[tool["function"]].function_code)
            else:
                raise ValueError(f"Parsing tool {tool['function']} not found")

            with ThreadPoolExecutor() as executor:
                futures = [
                    executor.submit(self._process_item, item, input_key, output_key, func, **function_kwargs)
                    for item in data
                ]
                # Gather results in submission order so the dataset order stays deterministic.
                data = [item for future in futures for item in future.result()]

        return data

    def _process_item(
        self,
        item: Dict[str, Any],
        input_key: str,
        output_key: str,
        func: Callable,
        **function_kwargs: Dict[str, Any],
    ) -> List[Dict[str, Any]]:
        if input_key not in item:
            raise KeyError(f"Input key {input_key} not found in item: {item}")
        result = func(item[input_key], **function_kwargs)
        return [item | {output_key: res} for res in (result if isinstance(result, list) else [result])]

    def get_processed_data(self) -> List[Dict]:
        raw_data = self.load()
        data_hash = self._hash_data(raw_data)
        return self._apply_parsing_tools(data_hash)

    def sample(self, n: int, random_sample: bool = True) -> List[Dict]:
        if n <= 0:
            raise ValueError("Sample size must be positive")

        processed_data = self.get_processed_data()

        if n > len(processed_data):
            raise ValueError(f"Sample size {n} is larger than dataset size {len(processed_data)}")

        if random_sample:
            return random.sample(processed_data, n)
        else:
            return processed_data[:n]

    def clear_cache(self):
        # Reset both the memoized parsing results and the cached raw data.
        self._cache.cache.clear()
        self._cache.size = 0
        self._cached_data = None

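For the optional-caching change mentioned above, one option (the enable_cache attribute name is hypothetical) is to gate the memoization decorator on a flag set in the constructor:

from functools import wraps

def dataset_internal_cache(func):
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        # Bypass the cache entirely when the user opted out at construction time.
        if not getattr(self, "enable_cache", True):
            return func(self, *args, **kwargs)
        key = (func.__name__,) + args + tuple(sorted(kwargs.items()))
        if key not in self._cache:
            self._cache[key] = func(self, *args, **kwargs)
        return self._cache[key]
    return wrapper

Dataset.__init__ would then accept enable_cache: bool = True and store it on the instance.
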
shreyashankar commented 1 week ago

Thank you for taking this on! At a glance, I think we will want to use a disk cache so the data persists between pipeline runs. For example, we use DiskCache here.
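
A rough sketch of what that could look like with the diskcache package (the cache directory and key scheme are placeholders):

# Rough sketch with diskcache (pip install diskcache); directory and key scheme are placeholders.
from diskcache import Cache

parse_cache = Cache(".docetl_cache/parsed")  # survives across pipeline runs

def get_processed_data_persistent(dataset):
    # Reuse the prototype's content hash plus the parsing config as the cache key.
    key = (dataset._hash_data(dataset.load()), str(dataset.parsing))
    if key not in parse_cache:
        parse_cache[key] = dataset.get_processed_data()
    return parse_cache[key]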

No rush on the timeline; your contributions are much appreciated 😊