Open shreyashankar opened 1 month ago
Check out this file for how we apply parsing.
hey @shreyashankar, you can assign this to me. No promises on the turnaround though; feel free to apply a work-stealing scheduling strategy here if you or someone else decides to take this before I get a draft PR out.
thank you for taking this on! at a glance i think we will want to use a disk cache, so data persists between pipeline runs. for example, we use DiskCache here
no rush on the timeline; your contributions are much appreciated 😊
See PR #32 (thanks @ahmedtadde for the suggestion!)
Currently, we apply parsing tools each time we load or sample data. We could cache the output of self._apply_parsing_tools() for the entire dataset and use this cached data for both loading and sampling operations.
Benefits:
Considerations:
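A rough sketch of the caching idea above, to make the discussion concrete. Everything here except the name `_apply_parsing_tools` is an assumption: the `Dataset` class, the `load`/`sample` signatures, the record shape, and the cache key are hypothetical, and a plain pickle file on disk stands in for DiskCache so the example runs with the standard library only. It is not the project's actual implementation.

```python
import hashlib
import json
import pickle
import random
from pathlib import Path


class Dataset:
    """Hypothetical stand-in for the project's dataset class."""

    def __init__(self, records, cache_dir):
        self.records = records
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.parse_calls = 0  # counts real parsing passes, for illustration

    def _apply_parsing_tools(self):
        # Placeholder for the expensive parsing step (assumed behavior).
        self.parse_calls += 1
        return [{**r, "parsed": r["text"].upper()} for r in self.records]

    def _cache_path(self):
        # Key the cache on the raw records so changed input is re-parsed.
        # A real key would likely also need to cover the parsing config.
        payload = json.dumps(self.records, sort_keys=True).encode("utf-8")
        return self.cache_dir / (hashlib.sha256(payload).hexdigest() + ".pkl")

    def _parsed(self):
        # Parse the whole dataset once; later calls (and later runs
        # pointing at the same cache_dir) read the result from disk.
        path = self._cache_path()
        if path.exists():
            with path.open("rb") as f:
                return pickle.load(f)
        parsed = self._apply_parsing_tools()
        with path.open("wb") as f:
            pickle.dump(parsed, f)
        return parsed

    def load(self):
        return self._parsed()

    def sample(self, n, seed=0):
        # Sampling reuses the same cached parse output as load().
        parsed = self._parsed()
        return random.Random(seed).sample(parsed, min(n, len(parsed)))
```

With this shape, calling `load()` and then `sample(...)` on the same data triggers one parsing pass, and a second `Dataset` created later with the same records and `cache_dir` triggers none. Swapping the pickle file for a DiskCache-backed store would keep the same structure.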