Open shreyashankar opened 2 weeks ago
We now support pointing to other data sources in the JSON file, thanks to #32.
We should also support JSON files stored in the cloud, but this is lower priority. In the meantime, people can load cloud data using a custom tool/parser.
Right now, DocETL only works with JSON files as input. We need to broaden its capabilities to handle various data types and sources, making it more flexible and easier to use.
Goal
Build a versatile `Dataset` class that can work with different input types and sources. This new class should integrate smoothly with both the executor (`runner.py`) and the optimizer.
To-Do List
- Build the `Dataset` class in `docetl/dataset.py`
- Update `runner.py` to use the new `Dataset` class
- Update the optimizer (`builder.py`) to work with the new `Dataset` class
- `Dataset` class and its integration
(Proposed) Config Example
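As a purely illustrative sketch, a dataset entry in the config could map onto a `Dataset` object roughly like this. The key names `format` and `path`, the `PARSERS` registry, and the `from_config`/`load` methods are all assumptions made for the example; they are not the proposed schema or DocETL's actual API.

```python
import csv
import io
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def _parse_json(text: str) -> List[Record]:
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of records")
    return data


def _parse_csv(text: str) -> List[Record]:
    # Every CSV value comes back as a string; type coercion is out of scope here.
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


# Hypothetical registry: format name -> function turning file text into records.
PARSERS: Dict[str, Callable[[str], List[Record]]] = {
    "json": _parse_json,
    "csv": _parse_csv,
}


@dataclass
class Dataset:
    """Illustrative stand-in for the class this issue asks for."""

    name: str
    fmt: str
    path: str

    @classmethod
    def from_config(cls, name: str, cfg: Dict[str, str]) -> "Dataset":
        fmt = cfg.get("format", "json")  # assumed default: JSON, the current behavior
        if fmt not in PARSERS:
            raise ValueError(f"unsupported format {fmt!r} for dataset {name!r}")
        return cls(name=name, fmt=fmt, path=cfg["path"])

    def load(self) -> List[Record]:
        with open(self.path, encoding="utf-8", newline="") as f:
            return PARSERS[self.fmt](f.read())


# Demo: write a small CSV file, describe it in a config dict, and load it.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "reviews.csv")
with open(csv_path, "w", encoding="utf-8", newline="") as f:
    f.write("id,text\n1,great\n2,too slow\n")

config = {"datasets": {"reviews": {"format": "csv", "path": csv_path}}}
datasets = {
    name: Dataset.from_config(name, cfg)
    for name, cfg in config["datasets"].items()
}
rows = datasets["reviews"].load()
print(rows)
```

With something along these lines, the executor and optimizer would only need to call `load()` and would not care which format sits behind a dataset.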
Notes
- `fsspec`: it works with different storage systems
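
To make the `fsspec` note concrete, here is a minimal sketch of one loader reading the same JSON records from two different backends: fsspec's built-in in-memory filesystem and a local file. The helper name `load_json_records` and the file names are hypothetical; the only fsspec call used is `fsspec.open`, which picks the backend from the URL.

```python
import json
import os
import tempfile

import fsspec  # third-party: pip install fsspec


def load_json_records(url: str) -> list:
    """Read a JSON array of records from any location fsspec can open."""
    # The protocol prefix (none for local paths, memory://, s3://, ...) selects the backend.
    with fsspec.open(url, mode="rt", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON array of records in {url!r}")
    return data


records = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]

# Backend 1: the in-memory filesystem that ships with fsspec.
with fsspec.open("memory://records.json", mode="wt", encoding="utf-8") as f:
    json.dump(records, f)

# Backend 2: an ordinary local file.
tmp_dir = tempfile.mkdtemp()
local_path = os.path.join(tmp_dir, "records.json")
with open(local_path, "w", encoding="utf-8") as f:
    json.dump(records, f)

from_memory = load_json_records("memory://records.json")
from_local = load_json_records(local_path)
print(from_memory == from_local)
```

Cloud URLs such as `s3://...` would go through the same function, but they need the matching backend package installed (for example `s3fs`) plus credentials, so they are left out of this sketch.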