ucbepic / docetl

A system for complex LLM-powered document processing
https://docetl.org
MIT License

Support Various Input Data Types and Sources #2

Open shreyashankar opened 2 weeks ago

shreyashankar commented 2 weeks ago

Right now, DocETL only accepts JSON files as input. We need to broaden it to handle various data types and sources, making the system more flexible and easier to use.

Goal

Build a versatile Dataset class that can work with different input types and sources. This new class should integrate smoothly with both the executor (runner.py) and the optimizer (builder.py).

To-Do List

  1. Set up a new Dataset class in docetl/dataset.py (rough sketch after this list):
    • Handle local files and folders
    • Support cloud storage (S3, GCS, etc.)
    • Work with different file types (JSON, CSV, YAML, etc.)
  2. Update runner.py to use the new Dataset class:
    • Switch out the current data loading method
    • Make sure it plays nice with existing pipeline setups
  3. Tweak the optimizer (builder.py) to work with the new Dataset class:
    • Update any dataset-related bits in the optimizer
  4. Modify the YAML config format for new dataset types:
    • Add fields for dataset type, source, and format
    • Keep it backwards-compatible
  5. Write unit tests for the new Dataset class and its integration
  6. Update the docs:
    • Add examples using different dataset types in the tutorial
    • Refresh the API docs where needed
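
Something like the sketch below could work for item 1. The type/source/format/path fields mirror the proposed config example further down; everything else here (method names, the folder-globbing behavior, raising on not-yet-implemented cloud sources) is a placeholder, not settled API.

import csv
import json
from pathlib import Path

import yaml  # PyYAML


class Dataset:
    """Load input records from a local file/folder (cloud sources TBD) into a list of dicts."""

    def __init__(self, type: str, source: str, format: str, path: str):
        self.type = type        # "file" or "folder"
        self.source = source    # "local" for now; "s3", "gcs", ... later
        self.format = format    # "json", "csv", "yaml", ...
        self.path = path

    def load(self) -> list[dict]:
        if self.source != "local":
            # Placeholder: S3/GCS support would fetch objects here, then reuse _parse().
            raise NotImplementedError(f"source not supported yet: {self.source}")
        paths = (
            sorted(Path(self.path).glob(f"*.{self.format}"))
            if self.type == "folder"
            else [Path(self.path)]
        )
        records: list[dict] = []
        for p in paths:
            records.extend(self._parse(p.read_text()))
        return records

    def _parse(self, text: str) -> list[dict]:
        if self.format == "json":
            data = json.loads(text)
        elif self.format == "yaml":
            data = yaml.safe_load(text)
        elif self.format == "csv":
            return list(csv.DictReader(text.splitlines()))
        else:
            raise ValueError(f"unsupported format: {self.format}")
        return data if isinstance(data, list) else [data]

For item 2, runner.py could then swap its current JSON-loading code for something like:

dataset = Dataset(type="file", source="local", format="json", path="user_logs.json")
items = dataset.load()  # same list-of-dicts shape the executor consumes today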

(Proposed) Config Example

datasets:
  user_logs:
    type: file
    source: local
    format: json
    path: "user_logs.json"
  product_data:
    type: folder
    source: s3
    format: csv
    path: "s3://my-bucket/product-data/"


shreyashankar commented 1 day ago

We now support pointing to other data sources in the JSON file, thanks to #32.

We should also support JSON files stored in the cloud, but this is lower priority. In the meantime, people can load cloud data with a custom tool/parser.
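
For that interim custom-parser route, a standalone loader is enough. A minimal sketch, assuming boto3 with AWS credentials configured in the environment (this is a hypothetical helper, not a DocETL API):

import json
from urllib.parse import urlparse

import boto3


def load_json_from_s3(s3_uri: str) -> list[dict]:
    """Fetch a JSON object from S3 (s3://bucket/key) and return it as a list of records."""
    parsed = urlparse(s3_uri)
    obj = boto3.client("s3").get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
    data = json.loads(obj["Body"].read())
    return data if isinstance(data, list) else [data]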