ucbepic / docetl

A system for complex LLM-powered document processing
https://docetl.org
MIT License

Support Various Input Data Types and Sources #2

Open shreyashankar opened 2 weeks ago

shreyashankar commented 2 weeks ago

Right now, DocETL only accepts JSON files as input. We need to broaden it to handle various data types and sources, making the system more flexible and easier to use.

Goal

Build a versatile Dataset class that can work with different input types and sources. This new class should integrate smoothly with both the executor (runner.py) and the optimizer (builder.py).

To-Do List

  1. Set up a new Dataset class in docetl/dataset.py (rough sketch after this list):
    • Handle local files and folders
    • Support cloud storage (S3, GCS, etc.)
    • Work with different file types (JSON, CSV, YAML, etc.)
  2. Update runner.py to use the new Dataset class:
    • Switch out the current data loading method
    • Make sure it plays nice with existing pipeline setups
  3. Tweak the optimizer (builder.py) to work with the new Dataset class:
    • Update any dataset-related bits in the optimizer
  4. Modify the YAML config format for new dataset types:
    • Add fields for dataset type, source, and format
    • Keep it backwards-compatible
  5. Write unit tests for the new Dataset class and its integration
  6. Update the docs:
    • Add examples using different dataset types in the tutorial
    • Refresh the API docs where needed
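
Something like the sketch below could work for item 1. The type/source/format/path fields mirror the proposed config example further down; everything else here (method names, the folder-globbing behavior, raising on not-yet-implemented cloud sources) is a placeholder, not settled API.

import csv
import json
from pathlib import Path

import yaml  # PyYAML


class Dataset:
    """Load input records from a local file/folder (cloud sources TBD) into a list of dicts."""

    def __init__(self, type: str, source: str, format: str, path: str):
        self.type = type        # "file" or "folder"
        self.source = source    # "local" for now; "s3", "gcs", ... later
        self.format = format    # "json", "csv", "yaml", ...
        self.path = path

    def load(self) -> list[dict]:
        if self.source != "local":
            # Placeholder: S3/GCS support would fetch objects here, then reuse _parse().
            raise NotImplementedError(f"source not supported yet: {self.source}")
        paths = (
            sorted(Path(self.path).glob(f"*.{self.format}"))
            if self.type == "folder"
            else [Path(self.path)]
        )
        records: list[dict] = []
        for p in paths:
            records.extend(self._parse(p.read_text()))
        return records

    def _parse(self, text: str) -> list[dict]:
        if self.format == "json":
            data = json.loads(text)
        elif self.format == "yaml":
            data = yaml.safe_load(text)
        elif self.format == "csv":
            return list(csv.DictReader(text.splitlines()))
        else:
            raise ValueError(f"unsupported format: {self.format}")
        return data if isinstance(data, list) else [data]

For item 2, runner.py could then swap its current JSON-loading code for something like:

dataset = Dataset(type="file", source="local", format="json", path="user_logs.json")
items = dataset.load()  # same list-of-dicts shape the executor consumes today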

(Proposed) Config Example

datasets:
  user_logs:
    type: file
    source: local
    format: json
    path: "user_logs.json"
  product_data:
    type: folder
    source: s3
    format: csv
    path: "s3://my-bucket/product-data/"


shreyashankar commented 1 day ago

We now support pointing to other data sources in the JSON file, thanks to #32.

We should also support JSON files stored in the cloud, but this is lower priority. In the meantime, people can load cloud data with a custom tool/parser.
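
For that interim custom-parser route, a standalone loader is enough. A minimal sketch, assuming boto3 with AWS credentials configured in the environment (this is a hypothetical helper, not a DocETL API):

import json
from urllib.parse import urlparse

import boto3


def load_json_from_s3(s3_uri: str) -> list[dict]:
    """Fetch a JSON object from S3 (s3://bucket/key) and return it as a list of records."""
    parsed = urlparse(s3_uri)
    obj = boto3.client("s3").get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
    data = json.loads(obj["Body"].read())
    return data if isinstance(data, list) else [data]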