ucbepic / docetl

A system for agentic LLM-powered data processing and ETL
https://docetl.org
MIT License
934 stars, 87 forks

Lineage #94

Open shreyashankar opened 1 week ago

shreyashankar commented 1 week ago

Reduce Operation Lineage

From discord:

One use case I'm really interested in is [pre]computing a set of "reports" / outputs from a large set of documents, and then being able to reuse that computation when I filter documents to the applicable reports that have only those documents as "Sources"

i.e.

if the full corpus is a, b, c, d, e, f -> generates reports 1 (a, b, c) + 2 (b, c, d) + 3 (c, d, e) + 4 (d, e, f), and then I want to see the "reports" that docs d, e, f contributed to = 2, 3, 4

My proposal is to support a lineage param in the output, e.g.,

name: opname
type: reduce
reduce_key: ...
prompt: ...
output:
  schema: ...
  lineage:
    - keyname1
    - keyname2

Then every document in the output should have a key opname_lineage holding a list of key-value pairs: one entry per document in the group that the output document was derived from, restricted to the keys listed under lineage.
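A minimal sketch of what that could look like, assuming a plain in-memory reduce. The function `reduce_with_lineage`, the `reduce_fn` stand-in for the LLM call, and the sample keys `doc_id`, `topic`, and `source` are hypothetical and not part of docetl:

```python
from collections import defaultdict


def reduce_with_lineage(docs, reduce_key, lineage_keys, op_name, reduce_fn):
    """Group docs by reduce_key, reduce each group, and attach lineage."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[reduce_key]].append(doc)

    outputs = []
    for key, group in groups.items():
        out = reduce_fn(key, group)  # stand-in for the LLM-backed reduce
        # One dict per input document, restricted to the requested keys.
        out[f"{op_name}_lineage"] = [
            {k: doc.get(k) for k in lineage_keys} for doc in group
        ]
        outputs.append(out)
    return outputs


docs = [
    {"doc_id": "a", "topic": "x", "source": "a.pdf"},
    {"doc_id": "b", "topic": "x", "source": "b.pdf"},
    {"doc_id": "c", "topic": "y", "source": "c.pdf"},
]
reports = reduce_with_lineage(
    docs,
    reduce_key="topic",
    lineage_keys=["doc_id", "source"],
    op_name="opname",
    reduce_fn=lambda key, group: {"topic": key, "n_docs": len(group)},
)
print(reports[0]["opname_lineage"])
# [{'doc_id': 'a', 'source': 'a.pdf'}, {'doc_id': 'b', 'source': 'b.pdf'}]
```

Each output carries only the listed keys from its inputs, so the lineage stays small even when the input documents are large.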

Querying Pipeline Lineage

It would be nice to log all the pipeline lineage to sqlite & have users be able to query it (e.g., find all the reports contributed by certain upstream/input docs). We'd have to think of a good data model & query patterns.

garuna-m6 commented 3 days ago

@shreyashankar it took more time than I thought to get the OpenAI keys :( I'm trying to understand the issue here: do we need tracing in the logs for the lineage of reduce operations (I don't want the SQL setup anywhere in the pipeline)? With the existing verbose functionality, should the logging look like reduce : lineage keys [if used] : reduce operation output 👀? I would need some guidance.

shreyashankar commented 2 days ago

No worries!

I think the logging can be set at a pipeline level; in the top level of the config someone can specify the path to store a sqlite db of the logs; then, we can add ids to each document in the input and pass them through each operation in the pipeline.

For each operation, we could create a table of the outputs, with an additional "id" column. We could also create a dependency table for each operation to link the operation's outputs with the id(s) of its inputs:

CREATE TABLE {operation}_dependencies (
    dependent_id INTEGER REFERENCES dependent_table(id),
    main_id INTEGER REFERENCES main_table(id),
    PRIMARY KEY (dependent_id, main_id)
);

So, each operation has its own output table, as well as a dependencies table. This can enable both forward and backwards tracing.
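A rough sketch of that data model using Python's built-in `sqlite3` module and the a-f example from the top of the issue. The table names `input_docs`, `make_reports_outputs`, and `make_reports_dependencies` are made up for illustration, and it assumes `dependent_id` refers to an operation's output while `main_id` refers to one of the inputs it was derived from:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE input_docs (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE make_reports_outputs (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE make_reports_dependencies (
    dependent_id INTEGER REFERENCES make_reports_outputs(id),
    main_id INTEGER REFERENCES input_docs(id),
    PRIMARY KEY (dependent_id, main_id)
);
""")

names = "abcdef"
conn.executemany(
    "INSERT INTO input_docs VALUES (?, ?)", list(enumerate(names, start=1))
)
report_sources = {1: "abc", 2: "bcd", 3: "cde", 4: "def"}
for report_id, sources in report_sources.items():
    conn.execute(
        "INSERT INTO make_reports_outputs VALUES (?, ?)",
        (report_id, f"report {report_id}"),
    )
    conn.executemany(
        "INSERT INTO make_reports_dependencies VALUES (?, ?)",
        [(report_id, names.index(s) + 1) for s in sources],
    )


def reports_for(doc_names):
    """Forward trace: reports that any of the given input docs fed into."""
    marks = ",".join("?" for _ in doc_names)
    rows = conn.execute(
        f"""SELECT DISTINCT dep.dependent_id
            FROM make_reports_dependencies AS dep
            JOIN input_docs AS d ON d.id = dep.main_id
            WHERE d.name IN ({marks})
            ORDER BY dep.dependent_id""",
        list(doc_names),
    ).fetchall()
    return [row[0] for row in rows]


def sources_for(report_id):
    """Backward trace: input docs that a given report was derived from."""
    rows = conn.execute(
        """SELECT d.name
           FROM make_reports_dependencies AS dep
           JOIN input_docs AS d ON d.id = dep.main_id
           WHERE dep.dependent_id = ?
           ORDER BY d.name""",
        (report_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(reports_for(["d", "e", "f"]))  # [2, 3, 4]
print(sources_for(3))                # ['c', 'd', 'e']
```

Both directions are a single join on the dependency table. For a multi-step pipeline, the same join would be chained across each operation's dependency table.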

garuna-m6 commented 1 day ago

Sorry for asking for explanations like a 5-year-old, but the docetl pipeline runs on demand. Is the expectation here to start a local sqlite server if it is set in the config, put all the logs in the db, and then shut the server down when the pipeline finishes :/ ? Or do we dump the logs for a SQL server to read, or do we expect the server connection files to be present?

shreyashankar commented 1 day ago

No worries, sorry for the confusion! Sqlite doesn't require a separate server process: https://docs.python.org/3/library/sqlite3.html

So if the user specifies a path for the sqlite db in the config, we can create a db and populate relevant tables as the outputs are created.
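To illustrate, here is a small sketch of opening a file-backed database straight from a config value, with no server involved. The config key `lineage_db_path`, the helper `open_lineage_db`, and the `runs` table are hypothetical names, not existing docetl options:

```python
import os
import sqlite3
import tempfile

# Hypothetical top-level config entry; a temp directory keeps the sketch tidy.
config = {"lineage_db_path": os.path.join(tempfile.mkdtemp(), "lineage.sqlite")}


def open_lineage_db(config):
    """Return a connection if a db path is configured, else None."""
    path = config.get("lineage_db_path")
    if path is None:
        return None  # lineage logging is disabled
    # sqlite3.connect opens the file directly, creating it if it is missing.
    return sqlite3.connect(path)


conn = open_lineage_db(config)
conn.execute(
    "CREATE TABLE IF NOT EXISTS runs (id INTEGER PRIMARY KEY, started_at TEXT)"
)
conn.execute("INSERT INTO runs (started_at) VALUES (datetime('now'))")
conn.commit()
conn.close()

db_exists = os.path.exists(config["lineage_db_path"])
print(db_exists)  # True
```

The database is just a file on disk, so there is nothing to start up or shut down around a pipeline run apart from closing the connection.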

redhog commented 1 day ago

Is there a big reason to keep the lineage data out-of-band?

I'd rather save the lineage info inside the items, so that an outside system that gets the final output dump has access to it directly (without a join). What's the drawback of doing that?

I think sqlite output is interesting in the context of https://github.com/ucbepic/docetl/issues/104 btw :)

Also, potentially for storing the intermediate data more efficiently?

shreyashankar commented 20 hours ago

I think saving it to a database makes it significantly more queryable... otherwise, constructing forward traces will involve a bunch of for loops to go through the outputs and see which ones contain the source id. Similarly, constructing backward traces will require lots of wrangling.
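For comparison, a forward trace over inline lineage might look roughly like the sketch below. It assumes each output stores its sources under a hypothetical `opname_lineage` key with a `doc_id` field, and reuses the a-f example from the top of the issue:

```python
# Outputs with lineage stored inline (hypothetical shape).
reports = [
    {"report": 1, "opname_lineage": [{"doc_id": d} for d in "abc"]},
    {"report": 2, "opname_lineage": [{"doc_id": d} for d in "bcd"]},
    {"report": 3, "opname_lineage": [{"doc_id": d} for d in "cde"]},
    {"report": 4, "opname_lineage": [{"doc_id": d} for d in "def"]},
]


def forward_trace(outputs, source_ids, lineage_key="opname_lineage"):
    """Scan every output and keep those whose lineage mentions a source id."""
    wanted = set(source_ids)
    hits = []
    for out in outputs:
        for entry in out[lineage_key]:
            if entry["doc_id"] in wanted:
                hits.append(out["report"])
                break  # one match is enough for this output
    return hits


print(forward_trace(reports, ["d", "e", "f"]))  # [2, 3, 4]
```

Every query walks all outputs and all of their lineage entries, and a pipeline with several operations would need one such pass per operation.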

redhog commented 20 hours ago

Well, that depends on what happens with the output. If it's just a JSON file, yes. But if you insert it into something like Elasticsearch, then having the metadata / lineage inline is super useful. So maybe both?

If we had output plugins, and could write multiple outputs with different plugins, then this could be handled at the output stage:

pipeline:
  steps: ...
  output:
    - json:
        path: my-pipeline-output.json
    - sqlite:
        path: metadata.sqlite
        keys:
          - source-file
          - page
    - elasticsearch:
        url: http://localhost:9200/
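One way such an output stage could be wired up is sketched below, purely as an assumption since docetl has no such plugin system in this thread. The names `WRITERS`, `write_json`, `write_sqlite`, `write_outputs`, and the `metadata` table are invented for illustration. The output config is written as the Python structure the YAML would parse to if each plugin's options were nested under its name, and the Elasticsearch writer is left out because it needs an external client:

```python
import json
import os
import sqlite3
import tempfile


def write_json(items, path):
    """Dump the full items, lineage included, to a JSON file."""
    with open(path, "w") as f:
        json.dump(items, f)


def write_sqlite(items, path, keys):
    """Store only the selected metadata keys, one row per item."""
    conn = sqlite3.connect(path)
    cols = ", ".join(f'"{k}"' for k in keys)  # quoted: keys may contain '-'
    marks = ", ".join("?" for _ in keys)
    conn.execute(f"CREATE TABLE IF NOT EXISTS metadata ({cols})")
    conn.executemany(
        f"INSERT INTO metadata VALUES ({marks})",
        [tuple(item.get(k) for k in keys) for item in items],
    )
    conn.commit()
    conn.close()


# Hypothetical plugin registry: plugin name -> writer function.
WRITERS = {"json": write_json, "sqlite": write_sqlite}


def write_outputs(items, output_config):
    """Send the same items to every configured output plugin."""
    for entry in output_config:
        for name, options in entry.items():
            WRITERS[name](items, **options)


tmp = tempfile.mkdtemp()
items = [
    {"summary": "s1", "source-file": "a.pdf", "page": 1},
    {"summary": "s2", "source-file": "b.pdf", "page": 2},
]
output_config = [
    {"json": {"path": os.path.join(tmp, "my-pipeline-output.json")}},
    {"sqlite": {"path": os.path.join(tmp, "metadata.sqlite"),
                "keys": ["source-file", "page"]}},
]
write_outputs(items, output_config)
```

With this shape, the JSON output keeps the lineage inline while the sqlite output holds just the queryable metadata, so both uses are covered by the same run.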