Open shreyashankar opened 1 month ago
@shreyashankar it took more time than I thought to get the OpenAI keys :( Trying to understand the issue here: we need tracing in the logs for the lineage of reduce operations (without any SQL setup anywhere in the pipeline)? With the existing verbose functionality, would we log something like reduce : lineage keys [if used] : reduce operation output? 👀 Would need some guidance.
No worries!
I think the logging can be set at a pipeline level; in the top level of the config someone can specify the path to store a sqlite db of the logs; then, we can add ids to each document in the input and pass them through each operation in the pipeline.
For each operation, we could create a table of the outputs, with an additional "id" column. We could also create a dependency table for each operation to link the operation's outputs with the id(s) of its inputs:
CREATE TABLE {operation}_dependencies (
    dependent_id INTEGER REFERENCES dependent_table(id),
    main_id INTEGER REFERENCES main_table(id),
    PRIMARY KEY (dependent_id, main_id)
);
So, each operation has its own output table, as well as a dependencies table. This can enable both forward and backwards tracing.
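To make the per-operation tables concrete, here is a minimal sketch using Python's built-in sqlite3 module. The two-step pipeline (a map feeding a reduce), the table names, and the sample rows are all made up for illustration. It also assumes that dependent_id points at the operation's output row and main_id at the input row it was derived from, which the schema above does not spell out.

```python
import sqlite3

# Hypothetical two-step pipeline: a "map" operation feeding a "reduce".
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE map_outputs (id INTEGER PRIMARY KEY, summary TEXT);
CREATE TABLE reduce_outputs (id INTEGER PRIMARY KEY, report TEXT);
-- Assumption: dependent_id = the reduce output, main_id = the map output it used.
CREATE TABLE reduce_dependencies (
    dependent_id INTEGER REFERENCES reduce_outputs(id),
    main_id INTEGER REFERENCES map_outputs(id),
    PRIMARY KEY (dependent_id, main_id)
);
""")
conn.executemany("INSERT INTO map_outputs VALUES (?, ?)",
                 [(1, "summary a"), (2, "summary b"), (3, "summary c")])
conn.executemany("INSERT INTO reduce_outputs VALUES (?, ?)",
                 [(10, "report on a+b"), (11, "report on b+c")])
conn.executemany("INSERT INTO reduce_dependencies VALUES (?, ?)",
                 [(10, 1), (10, 2), (11, 2), (11, 3)])


def backward_trace(conn, reduce_id):
    """Ids of the map outputs that a given reduce output was built from."""
    rows = conn.execute(
        "SELECT main_id FROM reduce_dependencies "
        "WHERE dependent_id = ? ORDER BY main_id", (reduce_id,))
    return [row[0] for row in rows]


def forward_trace(conn, map_id):
    """Ids of the reduce outputs that a given map output contributed to."""
    rows = conn.execute(
        "SELECT dependent_id FROM reduce_dependencies "
        "WHERE main_id = ? ORDER BY dependent_id", (map_id,))
    return [row[0] for row in rows]
```

Both directions are a single indexed lookup on the dependencies table, which is the main appeal of this layout.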
Sorry for asking for an explain-like-I'm-five, but since a DocETL pipeline runs on demand: is the expectation to start a local sqlite server if one is set in the config, put all the logs in the db, and then shut the server down when the pipeline finishes :/ Or do we dump the logs for a SQL server to read, or do we expect the server connection files to be present already?
No worries, sorry for the confusion! Sqlite doesn't require a separate server process: https://docs.python.org/3/library/sqlite3.html
So if the user specifies a path for the sqlite db in the config, we can create a db and populate relevant tables as the outputs are created.
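As a rough illustration of the "no server" point: sqlite3.connect simply opens (or creates) a file at the given path, so the whole lifecycle is open, write, close. The config key lineage_db_path, the operation_log table, and the helper name below are assumptions for this sketch, not existing DocETL settings.

```python
import os
import sqlite3
import tempfile

# Hypothetical top-level config; the key name "lineage_db_path" is an assumption.
config = {"lineage_db_path": os.path.join(tempfile.mkdtemp(), "lineage.sqlite")}


def open_lineage_db(config):
    """Return a connection if a path is configured, else None (logging off)."""
    path = config.get("lineage_db_path")
    if path is None:
        return None
    conn = sqlite3.connect(path)  # creates the file if it does not exist yet
    conn.execute(
        "CREATE TABLE IF NOT EXISTS operation_log "
        "(operation TEXT, output_id INTEGER, output TEXT)")
    return conn


# During the pipeline run: write rows as outputs are produced, then close.
conn = open_lineage_db(config)
conn.execute("INSERT INTO operation_log VALUES (?, ?, ?)",
             ("summarize", 1, "summary of doc a"))
conn.commit()
conn.close()

# Later, from any process: reopen the same file and read the log back.
conn = sqlite3.connect(config["lineage_db_path"])
logged = conn.execute(
    "SELECT operation, output_id, output FROM operation_log").fetchall()
conn.close()
```

If the path is absent from the config, the helper returns None and the pipeline can skip logging entirely.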
Is there a big reason to keep the lineage data out-of-band?
I'd rather save lineage info inside the items, so that an outside system that gets the final output dump has access to it directly (without a join). What's the drawback of doing that?
I think sqlite output is interesting in the context of https://github.com/ucbepic/docetl/issues/104 btw :)
Also, potentially for storing the intermediate data more efficiently?
I think saving it to a database makes it significantly more queryable. Otherwise, constructing forward traces will involve a bunch of for loops to go through the outputs and see which ids contain the source id. Similarly, constructing backward traces will require lots of wrangling.
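For comparison, this is roughly what a forward trace looks like when lineage only lives inline in the items. The _source_ids key and the sample data are hypothetical. Each stage needs a full scan of its outputs.

```python
# Hypothetical outputs with lineage stored inline under an assumed "_source_ids" key.
map_outputs = [
    {"id": "m1", "_source_ids": ["a"]},
    {"id": "m2", "_source_ids": ["b"]},
    {"id": "m3", "_source_ids": ["c"]},
]
reduce_outputs = [
    {"id": "r1", "_source_ids": ["m1", "m2"]},
    {"id": "r2", "_source_ids": ["m2", "m3"]},
]


def forward_trace(source_id, stages):
    """Follow source_id forward through each stage, scanning every output."""
    frontier = {source_id}
    reached = []
    for outputs in stages:
        next_frontier = set()
        for item in outputs:
            # An item is downstream if any of its sources is in the frontier.
            if frontier & set(item["_source_ids"]):
                next_frontier.add(item["id"])
        reached.append(sorted(next_frontier))
        frontier = next_frontier
    return reached


trace_b = forward_trace("b", [map_outputs, reduce_outputs])
```

This works for small outputs, but every new question (backward traces, filtering by several sources) needs its own loop.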
Well, that depends on what happens with the output. If it's just a json, yes. But if you insert it into something like Elasticsearch, then having the metadata / lineage inline is super useful. So maybe both?
If we had output plugins, and could write multiple outputs with different plugins, then this could be handled at the output stage:
pipeline:
  steps: ...
  output:
    - json:
        path: my-pipeline-output.json
    - sqlite:
        path: metadata.sqlite
        keys:
          - source-file
          - page
    - elasticsearch:
        url: http://localhost:9200/
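One way such output plugins could be dispatched, purely as a sketch: a registry maps each plugin name in the output list to a writer function, and the options under that name are passed through as keyword arguments. The function names, the WRITERS registry, and the metadata table are invented here; nothing like this exists in DocETL as far as this thread shows. The elasticsearch writer is left out because it would need an external client.

```python
import json
import os
import sqlite3
import tempfile


def write_json(items, path):
    """Dump all items to a single JSON file."""
    with open(path, "w") as f:
        json.dump(items, f)


def write_sqlite(items, path, keys):
    """Store only the selected metadata keys, one row per item."""
    conn = sqlite3.connect(path)
    # Quote column names so keys like "source-file" are valid identifiers.
    columns = ", ".join(f'"{k}" TEXT' for k in keys)
    conn.execute(f"CREATE TABLE IF NOT EXISTS metadata ({columns})")
    placeholders = ", ".join("?" for _ in keys)
    conn.executemany(
        f"INSERT INTO metadata VALUES ({placeholders})",
        [tuple(str(item.get(k)) for k in keys) for item in items])
    conn.commit()
    conn.close()


# Hypothetical plugin registry: plugin name -> writer function.
WRITERS = {"json": write_json, "sqlite": write_sqlite}


def write_outputs(items, output_config):
    """Run every configured output plugin over the same final items."""
    for entry in output_config:
        for name, options in entry.items():
            WRITERS[name](items, **options)


out_dir = tempfile.mkdtemp()
json_path = os.path.join(out_dir, "my-pipeline-output.json")
sqlite_path = os.path.join(out_dir, "metadata.sqlite")
output_config = [
    {"json": {"path": json_path}},
    {"sqlite": {"path": sqlite_path, "keys": ["source-file", "page"]}},
]
items = [{"source-file": "a.pdf", "page": 1, "text": "hello"}]
write_outputs(items, output_config)
```

The output_config list mirrors the shape of the output section in the YAML above once it is parsed.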
Whoops, sorry I missed this. I like your operator spec, but I think supporting an Elasticsearch integration as a plugin can be done later down the line. Most people use DocETL locally, and I think the sqlite interface is a great start for them.
Reduce Operation Lineage
From discord:
One use case I'm really interested in is [pre]computing a set of "reports" / outputs from a large set of documents, and then being able to reuse that computation when I filter documents to the applicable reports that have only those documents as "Sources"
i.e., if the full corpus is a, b, c, d, e, f -> generates reports 1 (a, b, c) + 2 (b, c, d) + 3 (c, d, e) + 4 (d, e, f), and then I want to see the "reports" contributed by docs d, e, f = 2, 3, 4
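The example above can be worked through in a few lines of Python. The function names are made up. Note there are two possible readings of the request: reports that any of the chosen docs contributed to (which gives 2, 3, 4 as stated), and reports whose sources are only the chosen docs (which would give just 4).

```python
from collections import defaultdict

# Report id -> set of source documents, taken from the example above.
reports = {
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},
    3: {"c", "d", "e"},
    4: {"d", "e", "f"},
}

# Inverted index: document -> reports it contributed to.
doc_to_reports = defaultdict(set)
for report_id, sources in reports.items():
    for doc in sources:
        doc_to_reports[doc].add(report_id)


def reports_touching(docs):
    """Reports with at least one source among docs."""
    return sorted(set().union(*(doc_to_reports[d] for d in docs)))


def reports_within(docs):
    """Reports whose sources are all among docs."""
    allowed = set(docs)
    return sorted(r for r, sources in reports.items() if sources <= allowed)


touching = reports_touching(["d", "e", "f"])
within = reports_within(["d", "e", "f"])
```

Either reading needs the same lineage information: the list of source documents per report.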
My proposal is to support a lineage param in the output. Then, for every document in the output, there should be a key opname_lineage with a list of kv pairs for all the keys in lineage, for all documents in the group that the output document was derived from.
Querying Pipeline Lineage
It would be nice to log all the pipeline lineage to sqlite & have users be able to query it (e.g., find all the reports contributed by certain upstream/input docs). We'd have to think of a good data model & query patterns.
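As a starting point for the data model discussion, here is a sketch that ties the two ideas together: a reduce that attaches an opname_lineage list built from the configured lineage keys, followed by a flat sqlite table that makes "which reports came from these docs" a single query. The helper names, the lineage table layout, and the sample documents are assumptions, and the count field stands in for the real reduce output.

```python
import sqlite3
from itertools import groupby

docs = [
    {"source_file": "a.pdf", "page": 1, "topic": "x"},
    {"source_file": "b.pdf", "page": 3, "topic": "x"},
    {"source_file": "c.pdf", "page": 2, "topic": "y"},
]


def reduce_with_lineage(docs, op_name, group_key, lineage_keys):
    """Group docs by group_key and attach {op_name}_lineage to each output."""
    outputs = []
    ordered = sorted(docs, key=lambda d: d[group_key])  # groupby needs sorted input
    for value, group in groupby(ordered, key=lambda d: d[group_key]):
        group = list(group)
        outputs.append({
            group_key: value,
            "count": len(group),  # placeholder for the real reduce result
            f"{op_name}_lineage": [{k: d[k] for k in lineage_keys} for d in group],
        })
    return outputs


reports = reduce_with_lineage(docs, "summarize", "topic", ["source_file", "page"])

# Assumed data model: one row per (output, contributing input document).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineage "
             "(operation TEXT, output_key TEXT, source_file TEXT, page INTEGER)")
for report in reports:
    conn.executemany(
        "INSERT INTO lineage VALUES (?, ?, ?, ?)",
        [("summarize", report["topic"], entry["source_file"], entry["page"])
         for entry in report["summarize_lineage"]])


def reports_from(conn, source_files):
    """Output keys of all reports that any of source_files contributed to."""
    marks = ", ".join("?" for _ in source_files)
    rows = conn.execute(
        f"SELECT DISTINCT output_key FROM lineage "
        f"WHERE source_file IN ({marks}) ORDER BY output_key",
        list(source_files))
    return [row[0] for row in rows]
```

A single flat table keeps the query pattern simple; whether one table per operation or one shared table is better is exactly the open data-model question.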