Open shcheklein opened 6 days ago
These days I quite often do this:
if "dclm-raw-text" not in datasets:
    (
        DataChain.from_dataset("dclm-index")
        .settings(cache=True)
        .limit(1)
        .gen(extract, output={"file": File, "json": dict})
        .save("dclm-raw-text")
    )
to avoid running that code again if the dataset is ready.
The downside is that I still need to run it from time to time (e.g. when I change params, or when something changes at its source - dclm-index in this case).
I think we can make save() analyze the dependencies (including the query) and skip the run when nothing has changed (behind a flag, or by default?).
This would add real value over basic data processing libs, because we are able to analyze the graph of dependencies.
It seems like a good problem to solve. At the same time, it feels like it will require a higher-level abstraction such as "task" or "step". It would be great to brainstorm this a bit.
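To seed that brainstorm, here is a minimal sketch of the kind of check save() could perform. It is plain Python and does not use any DataChain API: `fingerprint`, `save_if_changed`, and the `registry` dict are hypothetical names made up for illustration, and the source dataset version is assumed to be available as a simple value.

```python
import hashlib
import json


def fingerprint(source_name, source_version, params, func):
    """Hash the inputs the result depends on (hypothetical helper).

    Covers the source dataset name and version, the query params, and the
    bytecode of the processing function. Bytecode is only a crude proxy for
    "the code changed": it does not capture constants or called helpers.
    """
    payload = {
        "source": source_name,
        "source_version": source_version,
        "params": params,
        "func": func.__code__.co_code.hex(),
    }
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


def save_if_changed(registry, name, fp, build):
    """Call build() only if the fingerprint stored for `name` differs from fp.

    `registry` stands in for wherever dataset metadata would be kept.
    Returns True if build() ran, False if it was skipped.
    """
    if registry.get(name) == fp:
        return False
    build()
    registry[name] = fp
    return True
```

With something like this, the manual `if "dclm-raw-text" not in datasets` guard goes away: an unchanged source, query, and function give the same fingerprint and the run is skipped, while a new dclm-index version or different params give a new fingerprint and trigger a re-run.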