iterative / datachain

AI-dataframe to enrich, transform and analyze data from cloud storages for ML training and LLM apps
https://docs.datachain.ai
Apache License 2.0
710 stars 39 forks

New persist() method #361

Open dreadatour opened 2 weeks ago

dreadatour commented 2 weeks ago

Follow-up to https://github.com/iterative/datachain/issues/327

Sometimes it is useful to save the intermediate state of a chain. Operations are lazy: chains are not executed immediately, and intermediate results are not stored.

For example, if we want to create dc_filtered_1 and dc_embeddings from dc without saving the intermediate result, the dc chain will be executed twice, once for each child.
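A rough sketch of the problem, to make it concrete. The `LazyChain` class below is a toy stand-in written only for this illustration (it is not the DataChain API, and its `persist()` is just an assumption about how such a method could behave): every child re-runs the parent's work unless the parent is materialized first.

```python
from typing import Callable, Iterable, List


class LazyChain:
    """Toy lazy chain (hypothetical, NOT the DataChain API)."""

    def __init__(self, source: Callable[[], Iterable[int]]):
        # The source is re-invoked on every execution: nothing is cached.
        self._source = source

    def map(self, fn: Callable[[int], int]) -> "LazyChain":
        return LazyChain(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable[[int], bool]) -> "LazyChain":
        return LazyChain(lambda: (x for x in self._source() if pred(x)))

    def persist(self) -> "LazyChain":
        # Assumed behaviour: execute once, keep the rows, and return a
        # chain that reads the stored rows instead of recomputing them.
        rows = list(self._source())
        return LazyChain(lambda: rows)

    def collect(self) -> List[int]:
        return list(self._source())


calls = {"expensive": 0}


def expensive(x: int) -> int:
    """Stands in for a costly step; counts how often it runs."""
    calls["expensive"] += 1
    return x * 10


dc = LazyChain(lambda: range(5)).map(expensive)

# Without persisting: both children execute the parent chain again.
dc_filtered_1 = dc.filter(lambda x: x > 10).collect()
dc_embeddings = dc.map(lambda x: x + 1).collect()
calls_without = calls["expensive"]

# With persisting: the parent is executed once and then reused.
calls["expensive"] = 0
dc_persisted = dc.persist()
dc_filtered_1_p = dc_persisted.filter(lambda x: x > 10).collect()
dc_embeddings_p = dc_persisted.map(lambda x: x + 1).collect()
calls_with = calls["expensive"]

print(calls_without, calls_with)
```

In this toy version the expensive step runs for every row once per child without `persist()`, and only once per row with it, while both variants produce the same results.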

This is already possible with the save() method without the name param, and we also have the exec() method, but persist() looks like a better, more descriptive name for this operation.

Once the persist() method is implemented, we may want to make the name param in the save() method mandatory.

mattseddon commented 2 weeks ago

How about materialise instead of persist? Just a suggestion.

rlamy commented 2 weeks ago

.persist() is the name of the method in the dataframe API standard. I think that's what we should use - assuming it works exactly as described in the standard.