iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

datasets: delta lake and huggingface #10363

Open dberenbaum opened 6 months ago

dberenbaum commented 6 months ago

Following up on https://github.com/iterative/dvc/issues/10313 and the related new features for specifying datasets as dependencies, we can add more types of supported datasets, namely Delta Lake tables and Hugging Face datasets.

This would let dvc track these types of datasets as dependencies using their own native versioning, without downloading or caching anything.

Delta Lake example:

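from dvc.api import get_dataset
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Look up the dataset tracked by dvc, then read the table as of its recorded timestamp.
ds_info = get_dataset("mytable")
df = spark.read.format("delta").option("timestampAsOf", ds_info["timestamp"]).table(ds_info["name"])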

Hugging Face example:

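from datasets import load_dataset
from dvc.api import get_dataset

# Load the Hugging Face dataset pinned to the revision recorded by dvc.
ds_info = get_dataset("mydataset")
dataset = load_dataset(ds_info["name"], revision=ds_info["rev"])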
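
To illustrate what "native versioning without downloading" could mean for dependency checks, here is a rough sketch. Everything in it is hypothetical: `dataset_changed`, `VERSION_KEYS`, and the `type` field are made up for illustration and are not part of dvc's API; only the `name`, `timestamp`, and `rev` fields mirror the examples above. The idea is that dvc would only compare the recorded version identifier with the current one, never the data itself.

```python
# Hypothetical sketch: none of these names are part of dvc's API.
# Maps an assumed dataset type to the field holding its native version.
VERSION_KEYS = {
    "delta": "timestamp",  # Delta Lake: time-travel timestamp
    "hf": "rev",           # Hugging Face: repository revision
}


def dataset_changed(recorded: dict, current: dict) -> bool:
    """Return True if the dataset's native version differs from the recorded one."""
    if recorded["type"] != current["type"] or recorded["name"] != current["name"]:
        return True
    key = VERSION_KEYS[recorded["type"]]
    return recorded[key] != current[key]


recorded = {"type": "hf", "name": "mydataset", "rev": "abc123"}
current = {"type": "hf", "name": "mydataset", "rev": "def456"}
# The revision differs, so the dependency counts as changed.
print(dataset_changed(recorded, current))
```

Since only small metadata dicts are compared, a stage depending on the dataset could be marked as outdated without fetching any of the underlying files.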

skshetry commented 6 months ago

Don't ping random people like this in GitHub issues. And this issue is not very beginner-friendly.