Open dberenbaum opened 6 months ago
Following up on https://github.com/iterative/dvc/issues/10313 and related new features specifying datasets as dependencies, we can add more types of supported datasets, such as Delta Lake tables and Hugging Face datasets.
This could allow these types of datasets to be set as dependencies that DVC tracks through their own native versioning, without downloading or caching anything.
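To make the idea concrete, here is a minimal sketch of what pinning a dataset by its native version could look like. Everything in it is hypothetical and not part of DVC: the `DatasetPin` class, its `kind` label, and the `has_changed` helper are illustration only. The `timestamp` and `rev` fields simply mirror the keys used in the examples in this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetPin:
    """Hypothetical record of a dataset dependency pinned by its native version."""

    name: str                        # dataset identifier, e.g. a table or repo name
    kind: str                        # hypothetical type label, e.g. "delta" or "hf"
    timestamp: Optional[str] = None  # native version for timestamp-based sources
    rev: Optional[str] = None        # native version for revision-based sources

    def version(self) -> Optional[str]:
        # Whichever native version field is set identifies the pinned state.
        return self.rev if self.rev is not None else self.timestamp


def has_changed(pin: DatasetPin, current_version: str) -> bool:
    # Only version metadata is compared; no data is downloaded or cached.
    return pin.version() != current_version


pin = DatasetPin(name="mytable", kind="delta", timestamp="2024-01-01T00:00:00Z")
print(has_changed(pin, "2024-01-01T00:00:00Z"))  # False
print(has_changed(pin, "2024-02-01T00:00:00Z"))  # True
```

The point of the sketch is that a dependency check only needs the stored version string and the source's current version, so a pipeline stage could be invalidated without touching the data itself.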
Delta Lake example:
from dvc.api import get_dataset

# Assumes an active SparkSession is available as `spark`.
ds_info = get_dataset("mytable")
df = spark.read.format("delta").option("timestampAsOf", ds_info["timestamp"]).table(ds_info["name"])
Hugging Face example:
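from datasets import load_dataset
from dvc.api import get_dataset

ds_info = get_dataset("mydataset")
# `load_dataset` takes the pinned version through its `revision` parameter.
dataset = load_dataset(ds_info["name"], revision=ds_info["rev"])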
Don't ping random people like this in GitHub issues. And this issue is not very beginner-friendly.