huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.24k stars 2.69k forks

Delta Tables usage using Datasets Library #5219

Open reichenbch opened 2 years ago

reichenbch commented 2 years ago

Feature request

Add support for the Delta format to the Datasets library. This would extend the library's usefulness from machine learning into data engineering as well.

Motivation

We know the Datasets library can read CSV, JSON, Parquet, and other file formats, but it would be great if it could also work with Delta tables (the Delta format). Delta offers features such as time travel, layout optimization, and improved query performance, all of which help with data engineering.

This would extend the Datasets library from a machine learning utility to a data engineering utility and broaden its scope from there. I use the Datasets library in all my use cases, and as my role and workload expand, compatibility with it is something I don't want to lose.

Your contribution

I would love to work on this feature, even if it has to be picked up from scratch, including design paradigms and patterns. I have a basic idea of Delta Live Tables and could brush up on it easily for this feature.

lhoestq commented 2 years ago

Hi! Interesting :) Can you provide concrete examples of cases where it would be useful?

reichenbch commented 1 year ago

A few example blogs and posts that might help with this:

  1. https://hevodata.com/learn/databricks-delta-tables/
  2. https://docs.databricks.com/delta/index.html

Basically, we are looking at using the Datasets library with Delta Lake tables.

lhoestq commented 1 year ago

datasets can already read/write Parquet from/to cloud storage using fsspec. If I understand correctly, it should be possible to load Parquet files as Delta Lake tables, no? :) Or is there something missing?

zhenyu commented 1 year ago

@lhoestq As I understand it, a Delta Lake table is a set of Parquet files together with metadata that supports ACID transactions. For example, file 1 contains v0.1 of record A while file 2 contains v0.2 of record A. I assume Hugging Face Datasets would delegate reading and writing Delta tables to a third-party library, maybe pyarrow. Correct me if I am wrong, @reichenbch.
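
To make the description above concrete, here is a minimal sketch of why plain Parquet loading is not quite enough: the metadata is a transaction log that says which Parquet files belong to each version of the table. The commit contents, file names, and the `active_files` helper below are hypothetical and heavily simplified (a real Delta log carries more action types, checkpoints, and fields), so treat this as an illustration rather than a description of any library's implementation.

```python
import json

# Hypothetical, simplified commit log. In a real table each entry would be
# one newline-delimited JSON file in the table's _delta_log directory.
COMMITS = [
    # version 0: file 1 (record A, v0.1) is added
    '{"add": {"path": "part-0001.parquet"}}',
    # version 1: file 2 (record A, v0.2) replaces file 1
    '{"remove": {"path": "part-0001.parquet"}}\n{"add": {"path": "part-0002.parquet"}}',
]


def active_files(commits, version=None):
    """Replay commits 0..version and return the files in that snapshot."""
    if version is None:
        version = len(commits) - 1  # default to the latest version
    files = set()
    for commit in commits[: version + 1]:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)


print(active_files(COMMITS))             # ['part-0002.parquet']
print(active_files(COMMITS, version=0))  # ['part-0001.parquet']
```

Reading every Parquet file in the directory would return both versions of record A. A reader that understands the log returns only the files that are live at the requested version, which is what makes time travel possible.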

I also assume people are asking for versioning of Hugging Face datasets. But I assume Hugging Face delegates this function to GitHub, and it is not a key requirement for public datasets. It is actually a key function of MLOps, and I am not sure whether Hugging Face would like to expand into that area.