huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Incremental dataset (e.g. `.push_to_hub(..., append=True)`) #6290

Open Wauplin opened 11 months ago

Wauplin commented 11 months ago

Feature request

Have the possibility to do `ds.push_to_hub(..., append=True)`.

Motivation

Requested in this comment and this comment. Discussed internally on Slack.

Your contribution

What I suggest for parquet datasets is to use `CommitOperationCopy` + `CommitOperationDelete` from `huggingface_hub`:

  1. list the existing files
  2. copy each existing file to its new name, e.g. `parquet-0001-of-0004` to `parquet-0001-of-0005` (updating the shard count)
  3. delete the old files like `parquet-0001-of-0004`
  4. generate and add the last parquet file `parquet-0005-of-0005`

=> make a single commit with all commit operations at once

I think it should be quite straightforward to implement. Happy to review a PR (it may conflict with the ongoing "1 commit push_to_hub" PR https://github.com/huggingface/datasets/pull/6269).
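The steps above can be sketched with `huggingface_hub`'s commit operation classes. This is only an illustration of the plan, not the eventual `datasets` implementation: the `plan_append` helper name and the exact `parquet-XXXX-of-YYYY` shard-naming scheme are assumptions taken from the comment above.

```python
import re
from huggingface_hub import (
    CommitOperationAdd,
    CommitOperationCopy,
    CommitOperationDelete,
)

# Hypothetical shard-name pattern, matching the naming used in the comment above.
SHARD_RE = re.compile(r"^parquet-(\d{4})-of-(\d{4})$")

def plan_append(existing_files, new_shard_content):
    """Build the commit operations for appending one parquet shard.

    Hypothetical helper: renames every existing shard to reflect the new
    total shard count (step 2: copy, step 3: delete) and adds the newly
    generated shard as the last one (step 4).
    """
    # 1. list the existing shard files
    shards = sorted(f for f in existing_files if SHARD_RE.match(f))
    new_total = len(shards) + 1
    ops = []
    for f in shards:
        idx = SHARD_RE.match(f).group(1)
        # 2. copy each file to its new name with the updated shard count...
        ops.append(CommitOperationCopy(
            src_path_in_repo=f,
            path_in_repo=f"parquet-{idx}-of-{new_total:04d}",
        ))
        # 3. ...then delete the old name
        ops.append(CommitOperationDelete(path_in_repo=f))
    # 4. add the new last shard (bytes or a local file path both work)
    ops.append(CommitOperationAdd(
        path_in_repo=f"parquet-{new_total:04d}-of-{new_total:04d}",
        path_or_fileobj=new_shard_content,
    ))
    return ops
```

All operations would then be sent as a single commit, e.g. `HfApi().create_commit(repo_id=..., repo_type="dataset", operations=plan_append(files, shard_bytes), commit_message="Append shard")`, so the dataset never exists in a half-renamed state.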

ZachNagengast commented 11 months ago

Yeah, I think waiting for #6269 would be best, or branching from it. For reference, this PR for our LAION dataset bot is progressing pretty well and does something similar via the HF Hub: https://github.com/LAION-AI/Discord-Scrapers/pull/2.

nqyy commented 1 month ago

Is there any update on this?

Elfsong commented 1 week ago

Is there any update on this?

Wauplin commented 1 week ago

No update so far on this feature request, but for broader context, this announcement will help with incremental datasets: https://huggingface.co/blog/xethub-joins-hf :)