pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Integration with Hugging Face Hub #46000

Open lvwerra opened 2 years ago

lvwerra commented 2 years ago

Hi Pandas devs and Pandas community 🤗

I am reaching out to you to see if you would be interested in an integration with the Hugging Face Hub. We have been hosting datasets on the Hub for a while and now have close to 3000 public datasets, not counting the private ones.

In both the models and datasets areas of the Hugging Face ecosystem we use the push_to_hub functionality to upload datasets and models to the Hub in one line. Similarly, these assets can be loaded from the Hub in a single line with the load_dataset and from_pretrained functions, respectively.

We wanted to ask whether you would be interested in adding the huggingface_hub dependency so that any DataFrame could be pushed to and pulled from the Hub.

Here are a few use-cases where such a functionality would add value:

Here is what such an integration could look like:

# upload a DataFrame to the Hub:
df.push_to_hub("my_dataset", org="my_org")

# load a DataFrame from the Hub:
df = DataFrame.from_hub("my_dataset", org="my_org")

Here is the documentation on publishing files on the Hugging Face Hub using the huggingface_hub library: https://github.com/huggingface/huggingface_hub/tree/main/src/huggingface_hub#publish-files-to-the-hub

I am curious to hear what you think about this and please let me know if I can clarify anything!

cc @osanseviero @julien-c

jbrockmendel commented 2 years ago

> We wanted to ask you whether you would be interested to add the huggingface_hub

We're very wary of adding dependencies and extending an already-overstuffed API. Is something like your_module.push_to_hub(df, "my_dataset", org="my_org") not viable?
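
A rough sketch of what such a standalone helper could look like, assuming a hypothetical `upload` callable that receives the serialized bytes and a repo id. Nothing here is an existing pandas or Hugging Face API; in practice the callable might wrap something like `HfApi.upload_file` from huggingface_hub, and CSV is chosen only to keep the example dependency-free.

```python
import io
from typing import Callable

import pandas as pd


def push_to_hub(
    df: pd.DataFrame,
    name: str,
    org: str,
    upload: Callable[[bytes, str], None],
) -> str:
    """Serialize `df` to CSV and pass it to `upload` (a hypothetical hook).

    Returns the "<org>/<name>" repo id the data was sent under.
    """
    repo_id = f"{org}/{name}"
    payload = df.to_csv(index=False).encode("utf-8")
    upload(payload, repo_id)
    return repo_id


# Stand-in uploader: keeps payloads in a dict instead of contacting a server.
fake_hub: dict[str, bytes] = {}


def fake_upload(payload: bytes, repo_id: str) -> None:
    fake_hub[repo_id] = payload


df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
repo_id = push_to_hub(df, "my_dataset", org="my_org", upload=fake_upload)

# Read the stored bytes back to check the round trip.
restored = pd.read_csv(io.BytesIO(fake_hub[repo_id]))
```

Because the helper lives outside pandas, the DataFrame API stays unchanged and the Hub client remains an optional dependency of the third-party module.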

mroeschke commented 2 years ago

Agreed with the hesitancy about adding this directly to pandas.

For context, pandas-datareader (a public/private data-sourcing feature in a similar spirit) used to be packaged with pandas but was spun off into its own package: https://pandas-datareader.readthedocs.io/en/latest/

Given that, I think this would be best implemented as a third party package and included in the ecosystem docs.

twoertwein commented 2 years ago

Pandas already supports many protocols thanks to fsspec (writing to and loading from AWS, GCS, ...). If you manage to integrate the "Hugging Face Hub protocol" in fsspec, you get pandas support for free :)

edit: this would take care of the transmission from a user to your hub, but the format might not be what you want (unless you are fine with a csv/json/pickle/excel version of a dataframe).
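
To illustrate the fsspec route: pandas passes URLs whose protocol fsspec recognizes on to fsspec. The minimal sketch below uses fsspec's built-in in-memory filesystem (`memory://`) as a stand-in for a remote backend. The path is made up, and the example assumes fsspec is installed alongside pandas.

```python
import pandas as pd

df = pd.DataFrame({"text": ["hello", "world"], "label": [0, 1]})

# "memory://" is fsspec's in-memory filesystem, used here in place of a
# remote protocol such as "s3://" or "gs://". The path itself is arbitrary.
url = "memory://demo/my_dataset.csv"

# pandas sees the protocol prefix and opens the target through fsspec.
df.to_csv(url, index=False)

# Reading goes through the same fsspec-backed path.
round_tripped = pd.read_csv(url)
```

A protocol registered with fsspec would be used the same way, with only the URL prefix changing. As noted above, the file format is still one of the formats pandas can write.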

julien-c commented 2 years ago

@twoertwein that's a pretty cool idea!

lhoestq commented 2 months ago

You can now find some early documentation on hf:// + pandas here: https://huggingface.co/docs/hub/datasets-pandas :)

import pandas as pd

df = pd.read_parquet("hf://datasets/username/my_dataset/data.parquet")

And automatic code snippets on HF as well:

[Two images, not reproduced here]