huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
18.72k stars 2.58k forks

Allow downloading just some columns of a dataset #4114

Open osanseviero opened 2 years ago

osanseviero commented 2 years ago

Is your feature request related to a problem? Please describe.
Some people are interested in doing label analysis of a CV dataset without downloading all the images. Downloading the whole dataset does not always make sense for this kind of use case.

Describe the solution you'd like
Be able to download just some columns of a dataset, for example:

load_dataset("huggan/wikiart", columns=["artist", "genre"])

Although this might make things a bit complicated in terms of local caching of datasets.

lhoestq commented 2 years ago

In the general case you can't always reduce the quantity of data to download, since you can't parse CSV or JSON data without downloading the whole files, right? ^^ However, we could explore this case by case, I guess.

osanseviero commented 2 years ago

Actually, for CSV, pandas has usecols, which allows loading a subset of columns in a more efficient way afaik, but yes, you're right, this might be more complex than I thought.
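For reference, a minimal sketch of what usecols does in pandas. The CSV content and column names below are made up for illustration. Note that this only limits which columns are parsed and kept in memory; the file itself still has to be available in full, which is the limitation raised above.

```python
import io

import pandas as pd

# Hypothetical CSV standing in for a dataset's metadata file.
csv_text = (
    "image_path,artist,genre\n"
    "img/0.jpg,Monet,landscape\n"
    "img/1.jpg,Degas,portrait\n"
)

# usecols restricts parsing to the listed columns; the other columns are
# skipped, but the whole file is still read from the source.
labels = pd.read_csv(io.StringIO(csv_text), usecols=["artist", "genre"])

print(list(labels.columns))
```

So usecols helps with parsing time and memory, but by itself it would not reduce the amount of data downloaded.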

lukasugar commented 4 months ago

Bumping the visibility of this :) Is there a recommended way of doing this?

lhoestq commented 4 months ago

Passing columns=[...] to load_dataset() in streaming mode does work if the dataset is in Parquet format, but for other formats it's either not possible or not implemented.

oza75 commented 2 months ago

I tried using columns=['bambara'] on the dataset oza75/bambara-tts, which is in Parquet format, but it does not work. This feature is really useful because sometimes you don't want to download the whole dataset but just a few columns.

Ravi2712 commented 1 month ago

It doesn't work for a dataset in Parquet format. Are we missing something?

lhoestq commented 1 month ago

It only works with streaming=True. When not streaming, it downloads the full files locally before reading the data.
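Putting the replies above together, a minimal sketch of the streaming approach might look like the following. The helper name stream_columns is hypothetical, and the sketch assumes the dataset is stored as Parquet on the Hub and that the requested columns exist; behavior may differ across versions of the library.

```python
def stream_columns(repo_id, columns, split="train"):
    """Return a streaming dataset restricted to the given columns.

    Hypothetical helper: it combines streaming=True with columns=[...],
    which, per the discussion above, works for Parquet datasets only.
    """
    # Imported inside the function so the helper can be defined even where
    # the `datasets` package is not installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True, columns=columns)
```

For example, stream_columns("huggan/wikiart", ["artist", "genre"]) would return an iterable dataset, and looping over ds.take(3) would fetch the first few rows. Without streaming=True, the full files are downloaded locally first.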