Open osanseviero opened 2 years ago
In the general case you can't always reduce the quantity of data to download, since you can't parse CSV or JSON data without downloading the whole files, right? ^^ However, we could explore this case by case, I guess.
Actually, for CSV, pandas has `usecols`, which allows loading a subset of columns in a more efficient way afaik, but yes, you're right, this might be more complex than I thought.
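For reference, here is a minimal sketch of the `usecols` approach mentioned above. The in-memory CSV and its column names are made up for illustration. Note that `usecols` only limits which columns are kept after parsing; the parser still has to scan every line, so this saves memory and parsing time rather than download bandwidth.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real dataset file.
csv_text = "image_path,label,width\na.png,cat,64\nb.png,dog,32\n"

# usecols keeps only the listed columns in the resulting DataFrame.
# Every line of the input is still read, so the whole file must be available.
labels = pd.read_csv(io.StringIO(csv_text), usecols=["label"])

print(list(labels.columns))      # ['label']
print(labels["label"].tolist())  # ['cat', 'dog']
```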
Bumping the visibility of this :) Is there a recommended way of doing this?
Passing `columns=[...]` to `load_dataset()` in streaming mode does work if the dataset is in Parquet format, but for other formats it's either not possible or not implemented.
I tried using `columns=['bambara']` on this dataset, `oza75/bambara-tts`, which is in Parquet, but it does not work. This feature is really useful because sometimes you don't want to download the whole dataset but just a few columns.
It doesn't work for the dataset with `parquet` format. Are we missing something?
It only works with `streaming=True`. When not streaming, it downloads the full files locally before reading the data.
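Putting the replies above together, a hedged sketch of the streaming call could look like the following. The helper name `stream_selected_columns` and the default `split="train"` are assumptions made for illustration; the `columns` and `streaming` arguments are the ones described in this thread, and per the replies above they only help for Parquet-backed datasets.

```python
def stream_selected_columns(repo_id, columns, split="train"):
    """Return a streaming dataset restricted to `columns`.

    Hypothetical helper: per this thread, column selection is only
    expected to work for Parquet-backed datasets in streaming mode.
    """
    # Imported lazily so that defining this helper does not require the library.
    from datasets import load_dataset

    # streaming=True avoids downloading the full files up front;
    # `columns` is passed through as a loader option, as described above.
    return load_dataset(repo_id, split=split, streaming=True, columns=columns)
```

For the dataset mentioned earlier, this would be called as `stream_selected_columns("oza75/bambara-tts", ["bambara"])` and iterated lazily, e.g. with `next(iter(ds))`. It needs network access, and whether that repository exposes a `train` split is an assumption here.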
Is your feature request related to a problem? Please describe. Some people are interested in doing label analysis of a CV dataset without downloading all the images. Downloading the whole dataset does not always make sense for this kind of use case.
Describe the solution you'd like Be able to just download some columns of a dataset, such as doing
Although this might make things a bit complicated in terms of local caching of datasets.