Closed (jacobmarks closed this issue 6 months ago)
Hi @jacobmarks, thanks for opening this feature request! Glad to hear you're building an integration with the Hub :)
Filtering by tag is actually already supported in list_datasets, but it is not well documented:
>>> from huggingface_hub import list_datasets
>>> datasets = list(list_datasets(filter="fiftyone"))
>>> len(datasets)
12
>>> datasets[0].id
'jamarks/my-action-recognition-dataset'
>>> datasets[0].tags
['language:en', 'license:mit', 'action-recognition', 'fiftyone', 'video', 'region:us']
The filter argument is very versatile and can filter by more than just tags. I opened a PR (https://github.com/huggingface/huggingface_hub/pull/2266) to explicitly add a tags parameter to list_datasets and document it.
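For the use case of building a list of repo ids from a tag (for example to feed an autocomplete), a minimal sketch could look like the following. It relies only on the `filter=` call shown above; the helper names `repo_ids_with_tag` and `fetch_tagged_repo_ids` are hypothetical, and the stand-in objects exist only so the example runs without network access.

```python
from types import SimpleNamespace
from typing import Iterable, List


def repo_ids_with_tag(datasets: Iterable, tag: str) -> List[str]:
    """Return the sorted ids of datasets whose `tags` contain `tag`."""
    # `tags` may be missing or None on some entries, so fall back to an empty list.
    return sorted(d.id for d in datasets if tag in (getattr(d, "tags", None) or []))


def fetch_tagged_repo_ids(tag: str) -> List[str]:
    """Query the Hub for datasets carrying `tag` (needs network access)."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import list_datasets

    # `filter=` is the call shown above and is applied server-side; the
    # client-side check is a cheap guard in case `filter` matches non-tag fields.
    return repo_ids_with_tag(list_datasets(filter=tag), tag)


# Offline demonstration with stand-in objects (ids other than the first are made up).
fake = [
    SimpleNamespace(id="jamarks/my-action-recognition-dataset", tags=["fiftyone", "video"]),
    SimpleNamespace(id="someone/unrelated", tags=["license:mit"]),
    SimpleNamespace(id="someone/no-tags", tags=None),
]
print(repo_ids_with_tag(fake, "fiftyone"))
```

Calling `fetch_tagged_repo_ids("fiftyone")` would perform the real query against the Hub; the returned list can then be cached and offered to users instead of asking them to type a repo id.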
Proposed API
The list_datasets() API is incredibly useful, but it currently doesn't allow for filtering by dataset tags. It would be great to add this as an optional argument, in line with author, language, etc:
Motivating Problem
I am building FiftyOne's Hugging Face Hub integration. There is no notion of a library for datasets on Hugging Face the way there is for models, so we are using the tag fiftyone to signify that a dataset is compatible with FiftyOne.
I want to build a plugin for the integration that would allow people to load a dataset from the Hugging Face Hub into FiftyOne from the UI alone. Rather than requiring users to input the repo_id themselves, it would be ideal to pre-generate a list of available datasets to offer the user in an autocomplete.
Workaround
I'm currently sending a request to the URLs on the HF website and processing the results: