huggingface / huggingface_hub

The official Python client for the Huggingface Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Filter by tags with list_datasets() API #2265

Closed jacobmarks closed 6 months ago

jacobmarks commented 6 months ago

Proposed API

The list_datasets() API is incredibly useful, but it currently doesn't allow for filtering by dataset tags. It would be great to add this as an optional argument, in line with author, language, etc:

from huggingface_hub import HfApi

api = HfApi()

# List all datasets with tag "my-tag"
api.list_datasets(tags="my-tag")

# List all datasets with tag "my-tag1" or tag "my-tag2"
api.list_datasets(tags=["my-tag1", "my-tag2"])

Motivating Problem

I am building FiftyOne's Hugging Face Hub integration. There is no notion of a library for datasets on Hugging Face the way there is for models, so we are using the tag fiftyone to signify that a dataset is compatible with FiftyOne.

I want to build a plugin for the integration that lets people load a dataset from the Hugging Face Hub into FiftyOne from the UI alone. Rather than requiring users to type the repo_id themselves, it would be ideal to pre-generate a list of available datasets and offer it as autocomplete suggestions.

Workaround

I'm currently sending requests to the dataset listing URLs on the HF website and parsing the results:

from bs4 import BeautifulSoup
import json
import requests

FIFTYONE_HUB_URL_TEMPLATE = (
    "https://huggingface.co/datasets?other=fiftyone&sort=trending&p={i}"
)

def get_fiftyone_hub_datasets():
    """Scrape the Hub's dataset listing pages for datasets tagged `fiftyone`."""
    i = 0
    all_dataset_ids = []

    while True:
        response = requests.get(FIFTYONE_HUB_URL_TEMPLATE.format(i=i))

        try:
            soup = BeautifulSoup(response.content, "lxml")

            # The listing payload is embedded as JSON in the third hydrater div
            div = soup.find_all("div", class_="SVELTE_HYDRATER contents")[2]

            data_props = div.get("data-props")
            if not data_props:
                # No payload on this page: stop rather than loop forever
                break

            data = json.loads(data_props)
            datasets = data["initialValues"]["datasets"]
            dataset_ids = [dataset["id"] for dataset in datasets]
            print(f"Page {i}: Found {len(dataset_ids)} datasets.")
            if not dataset_ids:
                break
            all_dataset_ids.extend(dataset_ids)
        except Exception:
            # Page layout changed or the response could not be parsed
            break

        i += 1

    return sorted(all_dataset_ids)
Wauplin commented 6 months ago

Hi @jacobmarks, thanks for opening this feature request! Glad to hear you're building an integration with the Hub :)

Filtering by tag is actually already supported in list_datasets, it's just not well documented:

>>> from huggingface_hub import list_datasets
>>> datasets = list(list_datasets(filter="fiftyone"))
>>> len(datasets)
12
>>> datasets[0].id
'jamarks/my-action-recognition-dataset'
>>> datasets[0].tags
['language:en', 'license:mit', 'action-recognition', 'fiftyone', 'video', 'region:us']

The filter argument is very versatile: it can filter by tags, among other things. I opened a PR (https://github.com/huggingface/huggingface_hub/pull/2266) to explicitly add a tags parameter to list_datasets and document it.
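
For the autocomplete use case described in the issue, the scraping workaround could probably be replaced by something like the sketch below. It relies only on list_datasets(filter=...) and the .id attribute shown in the snippet above. The helper name get_tagged_dataset_ids and its lister parameter are hypothetical, not part of huggingface_hub; lister is there only so the helper can be tried without network access.

```python
from typing import Callable, Iterable, List, Optional


def get_tagged_dataset_ids(
    tag: str,
    lister: Optional[Callable[..., Iterable]] = None,
) -> List[str]:
    """Return the sorted repo ids of all Hub datasets carrying `tag`.

    `lister` is a hypothetical injection point (an assumption of this
    sketch, not a huggingface_hub feature). When omitted, it defaults to
    `huggingface_hub.list_datasets`.
    """
    if lister is None:
        # Imported lazily so the helper can be exercised offline with a stub.
        from huggingface_hub import list_datasets as lister

    # `filter` accepts a tag string, as in the reply above; each returned
    # item exposes the repo id as `.id`.
    return sorted(info.id for info in lister(filter=tag))
```

Calling get_tagged_dataset_ids("fiftyone") should then return the same kind of list that get_fiftyone_hub_datasets() builds, without depending on the website's HTML layout.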