huggingface / huggingface_hub

The official Python client for the Huggingface Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0
1.82k stars 470 forks source link

Support getting downloads from all time #2325

Open osanseviero opened 3 weeks ago

osanseviero commented 3 weeks ago

This can be done by using the /models endpoint and adding ?expand:downloadsAllTime. As it's a frequent request, it might be useful to support this directly in the library. Right now, I can do it with internal utils

from huggingface_hub.utils._pagination import paginate
api_path = "https://huggingface.co/api/models"
models = list(paginate(api_path, params={"expand": "downloadsAllTime"}, headers={}))
Wauplin commented 3 weeks ago

Agree on the need to provide that.

I would rather add the expand parameter to list_models/list_datasets/list_spaces since these methods already exist but the problem is that the returned value is often very shallow: only the mongo id + the repo id + the expand values are returned, making it unsuitable for ModelInfo/DatasetInfo/SpaceInfo. I'm open to suggestion on how to make this consistent with the existing API.

For the record:

(those lists can be updated in the future)

julien-c commented 3 weeks ago

maybe some kind of get_models_attributes(property="downloadsAllTime", ...)? not sure

Wauplin commented 3 weeks ago

Made some tests and to be consistent we should also update model_info/dataset_info/space_info as well (request for a single repo). Thinking twice about it, I would rather reuse the existing methods since all filters/search features remain the same when using expand[]=.

What I suggest for models:

  1. we add a expand parameter to model_info and list_models that accepts a list of strings (defaults to None)
  2. we check that existing parameters like full/cardData/fetch_config and the new expand parameter are not set at the same time since the API cannot fulfill both.
  3. we make private / tags / downloads / likes optional in ModelInfo. At the moment we consider them as always returned by the server which is not the case.
  4. we had downloads_all_time attribute to ModelInfo since it'll be a quite common use case.
  5. when expand is passed, we forward it to the server. Response is parsed as a ModelInfo (or list of ModelInfo) which is mostly empty but that's fine. Only the requested fields + id are populated.
>>> from huggingface_hub import model_info

>>> model = model_info("gpt2", expand=["downloadsAllTime"])
>>> model
ModelInfo(id='openai-community/gpt2', author=None, sha=None, created_at=None, last_modified=None, private=None, gated=None, disabled=None, downloads=None, downloads_all_time=585955419, likes=None, library_name=None, tags=None, pipeline_tag=None, mask_token=None, card_data=None, widget_data=None, model_index=None, config=None, transformers_info=None, siblings=None, spaces=None, safetensors=None)
>>> model.downloads_all_time
585955419

Anticipating the question, I prefer ModelInfo.downloads_all_time rather than ModelInfo.downloadsAllTime for consistency with others attributes + following pythonic convention. We already do that for lastModified, createdAt, cardData.

The "only" thing that bothers me with this solution is that it's not explicit to the user which expand values they can use just looking at the method signature. If they pass an invalid expand value, the API returns an error with the list of available properties which is already nice. A partial solution to that would be to add correct type annotation + docstring but happy to hear if you have a better idea @osanseviero @julien-c

julien-c commented 3 weeks ago

sounds like a good plan.

For 3., would it make sense to return something else than a ModelInfo in that case? like ModelInfoProperties where everything is optional? (having all properties optional in ModelInfo sounds dirty to me but not sure if it makes sense in Python world)

Wauplin commented 3 weeks ago

having all properties optional in ModelInfo sounds dirty to me but not sure if it makes sense in Python world

Not very common IMO but that's the path we went for with the ModelInfo class already:

    id: str
    author: Optional[str]
    sha: Optional[str]
    created_at: Optional[datetime]
    last_modified: Optional[datetime]
    private: bool
    gated: Optional[Literal["auto", "manual", False]]
    disabled: Optional[bool]
    downloads: int
    likes: int
    library_name: Optional[str]
    tags: List[str]
    pipeline_tag: Optional[str]
    mask_token: Optional[str]
    card_data: Optional[ModelCardData]
    widget_data: Optional[Any]
    model_index: Optional[Dict]
    config: Optional[Dict]
    transformers_info: Optional[TransformersInfo]
    siblings: Optional[List[RepoSibling]]
    spaces: Optional[List[str]]
    safetensors: Optional[SafeTensorsInfo]

Having a few more optional fields would at least make things consistent even though it's not so satisfying. I'll check in parallel how I can draft a ModelInfoProperties class and see if it makes more sense.

Wauplin commented 6 days ago

Updated PR for it (https://github.com/huggingface/huggingface_hub/pull/2333) with everything we talked above. It's ready for review :)