Open osanseviero opened 3 weeks ago
Agree on the need to provide that.
I would rather add the expand
parameter to list_models
/list_datasets
/list_spaces
since these methods already exist but the problem is that the returned value is often very shallow: only the mongo id + the repo id + the expand values are returned, making it unsuitable for ModelInfo
/DatasetInfo
/SpaceInfo
. I'm open to suggestion on how to make this consistent with the existing API.
For the record:
/models
supports [author, cardData, config, createdAt, disabled, downloads, downloadsAllTime, gated, gitalyUid, lastModified, library_name, likes, mask_token, model-index, pipeline_tag, private, safetensors, sha, siblings, spaces, tags, transformersInfo, widgetData]
/datasets
supports [author, cardData, citation, createdAt, disabled, description, downloads, downloadsAllTime, gated, gitalyUid, lastModified, likes, paperswithcode_id, private, siblings, sha, tags]
/spaces
supports [author, cardData, datasets, disabled, gitalyUid, lastModified, createdAt, likes, private, runtime, sdk, siblings, sha, subdomain, tags, models]
(those lists can be updated in the future)
maybe some kind of get_models_attributes(property="downloadsAllTime", ...)
? not sure
Made some tests and to be consistent we should also update model_info
/dataset_info
/space_info
as well (request for a single repo). Thinking twice about it, I would rather reuse the existing methods since all filters/search features remain the same when using expand[]=
.
What I suggest for models:
expand
parameter to model_info
and list_models
that accepts a list of strings (defaults to None)full
/cardData
/fetch_config
and the new expand
parameter are not set at the same time since the API cannot fulfill both.private
/ tags
/ downloads
/ likes
optional in ModelInfo
. At the moment we consider them as always returned by the server which is not the case.downloads_all_time
attribute to ModelInfo
since it'll be a quite common use case.expand
is passed, we forward it to the server. Response is parsed as a ModelInfo
(or list of ModelInfo) which is mostly empty but that's fine. Only the requested fields + id
are populated.>>> from huggingface_hub import model_info
>>> model = model_info("gpt2", expand=["downloadsAllTime"])
>>> model
ModelInfo(id='openai-community/gpt2', author=None, sha=None, created_at=None, last_modified=None, private=None, gated=None, disabled=None, downloads=None, downloads_all_time=585955419, likes=None, library_name=None, tags=None, pipeline_tag=None, mask_token=None, card_data=None, widget_data=None, model_index=None, config=None, transformers_info=None, siblings=None, spaces=None, safetensors=None)
>>> model.downloads_all_time
585955419
Anticipating the question, I prefer ModelInfo.downloads_all_time
rather than ModelInfo.downloadsAllTime
for consistency with others attributes + following pythonic convention. We already do that for lastModified
, createdAt
, cardData
.
The "only" thing that bothers me with this solution is that it's not explicit to the user which expand
values they can use just looking at the method signature. If they pass an invalid expand
value, the API returns an error with the list of available properties which is already nice. A partial solution to that would be to add correct type annotation + docstring but happy to hear if you have a better idea @osanseviero @julien-c
sounds like a good plan.
For 3., would it make sense to return something else than a ModelInfo
in that case? like ModelInfoProperties
where everything is optional? (having all properties optional in ModelInfo sounds dirty to me but not sure if it makes sense in Python world)
having all properties optional in ModelInfo sounds dirty to me but not sure if it makes sense in Python world
Not very common IMO but that's the path we went for with the ModelInfo class already:
id: str
author: Optional[str]
sha: Optional[str]
created_at: Optional[datetime]
last_modified: Optional[datetime]
private: bool
gated: Optional[Literal["auto", "manual", False]]
disabled: Optional[bool]
downloads: int
likes: int
library_name: Optional[str]
tags: List[str]
pipeline_tag: Optional[str]
mask_token: Optional[str]
card_data: Optional[ModelCardData]
widget_data: Optional[Any]
model_index: Optional[Dict]
config: Optional[Dict]
transformers_info: Optional[TransformersInfo]
siblings: Optional[List[RepoSibling]]
spaces: Optional[List[str]]
safetensors: Optional[SafeTensorsInfo]
Having a few more optional fields would at least make things consistent even though it's not so satisfying. I'll check in parallel how I can draft a ModelInfoProperties
class and see if it makes more sense.
Updated PR for it (https://github.com/huggingface/huggingface_hub/pull/2333) with everything we talked above. It's ready for review :)
This can be done by using the
/models
endpoint and adding?expand:downloadsAllTime
. As it's a frequent request, it might be useful to support this directly in the library. Right now, I can do it with internal utils