huggingface / huggingface_hub

The official Python client for the Huggingface Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

KeyError: 'multilinguality' when calling DatasetSearchArguments() #1280

Closed animator closed 1 year ago

animator commented 1 year ago

Describe the bug

KeyError: 'multilinguality' when calling DatasetSearchArguments()

Reproduction

from huggingface_hub import DatasetSearchArguments
dataset_args = DatasetSearchArguments()

Logs

Error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/var/folders/1h/lqt86wdn4nq9z9h_c4q7s4gr0000gn/T/ipykernel_43840/980484241.py in <module>
      1 from huggingface_hub import DatasetSearchArguments
----> 2 dataset_args = DatasetSearchArguments()

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/huggingface_hub/hf_api.py in __init__(self, api)
    548     def __init__(self, api: Optional["HfApi"] = None):
    549         self._api = api if api is not None else HfApi()
--> 550         tags = self._api.get_dataset_tags()
    551         super().__init__(tags)
    552         self._process_models()

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/huggingface_hub/hf_api.py in get_dataset_tags(self)
    669         hf_raise_for_status(r)
    670         d = r.json()
--> 671         return DatasetTags(d)
    672 
    673     @_deprecate_list_output(version="0.14")

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/huggingface_hub/utils/endpoint_helpers.py in __init__(self, dataset_tag_dictionary)
    365             "license",
    366         ]
--> 367         super().__init__(dataset_tag_dictionary, keys)

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/huggingface_hub/utils/endpoint_helpers.py in __init__(self, tag_dictionary, keys)
    298             keys = list(self._tag_dictionary.keys())
    299         for key in keys:
--> 300             self._unpack_and_assign_dictionary(key)
    301 
    302     def _unpack_and_assign_dictionary(self, key: str):

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/huggingface_hub/utils/endpoint_helpers.py in _unpack_and_assign_dictionary(self, key)
    303         "Assignes nested attributes to `self.key` containing information as an `AttributeDictionary`"
    304         setattr(self, key, AttributeDictionary())
--> 305         for item in self._tag_dictionary[key]:
    306             ref = getattr(self, key)
    307             item["label"] = (

KeyError: 'multilinguality'

System info

- huggingface_hub version: 0.11.1
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.9.6
- Running in iPython ?: Yes
- iPython shell: ZMQInteractiveShell
- Running in notebook ?: Yes
- Running in Google Colab ?: No
- Token path ?: /Users/x/.huggingface/token
- Has saved token ?: False
- Configured git credential helpers: osxkeychain
- FastAI: N/A
- Tensorflow: 2.9.1
- Torch: N/A
- Jinja2: 3.0.1
- Graphviz: N/A
- Pydot: N/A

chawins commented 1 year ago

Same issue here. Same huggingface_hub version.

Wauplin commented 1 year ago

Hi @animator @chawins, thanks for reporting this issue and sorry for the late reply. The issue comes from a server-side change (search is being revamped). I made a PR, https://github.com/huggingface/huggingface_hub/pull/1300, to make the huggingface_hub API more robust to server-side changes.

Overall, usage of this feature is quite low and it is legacy code. At some point it will be completely revisited, but in the meantime I hope this fix will be enough for you to use it. Please remember that it is mainly meant for exploratory purposes.

(see also related discussion: https://github.com/huggingface/huggingface_hub/pull/1250)
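
For readers wondering what "more robust to server-side changes" can look like, here is a minimal sketch. It is not the change made in the PR. It only mirrors the loop from the traceback above using plain dictionaries, and tolerates a key that the server no longer returns. The function name `unpack_tags` and the `"id"` field of each item are assumptions made for illustration; only `"label"` appears in the traceback.

```python
from typing import Dict, List


def unpack_tags(
    tag_dictionary: Dict[str, List[dict]], keys: List[str]
) -> Dict[str, Dict[str, str]]:
    """Hypothetical helper: build one label -> id mapping per requested key."""
    unpacked: Dict[str, Dict[str, str]] = {}
    for key in keys:
        # .get(key, []) yields an empty list when the server response lacks
        # the key, instead of raising KeyError as tag_dictionary[key] would.
        items = tag_dictionary.get(key, [])
        # The "id" field is an assumed item shape, not confirmed by the thread.
        unpacked[key] = {item["label"]: item["id"] for item in items}
    return unpacked


# Simulated server response that no longer contains "multilinguality".
server_response = {"license": [{"label": "mit", "id": "license:mit"}]}
result = unpack_tags(server_response, ["multilinguality", "license"])
```

With this approach a missing key simply becomes an empty mapping, so constructing the helper still succeeds and the remaining tags stay usable.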

chawins commented 1 year ago

Thanks for the quick response/fix @Wauplin! I tested the updated main branch, and DatasetSearchArguments seems to work now.

I ended up filtering via tags instead, which suits my needs better anyway. In case anyone is looking for a similar workaround, here's what I went with:

import huggingface_hub as hf_hub

hf_api = hf_hub.HfApi()
model_args = hf_hub.ModelSearchArguments()

# Filter server-side by task and library.
filt = hf_hub.ModelFilter(
    task=model_args.pipeline_tag.ImageClassification,
    library=model_args.library.PyTorch,
)
models = hf_api.list_models(filter=filt)
# hf_hub.DatasetSearchArguments() is buggy, so we search for
# "imagenet" in the tags instead (client-side, lazily evaluated)
models = filter(lambda m: any("imagenet" in t for t in m.tags), models)

Wauplin commented 1 year ago

Thanks for the feedback and for sharing the snippet!

Just to clarify, what ModelSearchArguments does is provide a helper to find the desired tag. But in the end, model_args.library.PyTorch is strictly the string "pytorch", and DatasetSearchArguments().dataset_name.imagenet is "imagenet".

So you could also do:

from huggingface_hub import HfApi, ModelFilter

hf_api = HfApi()
models = hf_api.list_models(
    filter=ModelFilter(task="image-classification", library="pytorch", trained_dataset="imagenet")
)

That's what I meant by saying that ModelSearchArguments and DatasetSearchArguments are purely for exploratory purposes. If you already know what you are looking for, you can do the search without using them. It saves you the ~10s it takes to initialize them. Hope that makes it clearer :)
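
To illustrate the point that these helpers only resolve to plain strings, here is a small sketch. `AttributeDict` is a hypothetical stand-in written for this example, not the library's own class, and it is populated only with the two values quoted in this thread.

```python
class AttributeDict(dict):
    """Hypothetical stand-in: a dict whose keys can be read as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            # Report unknown names the way normal attribute access would.
            raise AttributeError(name) from None


# Values taken from the discussion above; everything else is illustrative.
library = AttributeDict(PyTorch="pytorch")
dataset_name = AttributeDict(imagenet="imagenet")

library_tag = library.PyTorch         # attribute lookup returns a plain string
dataset_tag = dataset_name.imagenet   # same here
```

Because the lookups yield ordinary strings, passing the string literal directly to a filter is equivalent to going through the helper.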