huggingface / huggingface_hub

The official Python client for the Huggingface Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

Datasets with invalid repo_id #2289

Closed Taniya-Das closed 4 months ago

Taniya-Das commented 4 months ago

Describe the bug

We are fetching Hugging Face datasets and validating each repo_id as suggested in huggingface_validators. However, we came across some datasets on Hugging Face whose repo_id is invalid according to those guidelines. We would like to understand why these datasets do not follow the guidelines.

Here are some examples of such datasets:

"gorkaartola/ZS-train_SDG_Descriptions_S1-sentence_S2-SDGtitle_Negative_Sample_Filter-Only_Targets","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"gorkaartola/ZS-train_SDG_Descriptions_S1-sentence_S2-SDGtitle_Negative_Sample_Filter-Only_Indicators","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"autoevaluate/autoeval-staging-eval-project-pnr-svc__Turkish-Multiclass-Dataset-e6effb88-11345510","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"autoevaluate/autoeval-staging-eval-project-autoevaluate__zero-shot-classification-sample-c8bb9099-11","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"autoevaluate/autoeval-staging-eval-project-autoevaluate__zero-shot-classification-sample-18ef74e8-21","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"carlosejimenez/mscoco_train_2014_openai_clip-vit-base-patch32_image_caption_retrieval_pairs_2022-09-01","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"gorkaartola/ZS-train_SDG_Descriptions_S1-sentence_S2-SDGtitle_Negative_Sample_Filter-Only_Title_and_Headline","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"autoevaluate/autoeval-staging-eval-autoevaluate__zero-shot-classification-sample-autoevalu-a8cade-61","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"autoevaluate/autoeval-staging-eval-autoevaluate__zero-shot-classification-sample-autoevalu-40d85c-155","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"carlosejimenez/mscoco_train_2014_openai_clip-vit-base-patch32_image_image_retrieval_pairs_2022-09-13","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"autoevaluate/autoeval-eval-HadiPourmousa__TextSummarization-HadiPourmousa__TextSum-31dfb4-1463253931","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"autoevaluate/autoeval-eval-HadiPourmousa__TextSummarization-HadiPourmousa__TextSum-31dfb4-1463253932","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"autoevaluate/autoeval-staging-eval-autoevaluate__zero-shot-classification-sample-autoevalu-ef9f85-16606242","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"autoevaluate/autoeval-staging-eval-autoevaluate__zero-shot-classification-sample-autoevalu-1a41e5-16746268","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"carlosejimenez/mscoco_train_2014_openai_clip-vit-base-patch32_image_image_retrieval_pairs_2022-09-15","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"carlosejimenez/cc12m_openai-clip-vit-patch32_image_retrieval_top15_start1000000_end3500000_SHORT500K","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"autoevaluate/autoeval-staging-eval-autoevaluate__zero-shot-classification-sample-autoevalu-acab52-16766274","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"autoevaluate/autoeval-staging-eval-Tristan__zero_shot_classification_test-Tristan__zero_sh-3c39f7-16776275","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"autoevaluate/autoeval-staging-eval-Tristan__zero_shot_classification_test-Tristan__zero_sh-997db8-16786276","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"autoevaluate/autoeval-eval-autoevaluate__zero-shot-classification-sample-autoevalu-912bbb-1484454284","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"
"autoevaluate/autoeval-eval-autoevaluate__zero-shot-classification-sample-autoevalu-c3526e-1484354283","1_validation_error_for_DatasetCreate_platform_resource_identifier___The_platform_resource_identifier_for_HuggingFace_should_be_a_valid_repo_id__A_repo_id_should_be_between_1_and_96_characters___type_value_error_"

Reproduction

No response

Logs

No response

System info

- huggingface_hub version: 0.20.3
- Platform: Linux-6.5.0-35-generic-x86_64-with-glibc2.35
- Python version: 3.11.0rc1
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /home/taniya_das/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers: 
- FastAI: N/A
- Tensorflow: N/A
- Torch: N/A
- Jinja2: 3.1.3
- Graphviz: N/A
- Pydot: N/A
- Pillow: N/A
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: N/A
- pydantic: 1.10.15
- aiohttp: N/A
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /home/taniya_das/.cache/huggingface/hub
- HF_ASSETS_CACHE: /home/taniya_das/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/taniya_das/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
Wauplin commented 4 months ago

Hi @Taniya-Das, thanks for reporting. Could you share which code you're using and what is the exact error? I tried to reproduce the error locally but couldn't:

Get dataset_info (validates repo_id internally) => no error

>>> from huggingface_hub import dataset_info

>>> dataset_info("gorkaartola/ZS-train_SDG_Descriptions_S1-sentence_S2-SDGtitle_Negative_Sample_Filter-Only_Targets")
DatasetInfo(id='gorkaartola/ZS-train_SDG_Descriptions_S1-sentence_S2-SDGtitle_Negative_Sample_Filter-Only_Targets', author='gorkaartola', ...)

Explicitly validate repo_id => no error

>>> from huggingface_hub.utils import validate_repo_id

>>> validate_repo_id("gorkaartola/ZS-train_SDG_Descriptions_S1-sentence_S2-SDGtitle_Negative_Sample_Filter-Only_Targets")
Taniya-Das commented 4 months ago

Hi @Wauplin,

Thank you for looking into it. The error we see is: A_repo_id_should_be_between_1_and_96_characters___type_value_error. It is raised because the length of the repo_id exceeds 96 characters.

Here is our code for the validator:

import re

REPO_ID_ILLEGAL_CHARACTERS = re.compile(r"[^0-9a-zA-Z-_./]+")
MSG_PREFIX = "The platform_resource_identifier for HuggingFace should be a valid repo_id. "

def throw_error_on_invalid_identifier(platform_resource_identifier: str):
    """
    Throw a ValueError on an invalid repository identifier.

    Valid repo_ids:
        Between 1 and 96 characters.
        Either "repo_name" or "namespace/repo_name"
        [a-zA-Z0-9] or "-", "_", "."
        "--" and ".." are forbidden

    Refer to:
    https://huggingface.co/docs/huggingface_hub/package_reference/utilities#huggingface_hub.utils.validate_repo_id
    """
    repo_id = platform_resource_identifier
    if REPO_ID_ILLEGAL_CHARACTERS.search(repo_id):
        msg = "A repo_id should only contain [a-zA-Z0-9] or '-', '_', '.'"
        raise ValueError(MSG_PREFIX + msg)
    if not (1 < len(repo_id) < 96):
        msg = "A repo_id should be between 1 and 96 characters."
        raise ValueError(MSG_PREFIX + msg)
    if repo_id.count("/") > 1:
        msg = (
            "For new repositories, there should be a single forward slash in the repo_id ("
            "namespace/repo_name). Legacy repositories are without a namespace. This repo_id has "
            "too many forward slashes."
        )
        raise ValueError(MSG_PREFIX + msg)
    if ".." in repo_id:
        msg = "A repo_id may not contain multiple consecutive dots."
        raise ValueError(MSG_PREFIX + msg)
Wauplin commented 4 months ago

Oh I see. But what's suggested in these docs is to reuse huggingface_hub.utils.validate_repo_id, not reimplement it. If you are interested in implementation details, you can check out the source here.

For the record, the 96-character limit applies to the repo name, not to the whole repo_id. In general, the repo_id is composed of "namespace/repo_name", so a repo_id can be longer than 96 characters and still be valid. Hope that makes it clearer for you.
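To illustrate the difference, here is a minimal sketch of a check that applies the length limit to the repo name only. It is not the huggingface_hub implementation: `find_repo_id_problem`, `ALLOWED_PART` and `MAX_REPO_NAME_LENGTH` are hypothetical names, and the rules are only those quoted in the docstring above plus the clarification about the repo name. In real code, reusing `huggingface_hub.utils.validate_repo_id` is still the recommended route. Note also that `1 < len(repo_id) < 96` in the validator above uses strict comparisons, so it rejects lengths 1 and 96 even though its error message allows them.

```python
import re
from typing import Optional

# Hypothetical names for illustration; none of this is huggingface_hub API.
MAX_REPO_NAME_LENGTH = 96
ALLOWED_PART = re.compile(r"[0-9a-zA-Z._-]+")


def find_repo_id_problem(repo_id: str) -> Optional[str]:
    """Return a description of the first problem found, or None if none is found."""
    if repo_id.count("/") > 1:
        return "too many forward slashes"

    # Split into an optional namespace and the repo name.
    namespace, slash, repo_name = repo_id.rpartition("/")
    if slash and not namespace:
        return "empty namespace"
    if namespace and not ALLOWED_PART.fullmatch(namespace):
        return "illegal characters in namespace"

    # The length limit applies to the repo name only, not to the namespace.
    if not (1 <= len(repo_name) <= MAX_REPO_NAME_LENGTH):
        return "repo name should be between 1 and 96 characters"
    if not ALLOWED_PART.fullmatch(repo_name):
        return "illegal characters in repo name"
    if "--" in repo_id or ".." in repo_id:
        return "'--' and '..' are forbidden"
    return None


if __name__ == "__main__":
    long_id = (
        "gorkaartola/ZS-train_SDG_Descriptions_S1-sentence_S2-SDGtitle_"
        "Negative_Sample_Filter-Only_Targets"
    )
    # Prints the total length and the detected problem (None means no problem found).
    print(len(long_id), find_repo_id_problem(long_id))
```

With this split, the first dataset from the report is accepted because only the part after the slash is measured against the limit, while a repo name longer than 96 characters is still rejected.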

Taniya-Das commented 4 months ago

I see. Thank you for clarifying.

Wauplin commented 4 months ago

I'm closing this issue but let me know if you have more questions :)