Closed Taniya-Das closed 4 months ago
Hi @Taniya-Das, thanks for reporting. Could you share the code you're using and the exact error you get? I tried to reproduce the error locally but couldn't:
>>> from huggingface_hub import dataset_info
>>> dataset_info("gorkaartola/ZS-train_SDG_Descriptions_S1-sentence_S2-SDGtitle_Negative_Sample_Filter-Only_Targets")
DatasetInfo(id='gorkaartola/ZS-train_SDG_Descriptions_S1-sentence_S2-SDGtitle_Negative_Sample_Filter-Only_Targets', author='gorkaartola', ...)
>>> from huggingface_hub.utils import validate_repo_id
>>> validate_repo_id("gorkaartola/ZS-train_SDG_Descriptions_S1-sentence_S2-SDGtitle_Negative_Sample_Filter-Only_Targets")
Hi @Wauplin,
Thank you for looking into it.
The error we see is A_repo_id_should_be_between_1_and_96_characters___type_value_error.
The length of the repo_id exceeded 96 characters.
Here is our code for the validator:
import re

REPO_ID_ILLEGAL_CHARACTERS = re.compile(r"[^0-9a-zA-Z-_./]+")
MSG_PREFIX = "The platform_resource_identifier for HuggingFace should be a valid repo_id. "


def throw_error_on_invalid_identifier(platform_resource_identifier: str):
    """
    Throw a ValueError on an invalid repository identifier.

    Valid repo_ids:
        Between 1 and 96 characters.
        Either “repo_name” or “namespace/repo_name”
        [a-zA-Z0-9] or ”-”, ”_”, ”.”
        ”--” and ”..” are forbidden

    Refer to:
    https://huggingface.co/docs/huggingface_hub/package_reference/utilities#huggingface_hub.utils.validate_repo_id
    """
    repo_id = platform_resource_identifier
    if REPO_ID_ILLEGAL_CHARACTERS.search(repo_id):
        msg = "A repo_id should only contain [a-zA-Z0-9] or ”-”, ”_”, ”.”"
        raise ValueError(MSG_PREFIX + msg)
    if not (1 < len(repo_id) < 96):
        msg = "A repo_id should be between 1 and 96 characters."
        raise ValueError(MSG_PREFIX + msg)
    if repo_id.count("/") > 1:
        msg = (
            "For new repositories, there should be a single forward slash in the repo_id ("
            "namespace/repo_name). Legacy repositories are without a namespace. This repo_id has "
            "too many forward slashes."
        )
        raise ValueError(MSG_PREFIX + msg)
    if ".." in repo_id:
        msg = "A repo_id may not contain multiple consecutive dots."
        raise ValueError(MSG_PREFIX + msg)
Oh I see. But what these docs suggest is reusing huggingface_hub.utils.validate_repo_id, not reimplementing it. If you are interested in the implementation details, you can check out the source here.
For the record, the 96-character limit applies to the repo name, not the repo id. In general, the repo_id is composed of "namespace/repo_name". Hope that makes it clearer for you.
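To illustrate the distinction, here is a minimal sketch of a length check applied to the repo name only. It is not the huggingface_hub implementation: `split_repo_id` and `repo_name_length_ok` are hypothetical helper names, and it assumes the limit of 96 applies to whatever follows the last "/". The dataset id from this thread exceeds 96 characters as a whole, but its name part alone does not.

```python
from typing import Optional, Tuple

# Assumption for this sketch: the limit applies to the repo name only.
MAX_REPO_NAME_LENGTH = 96


def split_repo_id(repo_id: str) -> Tuple[Optional[str], str]:
    """Split "namespace/repo_name" into its parts (hypothetical helper).

    Legacy ids without a namespace yield (None, repo_name).
    """
    namespace, sep, name = repo_id.rpartition("/")
    return (namespace if sep else None), name


def repo_name_length_ok(repo_id: str) -> bool:
    """Check the length of the repo name, ignoring the namespace."""
    _, name = split_repo_id(repo_id)
    return 1 <= len(name) <= MAX_REPO_NAME_LENGTH


repo_id = (
    "gorkaartola/ZS-train_SDG_Descriptions_S1-sentence_S2-SDGtitle_"
    "Negative_Sample_Filter-Only_Targets"
)
print(len(repo_id) > MAX_REPO_NAME_LENGTH)  # whole id is over the limit
print(repo_name_length_ok(repo_id))         # name part is within the limit
```

In real code, calling huggingface_hub.utils.validate_repo_id directly, as suggested above, avoids having to keep a reimplementation like this in sync with the library.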
I see. Thank you for clarifying.
I'm closing this issue but let me know if you have more questions :)
Describe the bug
We are fetching Hugging Face datasets and validating each repo_id as suggested in huggingface_validators. However, we came across some datasets on Hugging Face whose repo_id is invalid under the above guidelines. We would like to understand why those datasets do not follow the guidelines.
Here are some examples of such datasets:
Reproduction
No response
Logs
No response
System info