huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Unescaped config names with special characters in the URL #2992

Open marcenacp opened 1 month ago

marcenacp commented 1 month ago

When playing with mlcroissant, we observed the following issue:

bigcode/commitpackft has two configs, c and c#. When going to https://huggingface.co/api/datasets/bigcode/commitpackft/parquet/c#/train/0.parquet, it lists https://huggingface.co/api/datasets/bigcode/commitpackft/parquet/c/train/0.parquet (instead of https://huggingface.co/api/datasets/bigcode/commitpackft/parquet/c%23/train/0.parquet).

Should dataset names / config names be escaped in the URLs?
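
For illustration, here is a minimal Python sketch of what escaping would look like. It only uses the standard library; `parquet_url` is a hypothetical helper, not a function from dataset-viewer. It also shows one plausible reason the `c#` URL collapses to `c`: an unescaped `#` starts the URL fragment, so everything after it is dropped from the path.

```python
from urllib.parse import quote, urlsplit

BASE = "https://huggingface.co/api/datasets/bigcode/commitpackft/parquet"


def parquet_url(config: str, split: str, filename: str) -> str:
    """Hypothetical helper: build a parquet URL with each segment escaped."""
    # safe="" also escapes "/", so a name can never add a path segment.
    segments = [quote(part, safe="") for part in (config, split, filename)]
    return BASE + "/" + "/".join(segments)


# Unescaped: "#" is parsed as the start of the fragment.
raw = BASE + "/c#/train/0.parquet"
parts = urlsplit(raw)
print(parts.path)      # the path stops at ".../parquet/c"
print(parts.fragment)  # the rest ends up in the fragment

# Escaped: "#" becomes "%23" and stays inside the path.
escaped = parquet_url("c#", "train", "0.parquet")
print(escaped)
```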

cc @severo @lhoestq

severo commented 1 month ago

Sure. Thanks for reporting.

severo commented 1 month ago

Related to https://github.com/huggingface/dataset-viewer/issues/2343 and, more generally, to https://github.com/huggingface/dataset-viewer/issues?q=is%3Aopen+is%3Aissue+label%3A%22name+issue%22

marcenacp commented 1 month ago

@severo Do I understand correctly that each service should:

  1. deserialize the names from the URL before using them
  2. call other services with serialized names in the URL?
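
To make the two steps concrete, here is a rough sketch of how I picture them, assuming percent-encoding is the serialization. The route shape, `handle_parquet_request`, and the `downstream.example` host are all made up for the example; none of them come from the dataset-viewer code.

```python
from urllib.parse import quote, unquote


def decode_name(segment: str) -> str:
    """Step 1: deserialize a name taken from an incoming URL segment."""
    return unquote(segment)


def encode_name(name: str) -> str:
    """Step 2: serialize a name before placing it in an outgoing URL."""
    return quote(name, safe="")


def handle_parquet_request(path: str) -> tuple[str, str]:
    # Hypothetical route shape: /parquet/{config}/{split}/{filename}
    _, config_seg, split_seg, _file_seg = path.strip("/").split("/")
    config = decode_name(config_seg)  # use the real name internally
    split = decode_name(split_seg)
    # Hypothetical downstream call: names are re-encoded on the way out.
    downstream = (
        "https://downstream.example/rows/"
        f"{encode_name(config)}/{encode_name(split)}"
    )
    return config, downstream


print(handle_parquet_request("/parquet/c%23/train/0.parquet"))
```

Names that contain no `%` decode to themselves, so a caller that still sends a plain name such as `c` would get the same result as before. Names that contain a literal `%` are the ambiguous case and would need separate handling.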

Do you see a way to fix it more gradually, service by service (e.g., starting with /parquet)? How can we make sure that we don't break anybody relying on names not being serialized in the URL?

Thanks!