huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

rows returns 404 instead of 500 on dataset error #1661

Closed severo closed 1 year ago

severo commented 1 year ago

For example, https://datasets-server.huggingface.co/rows?dataset=atomic&config=atomic&split=train returns 404 (Not found). It should instead return a detailed error that helps the user debug, as is done for all the cached responses. /rows is special because it's created on the fly, but it should follow the same logic and copy the error from the previous step.

It should return:

500
{
  "error": "Couldn't get the size of external files in `_split_generators` because a request failed:\n404 Client Error: Not Found for url: https://maartensap.com/atomic/data/atomic_data.tgz\nPlease consider moving your data files in this dataset repository instead (e.g. inside a data/ folder).",
  "cause_exception": "HTTPError",
  "cause_message": "404 Client Error: Not Found for url: https://maartensap.com/atomic/data/atomic_data.tgz",
  "cause_traceback": [
    "Traceback (most recent call last):\n",
    "  File \"/src/services/worker/src/worker/job_runners/config/parquet_and_info.py\", line 506, in raise_if_too_big_from_external_data_files\n    for i, size in enumerate(pool.imap_unordered(get_size, ext_data_files)):\n",
    "  File \"/usr/local/lib/python3.9/multiprocessing/pool.py\", line 870, in next\n    raise value\n",
    "  File \"/usr/local/lib/python3.9/multiprocessing/pool.py\", line 125, in worker\n    result = (True, func(*args, **kwds))\n",
    "  File \"/src/services/worker/src/worker/job_runners/config/parquet_and_info.py\", line 402, in _request_size\n    response.raise_for_status()\n",
    "  File \"/src/services/worker/.venv/lib/python3.9/site-packages/requests/models.py\", line 1021, in raise_for_status\n    raise HTTPError(http_error_msg, response=self)\n",
    "requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://maartensap.com/atomic/data/atomic_data.tgz\n"
  ],
  "copied_from_artifact": {
    "kind": "config-parquet-metadata",
    "dataset": "atomic",
    "config": "atomic",
    "split": null
  }
}
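
To illustrate what "copying the previous step error" could look like, here is a minimal sketch. It assumes a hypothetical in-memory cache keyed by (kind, dataset, config, split); `CacheEntry`, `get_rows_response` and the cache layout are made up for illustration and are not taken from the actual codebase.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple

# Hypothetical cache key: (kind, dataset, config, split).
CacheKey = Tuple[str, str, Optional[str], Optional[str]]


@dataclass
class CacheEntry:
    http_status: int
    content: Dict[str, Any]


def get_rows_response(
    cache: Dict[CacheKey, CacheEntry],
    dataset: str,
    config: str,
    split: str,
    previous_kind: str = "config-parquet-metadata",
) -> Tuple[int, Dict[str, Any]]:
    """Return (status, body) for an on-the-fly endpoint such as /rows."""
    # The previous step is config-level, so its cache key has no split.
    entry = cache.get((previous_kind, dataset, config, None))
    if entry is None:
        # Nothing is known about the previous step: 404 is the honest answer.
        return 404, {"error": "Not found."}
    if entry.http_status != 200:
        # Copy the cached error and record which artifact it came from.
        body = dict(entry.content)
        body["copied_from_artifact"] = {
            "kind": previous_kind,
            "dataset": dataset,
            "config": config,
            "split": None,
        }
        return entry.http_status, body
    # Placeholder: the real endpoint would read the rows here.
    return 200, {"rows": []}
```

With a cached 500 error for the previous step, this returns 500 with the copied error instead of a bare 404.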
severo commented 1 year ago

Note that /search, which is also created on the fly, has the expected behavior:

https://datasets-server.huggingface.co/search?dataset=atomic&config=atomic&split=train&query=aaa

500
{
  "error": "Couldn't get the size of external files in `_split_generators` because a request failed:\n404 Client Error: Not Found for url: https://maartensap.com/atomic/data/atomic_data.tgz\nPlease consider moving your data files in this dataset repository instead (e.g. inside a data/ folder).",
  "cause_exception": "HTTPError",
  "cause_message": "404 Client Error: Not Found for url: https://maartensap.com/atomic/data/atomic_data.tgz",
  "cause_traceback": [
    "Traceback (most recent call last):\n",
    "  File \"/src/services/worker/src/worker/job_runners/config/parquet_and_info.py\", line 506, in raise_if_too_big_from_external_data_files\n    for i, size in enumerate(pool.imap_unordered(get_size, ext_data_files)):\n",
    "  File \"/usr/local/lib/python3.9/multiprocessing/pool.py\", line 870, in next\n    raise value\n",
    "  File \"/usr/local/lib/python3.9/multiprocessing/pool.py\", line 125, in worker\n    result = (True, func(*args, **kwds))\n",
    "  File \"/src/services/worker/src/worker/job_runners/config/parquet_and_info.py\", line 402, in _request_size\n    response.raise_for_status()\n",
    "  File \"/src/services/worker/.venv/lib/python3.9/site-packages/requests/models.py\", line 1021, in raise_for_status\n    raise HTTPError(http_error_msg, response=self)\n",
    "requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://maartensap.com/atomic/data/atomic_data.tgz\n"
  ]
}
severo commented 1 year ago

Also: https://datasets-server.huggingface.co/opt-in-out-url?dataset=echarlaix/vqa-lxmert

returns 404, while it should return 501 (because it's on the block list), see

https://datasets-server.huggingface.co/parquet?dataset=echarlaix/vqa-lxmert

Wrong example, thanks @AndreaFrancis for correcting me.

AndreaFrancis commented 1 year ago

> Also: https://datasets-server.huggingface.co/opt-in-out-url?dataset=echarlaix/vqa-lxmert
>
> returns 404, while it should return 501 (because it's on the block list), see
>
> https://datasets-server.huggingface.co/parquet?dataset=echarlaix/vqa-lxmert

The URL for opt-in-out is wrong; it should be https://datasets-server.huggingface.co/opt-in-out-urls?dataset=echarlaix/vqa-lxmert

I don't think we should return 501, because the opt-in-out-url job runners don't depend on parquet processing (so there is no relation to the block list). Unless we make opt-in-out-url depend on config-parquet-and-info, or maybe we could implement another step to validate that the dataset is not blocked?
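
As a rough idea of what such a validation step might look like, here is a minimal sketch. It assumes the block list is a list of glob-style patterns; `DatasetInBlockListError`, `raise_if_blocked` and the example pattern are hypothetical names for illustration, not the existing implementation.

```python
from fnmatch import fnmatchcase
from typing import List


class DatasetInBlockListError(Exception):
    """Hypothetical error mapped to HTTP 501 for blocked datasets."""

    status_code = 501


def raise_if_blocked(dataset: str, blocked_datasets: List[str]) -> None:
    """Raise if the dataset name matches any pattern in the block list."""
    for pattern in blocked_datasets:
        # fnmatchcase does case-sensitive glob matching; "*" also matches "/".
        if fnmatchcase(dataset, pattern):
            raise DatasetInBlockListError(
                f"Dataset '{dataset}' matches blocked pattern '{pattern}'."
            )


# Example with an assumed block list entry:
try:
    raise_if_blocked("echarlaix/vqa-lxmert", ["echarlaix/*"])
except DatasetInBlockListError as err:
    print(err.status_code, err)
```

A step like this could run before any job runner, so every endpoint would answer consistently for a blocked dataset regardless of which steps it depends on.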

AndreaFrancis commented 1 year ago

Fixed by https://github.com/huggingface/datasets-server/pull/1747