severo closed this 1 year ago
Note that /search, which is also created on the fly, has the expected behavior:
https://datasets-server.huggingface.co/search?dataset=atomic&config=atomic&split=train&query=aaa
500
{
"error": "Couldn't get the size of external files in `_split_generators` because a request failed:\n404 Client Error: Not Found for url: https://maartensap.com/atomic/data/atomic_data.tgz\nPlease consider moving your data files in this dataset repository instead (e.g. inside a data/ folder).",
"cause_exception": "HTTPError",
"cause_message": "404 Client Error: Not Found for url: https://maartensap.com/atomic/data/atomic_data.tgz",
"cause_traceback": [
"Traceback (most recent call last):\n",
" File \"/src/services/worker/src/worker/job_runners/config/parquet_and_info.py\", line 506, in raise_if_too_big_from_external_data_files\n for i, size in enumerate(pool.imap_unordered(get_size, ext_data_files)):\n",
" File \"/usr/local/lib/python3.9/multiprocessing/pool.py\", line 870, in next\n raise value\n",
" File \"/usr/local/lib/python3.9/multiprocessing/pool.py\", line 125, in worker\n result = (True, func(*args, **kwds))\n",
" File \"/src/services/worker/src/worker/job_runners/config/parquet_and_info.py\", line 402, in _request_size\n response.raise_for_status()\n",
" File \"/src/services/worker/.venv/lib/python3.9/site-packages/requests/models.py\", line 1021, in raise_for_status\n raise HTTPError(http_error_msg, response=self)\n",
"requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://maartensap.com/atomic/data/atomic_data.tgz\n"
]
}
Also: https://datasets-server.huggingface.co/opt-in-out-url?dataset=echarlaix/vqa-lxmert
returns 404, while it should return 501 (because it's on the block list), see
https://datasets-server.huggingface.co/parquet?dataset=echarlaix/vqa-lxmert
Wrong example, thanks @AndreaFrancis for correcting me.
> Also: https://datasets-server.huggingface.co/opt-in-out-url?dataset=echarlaix/vqa-lxmert
> returns 404, while it should return 501 (because it's on the block list), see
> https://datasets-server.huggingface.co/parquet?dataset=echarlaix/vqa-lxmert
The URL for opt-in-out is wrong; it should be https://datasets-server.huggingface.co/opt-in-out-urls?dataset=echarlaix/vqa-lxmert
I don't think we should return 501, because the opt-in-out-url job runners don't depend on parquet processing (so they have no relation to the block list). Unless we make opt-in-out-url depend on config-parquet-and-info, or maybe we could implement another step to validate that the dataset is not blocked?
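The "another step to validate that the dataset is not blocked" idea could look roughly like the sketch below. This is only an illustration under assumptions, not the datasets-server code: `BLOCKED_DATASETS`, `is_blocked`, `check_not_blocked`, and the error message text are hypothetical, and the glob-style matching via `fnmatch` is an assumed design choice.

```python
from fnmatch import fnmatch
from typing import Optional, Tuple

# Hypothetical block list; the thread only says this dataset is blocked.
BLOCKED_DATASETS = ["echarlaix/vqa-lxmert"]


def is_blocked(dataset: str, blocked=BLOCKED_DATASETS) -> bool:
    """Return True if the dataset name matches any block-list pattern."""
    return any(fnmatch(dataset, pattern) for pattern in blocked)


def check_not_blocked(dataset: str) -> Optional[Tuple[int, dict]]:
    """Return a (status, body) error pair for blocked datasets, else None.

    A standalone check like this would not require the endpoint to depend
    on config-parquet-and-info.
    """
    if is_blocked(dataset):
        # 501 is the status the thread expects for block-listed datasets;
        # the message wording here is made up for the example.
        return 501, {"error": "The dataset is on the block list."}
    return None
```

An endpoint would call `check_not_blocked(dataset)` first and return the pair as-is when it is not `None`.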
For example, https://datasets-server.huggingface.co/rows?dataset=atomic&config=atomic&split=train returns 404 (Not found). It should instead return a detailed error that helps the user debug, as is done for all the cached responses. /rows is special because it's created on the fly, but it should follow the same logic and copy the previous step's error.
It should return the 500 response with the detailed error body shown above.
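The "copy the previous step error" logic proposed for /rows could be sketched as follows. This is a minimal illustration, not the actual datasets-server implementation: `CachedResponse`, `CACHE`, and `rows_response` are hypothetical names, the in-memory dict stands in for the real cache, and treating config-parquet-and-info as the previous step is an assumption based on the traceback.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class CachedResponse:
    """Hypothetical shape of a cached result for one processing step."""
    http_status: int
    content: dict


# Hypothetical in-memory cache keyed by (step, dataset, config).
CACHE: Dict[Tuple[str, str, str], CachedResponse] = {
    ("config-parquet-and-info", "atomic", "atomic"): CachedResponse(
        http_status=500,
        content={
            "error": "Couldn't get the size of external files in "
            "`_split_generators` because a request failed",
            "cause_exception": "HTTPError",
        },
    ),
}


def rows_response(dataset: str, config: str, split: str) -> Tuple[int, dict]:
    """Build the /rows response, propagating the previous step's error."""
    previous = CACHE.get(("config-parquet-and-info", dataset, config))
    if previous is None:
        # Nothing is known about this dataset/config: a plain 404 is fair.
        return 404, {"error": "Not found."}
    if previous.http_status != 200:
        # Copy the cached error instead of hiding it behind a 404.
        return previous.http_status, previous.content
    # Previous step succeeded: rows would be read from parquet here.
    return 200, {"rows": []}
```

With this sketch, `rows_response("atomic", "atomic", "train")` returns the cached 500 error body rather than a bare 404.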