huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Unexpected error when a split file has a bad format #2461

Open severo opened 9 months ago

severo commented 9 months ago

See https://huggingface.co/datasets/severo/test-one-split-broken/viewer/default/works

https://datasets-server.huggingface.co/rows?dataset=severo/test-one-split-broken&config=default&split=works returns {"error":"Unexpected error."}.

The reason is that config-parquet-and-info fails for the whole config because the file for the other split ('broken') is not in the right format: https://huggingface.co/datasets/severo/test-one-split-broken/blob/main/broken.json (its content is CSV, despite the .json extension)

Could we detect this and show a more meaningful error?
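
One possible direction, as a rough sketch only: sniff the first bytes of a file whose extension says JSON and check whether they parse as JSON at all. The helper names `looks_like_json` and `describe_format_mismatch` are hypothetical and are not part of dataset-viewer or `datasets`. The check is a heuristic and assumes the sample contains at least one complete JSON document or one complete JSON Lines record.

```python
import json
from typing import Optional


def looks_like_json(sample: str) -> bool:
    """Heuristic: does the sample parse as a JSON document or as JSON Lines?"""
    text = sample.strip()
    if not text:
        return False
    # Case 1: the whole sample is a single JSON object or array.
    try:
        return isinstance(json.loads(text), (dict, list))
    except json.JSONDecodeError:
        pass
    # Case 2: JSON Lines, where the first line alone is a JSON object or array.
    first_line = text.splitlines()[0]
    try:
        return isinstance(json.loads(first_line), (dict, list))
    except json.JSONDecodeError:
        return False


def describe_format_mismatch(filename: str, sample: str) -> Optional[str]:
    """Return a readable message if a JSON-named file does not contain JSON."""
    if filename.endswith((".json", ".jsonl")) and not looks_like_json(sample):
        return (
            f"File '{filename}' has a JSON extension, "
            "but its content does not parse as JSON."
        )
    return None


if __name__ == "__main__":
    # CSV content stored under a .json name, as in the 'broken' split.
    print(describe_format_mismatch("broken.json", "a,b\n1,2\n"))
    # A valid JSON Lines file produces no message.
    print(describe_format_mismatch("works.json", '{"a": 1}\n{"a": 2}\n'))
```

A check like this could run before the conversion step, so the error names the offending file instead of surfacing as "Unexpected error." for every split in the config.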

traceback
{
  "error": "An error occurred while generating the dataset",
  "cause_exception": "DatasetGenerationError",
  "cause_message": "An error occurred while generating the dataset",
  "cause_traceback": [
    "Traceback (most recent call last):\n",
    " File \"/src/services/worker/.venv/lib/python3.9/site-packages/datasets/packaged_modules/json/json.py\", line 144, in _generate_tables\n dataset = json.load(f)\n",
    " File \"/usr/local/lib/python3.9/json/__init__.py\", line 293, in load\n return loads(fp.read(),\n",
    " File \"/usr/local/lib/python3.9/json/__init__.py\", line 346, in loads\n return _default_decoder.decode(s)\n",
    " File \"/usr/local/lib/python3.9/json/decoder.py\", line 337, in decode\n obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n",
    " File \"/usr/local/lib/python3.9/json/decoder.py\", line 355, in raw_decode\n raise JSONDecodeError(\"Expecting value\", s, err.value) from None\n",
    "json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\n",
    "\nDuring handling of the above exception, another exception occurred:\n\n",
    "Traceback (most recent call last):\n",
    " File \"/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py\", line 1973, in _prepare_split_single\n for _, table in generator:\n",
    " File \"/src/services/worker/.venv/lib/python3.9/site-packages/datasets/packaged_modules/json/json.py\", line 147, in _generate_tables\n raise e\n",
    " File \"/src/services/worker/.venv/lib/python3.9/site-packages/datasets/packaged_modules/json/json.py\", line 121, in _generate_tables\n pa_table = paj.read_json(\n",
    " File \"pyarrow/_json.pyx\", line 308, in pyarrow._json.read_json\n",
    " File \"pyarrow/error.pxi\", line 154, in pyarrow.lib.pyarrow_internal_check_status\n",
    " File \"pyarrow/error.pxi\", line 91, in pyarrow.lib.check_status\n",
    "pyarrow.lib.ArrowInvalid: JSON parse error: Invalid value. in row 0\n",
    "\nThe above exception was the direct cause of the following exception:\n\n",
    "Traceback (most recent call last):\n",
    " File \"/src/services/worker/src/worker/job_manager.py\", line 154, in process\n job_result = self.job_runner.compute()\n",
    " File \"/src/services/worker/src/worker/job_runners/config/parquet_and_info.py\", line 1287, in compute\n compute_config_parquet_and_info_response(\n",
    " File \"/src/services/worker/src/worker/job_runners/config/parquet_and_info.py\", line 1202, in compute_config_parquet_and_info_response\n parquet_operations = convert_to_parquet(builder)\n",
    " File \"/src/services/worker/src/worker/job_runners/config/parquet_and_info.py\", line 828, in convert_to_parquet\n builder.download_and_prepare(\n",
    " File \"/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py\", line 1005, in download_and_prepare\n self._download_and_prepare(\n",
    " File \"/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py\", line 1100, in _download_and_prepare\n self._prepare_split(split_generator, **prepare_split_kwargs)\n",
    " File \"/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py\", line 1860, in _prepare_split\n for job_id, done, content in self._prepare_split_single(\n",
    " File \"/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py\", line 2016, in _prepare_split_single\n raise DatasetGenerationError(\"An error occurred while generating the dataset\") from e\n",
    "datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset\n"
  ]
}
severo commented 9 months ago

related to #1443

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.