shunk031 / huggingface-datasets_JGLUE

JGLUE: Japanese General Language Understanding Evaluation for huggingface datasets
https://huggingface.co/datasets/shunk031/JGLUE

improve error message #8

Open shunk031 opened 11 months ago

shunk031 commented 11 months ago

Closes #7. Related to #9.

The error message now includes the contents of the downloaded data, which makes it easy to confirm that the data is corrupt.

Traceback (most recent call last):
  File "/home/shunk031/.cache/huggingface/modules/datasets_modules/datasets/shunk031--JGLUE/eed55a4f1c560114b29786d11eed4fc793f35c3b2aa9efdf5352c0bd85016b36/JGLUE.py", line 406, in preprocess_for_marc_ja
    df = df[["review_body", "star_rating", "review_id"]]
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 3767, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5877, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5938, in _raise_if_missing
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['review_body', 'star_rating', 'review_id'], dtype='object')] are in the [columns]"

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/shunk031/lm-evaluation-harness/main.py", line 122, in <module>
    main()
  File "/home/shunk031/lm-evaluation-harness/main.py", line 91, in main
    results = evaluator.simple_evaluate(
  File "/home/shunk031/lm-evaluation-harness/lm_eval/utils.py", line 185, in _wrapper
    return fn(*args, **kwargs)
  File "/home/shunk031/lm-evaluation-harness/lm_eval/evaluator.py", line 82, in simple_evaluate
    task_dict = lm_eval.tasks.get_task_dict(tasks)
  File "/home/shunk031/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 373, in get_task_dict
    task_name_dict = {
  File "/home/shunk031/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 374, in <dictcomp>
    task_name: get_task(task_name)()
  File "/home/shunk031/lm-evaluation-harness/lm_eval/base.py", line 430, in __init__
    self.download(data_dir, cache_dir, download_mode)
  File "/home/shunk031/lm-evaluation-harness/lm_eval/base.py", line 459, in download
    self.dataset = datasets.load_dataset(
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/datasets/load.py", line 2133, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1717, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1027, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/home/shunk031/.cache/huggingface/modules/datasets_modules/datasets/shunk031--JGLUE/eed55a4f1c560114b29786d11eed4fc793f35c3b2aa9efdf5352c0bd85016b36/JGLUE.py", line 542, in _split_generators
    return self.__split_generators_marc_ja(dl_manager)
  File "/home/shunk031/.cache/huggingface/modules/datasets_modules/datasets/shunk031--JGLUE/eed55a4f1c560114b29786d11eed4fc793f35c3b2aa9efdf5352c0bd85016b36/JGLUE.py", line 510, in __split_generators_marc_ja
    split_dfs = preprocess_for_marc_ja(
  File "/home/shunk031/.cache/huggingface/modules/datasets_modules/datasets/shunk031--JGLUE/eed55a4f1c560114b29786d11eed4fc793f35c3b2aa9efdf5352c0bd85016b36/JGLUE.py", line 408, in preprocess_for_marc_ja
    raise ValueError(
ValueError: Invalid data loaded from /home/shunk031/.cache/huggingface/datasets/downloads/9607c6909a47d484324aa65d0f7523465575084911d938d039939eafe706542f:
              <?xml version="1.0" encoding="UTF-8"?>
0  <Error><Code>AccessDenied</Code><Message>Acces...
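
For readers who want the pattern in isolation, here is a minimal sketch of the idea behind the traceback above: when the expected columns are missing, the `KeyError` is re-raised as a `ValueError` that carries a preview of what was actually downloaded. This is not the code in `JGLUE.py`. It uses the standard `csv` module in place of pandas so that it runs without dependencies, and `load_marc_ja_rows`, `REQUIRED_COLUMNS`, and the 200-character preview length are hypothetical names and values chosen for illustration.

```python
import csv
import io

# Column names taken from the traceback; the constant name is an assumption.
REQUIRED_COLUMNS = ("review_body", "star_rating", "review_id")


def load_marc_ja_rows(raw_text: str, source_path: str) -> list[dict[str, str]]:
    """Parse TSV text and keep only the required columns.

    Hypothetical helper: raises ValueError with a preview of the payload
    when a data row lacks one of the required columns.
    """
    reader = csv.DictReader(io.StringIO(raw_text), delimiter="\t")
    rows = list(reader)
    try:
        return [{col: row[col] for col in REQUIRED_COLUMNS} for row in rows]
    except KeyError as err:
        # Show the start of the payload so that a corrupt download
        # (for example an XML error page) is obvious from the message.
        preview = raw_text[:200]
        raise ValueError(
            f"Invalid data loaded from {source_path}:\n{preview}"
        ) from err
```

Chaining with `from err` keeps the original `KeyError` in the traceback as the direct cause, so both the missing columns and the unexpected payload remain visible.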

shunk031 commented 11 months ago

CI is failing because the MARC-ja dataset cannot be downloaded at this time (ref. #9).