shunk031 / huggingface-datasets_JGLUE

JGLUE: Japanese General Language Understanding Evaluation for huggingface datasets
https://huggingface.co/datasets/shunk031/JGLUE

Hard to understand error when MARC-ja dataset is not downloaded correctly #7

Open shunk031 opened 11 months ago

shunk031 commented 11 months ago

The following error occurred when I ran lm-evaluation-harness (jp-stable/JGLUE) and the MARC-ja dataset did not download correctly. The root cause turned out to be poor network conditions, which made the download fail.

Selected Tasks: ['jsquad-1.1-0.3', 'jcommonsenseqa-1.1-0.3', 'jnli-1.1-0.3', 'marc_ja-1.1-0.3']
Using device 'cuda'
You are using config.init_device='cpu', but you can also use config.init_device="meta" with Composer + FSDP for fast initialization.
/home/shunk031/lm-evaluation-harness/lm_eval/tasks/ja/jsquad.py:75: FutureWarning: load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate
  self.jasquad_metric = datasets.load_metric(jasquad.__file__)
Downloading data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 8501.97it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 2054.69it/s]
Traceback (most recent call last):
  File "/home/shunk031/lm-evaluation-harness/main.py", line 122, in <module>
    main()
  File "/home/shunk031/lm-evaluation-harness/main.py", line 91, in main
    results = evaluator.simple_evaluate(
  File "/home/shunk031/lm-evaluation-harness/lm_eval/utils.py", line 185, in _wrapper
    return fn(*args, **kwargs)
  File "/home/shunk031/lm-evaluation-harness/lm_eval/evaluator.py", line 82, in simple_evaluate
    task_dict = lm_eval.tasks.get_task_dict(tasks)
  File "/home/shunk031/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 373, in get_task_dict
    task_name_dict = {
  File "/home/shunk031/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 374, in <dictcomp>
    task_name: get_task(task_name)()
  File "/home/shunk031/lm-evaluation-harness/lm_eval/base.py", line 430, in __init__
    self.download(data_dir, cache_dir, download_mode)
  File "/home/shunk031/lm-evaluation-harness/lm_eval/base.py", line 459, in download
    self.dataset = datasets.load_dataset(
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/datasets/load.py", line 2133, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1717, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/datasets/builder.py", line 1027, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/root/.cache/huggingface/modules/datasets_modules/datasets/shunk031--JGLUE/eed55a4f1c560114b29786d11eed4fc793f35c3b2aa9efdf5352c0bd85016b36/JGLUE.py", line 535, in _split_generators
    return self.__split_generators_marc_ja(dl_manager)
  File "/root/.cache/huggingface/modules/datasets_modules/datasets/shunk031--JGLUE/eed55a4f1c560114b29786d11eed4fc793f35c3b2aa9efdf5352c0bd85016b36/JGLUE.py", line 503, in __split_generators_marc_ja
    split_dfs = preprocess_for_marc_ja(
  File "/root/.cache/huggingface/modules/datasets_modules/datasets/shunk031--JGLUE/eed55a4f1c560114b29786d11eed4fc793f35c3b2aa9efdf5352c0bd85016b36/JGLUE.py", line 405, in preprocess_for_marc_ja
    df = df[["review_body", "star_rating", "review_id"]]
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 3767, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5877, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/home/shunk031/lm-evaluation-harness/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5938, in _raise_if_missing
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['review_body', 'star_rating', 'review_id'], dtype='object')] are in the [columns]"

This KeyError alone does not make it clear that the data failed to load. The loader should raise a more descriptive error, for example one that shows the contents of the data frame.
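
A minimal sketch of what such a check could look like, assuming it runs just before the column selection in preprocess_for_marc_ja. The helper name `select_required_columns`, the constant `REQUIRED_COLUMNS`, and the wording of the message are hypothetical and not part of the current JGLUE.py. Only the three column names come from the traceback above.

```python
import pandas as pd

# Columns selected in preprocess_for_marc_ja, as shown in the traceback.
REQUIRED_COLUMNS = ["review_body", "star_rating", "review_id"]


def select_required_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: select the MARC-ja columns or fail descriptively."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        # Include what was actually loaded so a broken download is easy to spot.
        raise ValueError(
            f"MARC-ja data is missing expected columns {missing}; "
            "the download may have failed or been incomplete.\n"
            f"Columns found: {list(df.columns)}\n"
            f"First rows:\n{df.head()}"
        )
    return df[REQUIRED_COLUMNS]
```

With a check like this, a failed download would surface as a ValueError that lists the columns actually found and the first rows of the frame, instead of a bare KeyError from pandas.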

Related: #9.