huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Can't Download Common Voice 17.0 hy-AM #6848

Open mheryerznkanyan opened 5 months ago

mheryerznkanyan commented 5 months ago

Describe the bug

I want to download the hy-AM configuration of Common Voice 17.0, but load_dataset fails with KeyError: 'sentence_id' while generating the train split.


The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_name='hfds_config', config_path=None)
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
/usr/local/lib/python3.10/dist-packages/datasets/load.py:1429: FutureWarning: The repository for mozilla-foundation/common_voice_17_0 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/mozilla-foundation/common_voice_17_0
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Reading metadata...: 6180it [00:00, 133224.37it/s]les/s]
Generating train split: 0 examples [00:00, ? examples/s]
HuggingFace datasets failed due to some reason (stack trace below).
For certain datasets (eg: MCV), it may be necessary to login to the huggingface-cli (via `huggingface-cli login`).
Once logged in, you need to set `use_auth_token=True` when calling this script.

Traceback error for reference :

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1743, in _prepare_split_single
    example = self.info.features.encode_example(record) if self.info.features is not None else record
  File "/usr/local/lib/python3.10/dist-packages/datasets/features/features.py", line 1878, in encode_example
    return encode_nested_example(self, example)
  File "/usr/local/lib/python3.10/dist-packages/datasets/features/features.py", line 1243, in encode_nested_example
    {
  File "/usr/local/lib/python3.10/dist-packages/datasets/features/features.py", line 1243, in <dictcomp>
    {
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 326, in zip_dict
    yield key, tuple(d[key] for d in dicts)
  File "/usr/local/lib/python3.10/dist-packages/datasets/utils/py_utils.py", line 326, in <genexpr>
    yield key, tuple(d[key] for d in dicts)
KeyError: 'sentence_id'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/nemo/scripts/speech_recognition/convert_hf_dataset_to_nemo.py", line 358, in main
    dataset = load_dataset(
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 2549, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1767, in _download_and_prepare
    super()._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1100, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1605, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1762, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
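
The first traceback ends in the lookup d[key] for d in dicts, which suggests that one of the dictionaries being zipped (for example, the generated record) has no 'sentence_id' key while another (for example, the declared features) does. Below is a minimal sketch of that failure mode in plain Python. The zip_dict here is a simplified stand-in modeled on the line shown in the traceback, not the library's actual code, and the schema and record contents are hypothetical.

```python
import itertools


def zip_dict(*dicts):
    """Simplified stand-in: yield (key, values) over the union of all keys.

    Each key is looked up in every dict, so a key that is missing from
    any one of them raises KeyError, as in the traceback above.
    """
    for key in dict.fromkeys(itertools.chain(*dicts)):
        yield key, tuple(d[key] for d in dicts)


# Hypothetical schema and record; the real column names may differ.
schema = {"client_id": "string", "sentence_id": "string", "sentence": "string"}
record = {"client_id": "abc", "sentence": "hello"}  # no "sentence_id"

missing_key = None
try:
    dict(zip_dict(schema, record))
except KeyError as err:
    # err.args[0] holds the key that could not be found
    missing_key = err.args[0]

print(f"record is missing key: {missing_key}")
```

If this reading is right, the mismatch is between the columns the loading script declares and the columns it actually yields for this release, rather than anything specific to the calling code.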

Steps to reproduce the bug

from datasets import load_dataset

cv_17 = load_dataset("mozilla-foundation/common_voice_17_0", "hy-AM")

Expected behavior

The dataset should load without errors, as it does with common_voice_16_1.

Environment info

SalomonKisters commented 4 months ago

Same issue here.