huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Loading huggan/CelebA-HQ throws pyarrow.lib.ArrowInvalid #4886

Open JeanKaddour opened 2 years ago

JeanKaddour commented 2 years ago

Describe the bug

Loading huggan/CelebA-HQ throws pyarrow.lib.ArrowInvalid

Steps to reproduce the bug

from datasets import load_dataset
dataset = load_dataset('huggan/CelebA-HQ')

Expected results

See https://colab.research.google.com/drive/141LJCcM2XyqprPY83nIQ-Zk3BbxWeahq?usp=sharing#scrollTo=N3ml_7f8kzDd

Actual results

  File "/home/jean/projects/cold_diffusion/celebA.py", line 4, in <module>
    dataset = load_dataset('huggan/CelebA-HQ')
  File "/home/jean/miniconda3/envs/seq/lib/python3.10/site-packages/datasets/load.py", line 1793, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/jean/miniconda3/envs/seq/lib/python3.10/site-packages/datasets/builder.py", line 704, in download_and_prepare
    self._download_and_prepare(
  File "/home/jean/miniconda3/envs/seq/lib/python3.10/site-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/jean/miniconda3/envs/seq/lib/python3.10/site-packages/datasets/builder.py", line 1274, in _prepare_split
    for key, table in logging.tqdm(
  File "/home/jean/miniconda3/envs/seq/lib/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/jean/miniconda3/envs/seq/lib/python3.10/site-packages/datasets/packaged_modules/parquet/parquet.py", line 67, in _generate_tables
    parquet_file = pq.ParquetFile(f)
  File "/home/jean/miniconda3/envs/seq/lib/python3.10/site-packages/pyarrow/parquet/__init__.py", line 286, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 1227, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

Environment info

mariosasko commented 2 years ago

Hi! IIRC one of the files in this dataset is corrupted due to https://github.com/huggingface/datasets/pull/4081 (fixed now).

@NielsRogge Could you please re-generate and re-push this dataset (or I can do it if you share the generation script)?

alexjc commented 1 year ago

Could you put something in place to catch these problems? I'm seeing this on another dataset consistently too and I guess I can't fix it in code?

NielsRogge commented 1 year ago

Hey,

Yes, the notebook I used to upload this dataset can be found here: https://colab.research.google.com/drive/141LJCcM2XyqprPY83nIQ-Zk3BbxWeahq?usp=sharing.

If you have time to regenerate the dataset, that would be great.

alexjc commented 1 year ago

Sorry, maybe I wasn't clear enough that it's a different dataset: laion2B-multi-joined-translated-to-en. I think there should be checks in the upload, tests on the server, or validation after download (hashes) to catch these problems.

Lots of bandwidth wasted otherwise! /cc @mariosasko
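
As an illustration of the "validation after download (hashes)" idea, here is a minimal sketch using only the Python standard library. The helper names `sha256_of_file` and `verify_download` are hypothetical and not part of `datasets`, and the sketch assumes a trusted expected digest is published alongside each file, which this thread does not confirm.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large shards do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Hypothetical helper: True if the file matches the expected digest."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

A check like this only detects corruption introduced after the digest was computed; if the file was already broken when it was uploaded, the digest would match the broken file.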

NielsRogge commented 1 year ago

Yes @alexjc, sorry, that was more a reply to @JeanKaddour.

And indeed it'd be great to have additional checks to avoid these errors.

mariosasko commented 1 year ago

cc @severo since such checks should probably be implemented on the datasets-server side.

allanchan339 commented 1 year ago

Hi,

It seems the problem still persists. I have encountered the exact same problem using just the 2 lines of code above.

The error code is as follows:

Exception has occurred: DatasetGenerationError
An error occurred while generating the dataset
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

The above exception was the direct cause of the following exception:

  File "/code/ddpm_learn/train.py", line 65, in <module>
    dataset = load_dataset("huggan/CelebA-HQ", cache_dir="./CelebA-HQ"
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

NielsRogge commented 1 year ago

Yes, for the moment, refer to the notebook linked above if you want to create a HF dataset yourself.

allanchan339 commented 1 year ago

Hi @NielsRogge, I can help push the dataset to the cloud. However, I have not been able to pinpoint the cause so far. I wonder whether

  1. the files I downloaded are themselves corrupted, so the dataset cannot be generated from them, or
  2. the downloaded files are fine and the bug was in the upload program, in which case I can re-upload what I have just downloaded.

Thanks, Allan
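
One way to tell the two cases apart locally is to check each downloaded file for the Parquet magic bytes that the error message refers to. The sketch below assumes plain (unencrypted) Parquet files, which begin and end with the 4-byte marker `PAR1`. The function names are hypothetical, and the location of the downloaded files in the cache is left to the caller.

```python
import os

PARQUET_MAGIC = b"PAR1"
# Lower bound for a file: leading magic (4) + footer length (4) + trailing magic (4).
MIN_PARQUET_SIZE = 12


def has_parquet_magic(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    if os.path.getsize(path) < MIN_PARQUET_SIZE:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def find_suspect_files(paths):
    """Return the paths that fail the magic-bytes check (hypothetical helper)."""
    return [p for p in paths if not has_parquet_magic(p)]
```

A file that fails this check is truncated or is not a Parquet file. Passing it does not prove the contents are valid, so it is only a cheap first filter before opening the file with `pq.ParquetFile`.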