Open JeanKaddour opened 2 years ago
Hi! IIRC one of the files in this dataset is corrupted due to https://github.com/huggingface/datasets/pull/4081 (fixed now).
@NielsRogge Could you please re-generate and re-push this dataset (or I can do it if you share the generation script)?
Could you put something in place to catch these problems? I'm seeing this on another dataset consistently too and I guess I can't fix it in code?
Hey,
Yes, the notebook I used to upload this dataset can be found here: https://colab.research.google.com/drive/141LJCcM2XyqprPY83nIQ-Zk3BbxWeahq?usp=sharing.
If you have time to regenerate the dataset, that would be great.
Sorry, maybe I wasn't clear enough: I'm seeing this on a different dataset, laion2B-multi-joined-translated-to-en. I think there should be checks in the upload, tests on the server, or validation after download (hashes) to catch these problems.
Lots of bandwidth wasted otherwise! /cc @mariosasko
Yes @alexjc, sorry, that was more a reply to @JeanKaddour.
And indeed it'd be great to have additional checks to avoid these errors.
cc @severo since such checks should probably be implemented on the datasets-server side.
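For reference, the post-download validation suggested above (hashes) could look roughly like the sketch below. It is only an illustration: `sha256_of_file` and `verify_download` are hypothetical helper names, not part of `datasets`, and it assumes an expected SHA-256 digest is published somewhere alongside each file.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large shards do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file's digest matches the expected one.

    `expected_sha256` is assumed to come from a trusted manifest (hypothetical).
    """
    return sha256_of_file(path) == expected_sha256.lower()
```

A mismatch would flag a corrupted or truncated file right after download, before any attempt to parse it.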
Hi,
It seems the problem still persists. I encountered the exact same problem using just the 2 lines of code above.
The error code is as follows:
Exception occurred: DatasetGenerationError
An error occurred while generating the dataset
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
The above exception was the direct cause of the following exception:
File "/code/ddpm_learn/train.py", line 65, in <module>
dataset = load_dataset("huggan/CelebA-HQ", cache_dir="./CelebA-HQ")
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
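The `ArrowInvalid` message refers to the 4-byte magic marker `PAR1` that a Parquet file carries at both its start and its end. A quick local check on a cached file could look like the sketch below. `has_parquet_magic` is a hypothetical helper, not a `pyarrow` or `datasets` function, and it only tests the plain `PAR1` marker, not whether the rest of the file is valid.

```python
import os

PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    size = os.path.getsize(path)
    # Smallest plausible file: header magic + 4-byte footer length + footer magic.
    if size < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

Running this over the files in the cache directory would show which shard is truncated or is not a Parquet file at all.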
Yes, for the moment, refer to the notebook linked above if you want to create a HF dataset yourself.
Hi @NielsRogge, I can help push the dataset to the cloud. However, I cannot locate the problem so far. I wonder if
Thanks, Allan
Describe the bug
Loading huggan/CelebA-HQ throws pyarrow.lib.ArrowInvalid
Steps to reproduce the bug
Expected results
See https://colab.research.google.com/drive/141LJCcM2XyqprPY83nIQ-Zk3BbxWeahq?usp=sharing#scrollTo=N3ml_7f8kzDd
Actual results
Environment info
datasets version: datasets-2.4.1.dev0