Open piraka9011 opened 2 years ago
Hi! It looks like a bug in `pyarrow`. If you manage to end up with only one chunk per parquet file, it should work around this issue.
To achieve that, you can try lowering the value of `max_shard_size`, and also don't use `map` before `push_to_hub`.
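A minimal sketch of that workaround, assuming `dataset` is a `datasets.Dataset` that has not been passed through `map`. The helper name `push_with_small_shards` and the repository name in the usage note are hypothetical; `max_shard_size` is the `push_to_hub` parameter mentioned above.

```python
def push_with_small_shards(dataset, repo_id, max_shard_size):
    """Upload `dataset` to the Hub with an explicit cap on shard size.

    `dataset` is expected to be a `datasets.Dataset` (assumption for this
    sketch); any object exposing a compatible `push_to_hub` method works.
    """
    # Push the dataset as-is, with no map() step in between, and pass the
    # lowered shard size so each uploaded parquet file stays small.
    dataset.push_to_hub(repo_id, max_shard_size=max_shard_size)
```

It could be called as `push_with_small_shards(ds, "user/my-audio-dataset", "1GB")`, where the repository name is a placeholder.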
Do you have a minimum reproducible example that we can share with the Arrow team for further debugging ?
> If you manage to end up with only one chunk per parquet file it should work around this issue.

Yup, I did not encounter this bug when I was testing my script with a slice of <1000 samples for my dataset.

> Do you have a minimum reproducible example...

Not sure if I can get more minimal than the script I shared above. Are you asking for a sample JSON file? Just generate a random manifest list; I can add that to the above script if that's what you mean.
Actually this is probably linked to this open issue: https://issues.apache.org/jira/browse/ARROW-5030.
Setting `max_shard_size="2GB"` should do the job (or `max_shard_size="1GB"` if you want to be on the safe side, especially given that there can be some variance in the shard sizes if the dataset is not evenly distributed).
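To get a feel for how the cap affects the upload, here is a rough sketch that estimates how many shards a dataset of a given size would be split into. It assumes an even split by byte size and decimal units ("GB" = 10^9 bytes); both helpers are hypothetical stand-ins, not the library's own logic, and real shard sizes can vary as noted above.

```python
import math

# Decimal units; an assumption for this sketch, not the library's parser.
_UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(size: str) -> int:
    """Convert a string such as "2GB" into a number of bytes."""
    text = size.strip().upper()
    for suffix, factor in _UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)  # no unit suffix: treat as a plain byte count


def estimated_shards(dataset_nbytes: int, max_shard_size: str) -> int:
    """Estimate the shard count if the data were split evenly by size."""
    return max(1, math.ceil(dataset_nbytes / parse_size(max_shard_size)))
```

Under these assumptions, a 5 GB dataset would need about 3 shards at `"2GB"` and about 5 at `"1GB"`, so the lower cap leaves more headroom per file.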
Describe the bug
I am fine-tuning a wav2vec2 model on my own dataset, following the script here: https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py
Loading my Audio dataset from the hub which was originally generated from disk results in the following PyArrow error:
Steps to reproduce the bug
I created a dataset from a JSON lines manifest of `audio_filepath`, `text`, and `duration`.
When creating the dataset, I do something like this:
Then when I call `load_dataset()` in my training script, with the same dataset I generated above, and download from the Hugging Face Hub, I get the above stack trace. I am able to load the dataset fine if I use `load_from_disk()`.
Expected results
`load_dataset()` should behave just like `load_from_disk()` and not cause any errors.
Actual results
See above
Environment info
I am using the `huggingface/transformers-pytorch-gpu:latest` image
`datasets` version: 2.3.0