Closed cybest0608 closed 2 months ago
This bug seems to happen on every Windows system: I tried Windows 8.1, 10, and 11, and all of them failed. It does not happen on Linux (Ubuntu and CentOS 7) or macOS (on both my virtual and physical machines). I still don't know what the problem is. Maybe it is related to the path? I cannot load the split files on my Windows server even when they were created on Linux (even after replacing the path in the Arrow file). I have worked on this for a week and still cannot fix it, which is frustrating.
Have you properly logged in? Are you using a valid token?
Note that this dataset is gated and you must follow the right procedure to be able to access it. You can find more info in the docs: https://huggingface.co/docs/hub/datasets-gated#access-gated-datasets-as-a-user
I finally found out what happened. It is not about logging in. I copied the dataset from its original path (C:/Users/cybes/.cache/huggingface/datasets/downloads/extracted/XXX/cv-corpus-7.0-2021-07-21) to the desktop and loaded each TSV file in it one by one. When I loaded the test split, the following error occurred: "ArrowInvalid: Failed to parse string: 'Benchmark' as a scalar of type double"
Then I manually deleted those values from the "segment" column, and the error no longer occurs. It also works properly when I replace the original files with the revised TSV files and use the previous loading method (common_voice_train = load_dataset("mozilla-foundation/common_voice_7_0", "ja", split="train", trust_remote_code=True)).
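For anyone who wants to avoid editing the files by hand, the manual cleanup described above could be sketched roughly as follows. This is only an illustration using the Python standard library, not part of `datasets`: the function name `blank_non_numeric` and the three-column sample are hypothetical, and it assumes the TSV is plain tab-separated text with a header row and no quoted fields.

```python
def blank_non_numeric(tsv_text, column="segment"):
    """Blank out values in `column` that cannot be parsed as a float.

    Returns the cleaned TSV text and a list of (line_number, value) pairs
    for every value that was blanked. Line 1 is the header row.
    """
    lines = tsv_text.rstrip("\n").split("\n")
    header = lines[0].rstrip("\r").split("\t")
    col = header.index(column)  # raises ValueError if the column is missing

    bad = []
    out = ["\t".join(header)]
    for line_no, line in enumerate(lines[1:], start=2):
        fields = line.rstrip("\r").split("\t")
        if col < len(fields):
            value = fields[col].strip()
            if value:
                try:
                    float(value)
                except ValueError:
                    # e.g. 'Benchmark', which cannot be read as a double
                    bad.append((line_no, value))
                    fields[col] = ""
        out.append("\t".join(fields))
    return "\n".join(out) + "\n", bad


# Hypothetical sample; the real files have more columns.
sample = (
    "path\tsentence\tsegment\n"
    "a.mp3\thello\t\n"
    "b.mp3\tworld\tBenchmark\n"
    "c.mp3\tagain\t1.0\n"
)
cleaned, bad = blank_non_numeric(sample)
print(bad)
```

To apply this to a real file, read it with `open(path, encoding="utf-8", newline="")`, pass the text to the function, and write the cleaned text to a copy so the original download stays untouched.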
Describe the bug
When I use load_dataset to load mozilla-foundation/common_voice_7_0, it successfully downloads and extracts the dataset but fails to generate the Arrow files. This bug happens on my server and my laptop, the same as in #6906, but it does not happen in Google Colab. I have worked on this for days. Even when I load the dataset from a local path, it generates the train and validation splits, but the bug happens again on the test split.
Steps to reproduce the bug
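from datasets import load_dataset, load_metric, Audio
# selftoken holds the reporter's Hugging Face access token (its definition is not shown in the report)
common_voice_train = load_dataset("mozilla-foundation/common_voice_7_0", "ja", split="train", token=selftoken, trust_remote_code=True)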
Expected behavior
Environment info
Environment: Python 3.9, Windows 11 Pro, VS Code + Jupyter