Open jacobjennings opened 2 months ago
When loading the Common Voice dataset on Windows, the file train.tsv is decoded using the cp1252 encoding, and loading fails with a UnicodeDecodeError.
training_speech_dataset = torchaudio.datasets.COMMONVOICE(root=base_dataset_cache_directory)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[49], line 1
----> 1 training_speech_dataset = torchaudio.datasets.COMMONVOICE(root=base_dataset_cache_directory)

File ~\Documents\GitHub\clarification\venv-pc\Lib\site-packages\torchaudio\datasets\commonvoice.py:55, in COMMONVOICE.__init__(self, root, tsv)
     53 walker = csv.reader(tsv_, delimiter="\t")
     54 self._header = next(walker)
---> 55 self._walker = list(walker)

File ~\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
     22 def decode(self, input, final=False):
---> 23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3155: character maps to <undefined>
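The traceback suggests the TSV is opened without an explicit encoding, so Python falls back to the locale default (cp1252 here). Below is a minimal sketch of what is likely going on, not torchaudio's actual code. The helper `read_tsv` and the sample file contents are made up for illustration. It writes a small UTF-8 file containing a right curly quote (whose UTF-8 bytes include 0x9d) and reads it back with both encodings.

```python
import csv
import os
import tempfile


def read_tsv(path, encoding="utf-8"):
    """Hypothetical helper: read a TSV with an explicit encoding."""
    with open(path, "r", encoding=encoding, newline="") as tsv_:
        walker = csv.reader(tsv_, delimiter="\t")
        header = next(walker)
        rows = list(walker)
    return header, rows


# Stand-in for train.tsv (assumed contents, not real Common Voice data).
# U+201D (right curly quote) is encoded in UTF-8 as E2 80 9D.
tmp_dir = tempfile.mkdtemp()
tsv_path = os.path.join(tmp_dir, "train.tsv")
with open(tsv_path, "w", encoding="utf-8", newline="") as f:
    f.write("path\tsentence\n")
    f.write("clip_0001.mp3\tShe said \u201chello\u201d\n")

# Reading as UTF-8 works.
header, rows = read_tsv(tsv_path)

# Reading as cp1252 fails, because byte 0x9d has no mapping in that code page.
try:
    read_tsv(tsv_path, encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True
```

If that is indeed the cause, a possible workaround until the loader passes an encoding explicitly is to start Python in UTF-8 mode (set the environment variable `PYTHONUTF8=1` before launching the interpreter or Jupyter kernel, or run `python -X utf8`), which makes `open()` default to UTF-8.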
Versions: Python 3.11
You can try downloading the dataset from Hugging Face:
https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0