pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License
2.43k stars 635 forks

Cannot load Common Voice dataset on Windows #3781

Open jacobjennings opened 2 months ago

jacobjennings commented 2 months ago

🐛 Describe the bug

When loading the Common Voice dataset on Windows, the file train.tsv is decoded with the cp1252 encoding instead of UTF-8, so reading it fails with a UnicodeDecodeError.

training_speech_dataset = torchaudio.datasets.COMMONVOICE(root=base_dataset_cache_directory)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[49], line 1
----> 1 training_speech_dataset = torchaudio.datasets.COMMONVOICE(root=base_dataset_cache_directory)

File ~\Documents\GitHub\clarification\venv-pc\Lib\site-packages\torchaudio\datasets\commonvoice.py:55, in COMMONVOICE.__init__(self, root, tsv)
     53 walker = csv.reader(tsv_, delimiter="\t")
     54 self._header = next(walker)
---> 55 self._walker = list(walker)

File ~\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
     22 def decode(self, input, final=False):
---> 23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3155: character maps to <undefined>
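
The failure can be reproduced outside torchaudio. Below is a minimal sketch, assuming train.tsv is UTF-8 encoded and contains a character such as a right curly quote, whose UTF-8 encoding includes the byte 0x9d. The sample row, the file location, and the read_tsv helper are hypothetical and are not part of torchaudio.

```python
import csv
import os
import tempfile

# Hypothetical sample data; the real train.tsv has more columns.
rows = [
    ["path", "sentence"],
    ["clip_0001.mp3", "She said \u201chello\u201d."],  # U+201D encodes to E2 80 9D
]

# Write a small UTF-8 TSV to a temporary directory.
tmp_dir = tempfile.mkdtemp()
tsv_path = os.path.join(tmp_dir, "train.tsv")
with open(tsv_path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)


def read_tsv(path, encoding):
    """Hypothetical helper: read a TSV into (header, rows) with an explicit encoding."""
    with open(path, "r", encoding=encoding, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        return header, list(reader)


# Decoding as cp1252 fails because byte 0x9d has no mapping in that code page.
try:
    read_tsv(tsv_path, "cp1252")
    cp1252_failed = False
except UnicodeDecodeError as err:
    cp1252_failed = True
    print("cp1252 failed:", err)

# Decoding as UTF-8 reads the same file without error.
header, walker = read_tsv(tsv_path, "utf-8")
print(header, walker)
```

If this is indeed the cause, passing encoding="utf-8" where the TSV is opened would avoid the error. As a possible workaround that does not require patching the library, starting Python in UTF-8 mode (python -X utf8, or the PYTHONUTF8=1 environment variable) makes open() default to UTF-8 on Windows.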

Versions

Python 3.11

mogwai commented 2 months ago

You can try downloading it from Hugging Face instead:

https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0