torchaudio error while loading CommonVoice

harsh244 commented 3 years ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

import torchaudio
train_dataset = torchaudio.datasets.COMMONVOICE("./data", url='english', download=True)

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-2-fc47227a2334> in <module>
      1 import torchaudio
----> 2 train_dataset = torchaudio.datasets.COMMONVOICE("./data", url='english', download=True)

~/.local/lib/python3.6/site-packages/torchaudio/datasets/commonvoice.py in __init__(self, root, tsv, url, folder_in_archive, version, download)
    204         self._tsv = os.path.join(root, folder_in_archive, tsv)
    205 
--> 206         with open(self._tsv, "r") as tsv:
    207             walker = unicode_csv_reader(tsv, delimiter="\t")
    208             self._header = next(walker)

FileNotFoundError: [Errno 2] No such file or directory: './data/CommonVoice/cv-corpus-4-2019-12-10/en/train.tsv'

Expected behavior

It seems that there should be a folder like ./data/CommonVoice/cv-corpus-4-2019-12-10/en/ inside the ./data directory, but these are the actual contents of the data directory:

total 40846448
drwxr-xr-x 2 divyaanand divyaanand    59330560 Dec 15 02:27 clips
-rw-rw-r-- 1 divyaanand divyaanand       15193 Dec 15 02:29 collect_env.py
-rw-r--r-- 1 divyaanand divyaanand     3555470 Dec 11  2019 dev.tsv
-rw-rw-r-- 1 divyaanand divyaanand 41448227462 Dec 15 02:10 en.tar.gz
-rw-r--r-- 1 divyaanand divyaanand    28686118 Dec 11  2019 invalidated.tsv
-rw-r--r-- 1 divyaanand divyaanand    34939509 Dec 11  2019 other.tsv
-rw-r--r-- 1 divyaanand divyaanand     3401154 Dec 11  2019 test.tsv
-rw-r--r-- 1 divyaanand divyaanand    55941783 Dec 11  2019 train.tsv
-rw-r--r-- 1 divyaanand divyaanand   192641057 Dec 11  2019 validated.tsv

It seems that everything gets extracted, but it's not being put into the directory structure that the code expects.

Environment

PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.4 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti

Nvidia driver version: 450.51.05
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.3
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] torch==1.7.0
[pip3] torchaudio==0.7.0
[conda] Could not collect

harsh244 commented 3 years ago

Please let me know in case I am missing something.

mthrok commented 3 years ago

Hi @harsh244

Thanks for reporting. I found the same bug, and I do not think the code was working. As a workaround, can you move the data from ./data to ./data/CommonVoice/cv-corpus-4-2019-12-10/en?
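
If it helps, the move can be scripted. This is only a minimal sketch using the standard library: the helper name `relocate_commonvoice` is hypothetical, the list of entries is taken from the directory listing in the report, and the target sub-path is the one from the error message. The downloaded en.tar.gz is left where it is.

```python
import shutil
from pathlib import Path

# Entries that the archive extraction left directly under the root
# (taken from the directory listing in the report; adjust if yours differs).
EXTRACTED_NAMES = [
    "clips",
    "dev.tsv",
    "invalidated.tsv",
    "other.tsv",
    "test.tsv",
    "train.tsv",
    "validated.tsv",
]


def relocate_commonvoice(root, subdir="CommonVoice/cv-corpus-4-2019-12-10/en"):
    """Move the extracted entries from `root` into `root/subdir`.

    Hypothetical helper, not part of torchaudio. Returns the names that
    were actually moved; entries that do not exist are skipped.
    """
    root = Path(root)
    target = root / subdir
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    for name in EXTRACTED_NAMES:
        src = root / name
        if src.exists():
            shutil.move(str(src), str(target / name))
            moved.append(name)
    return moved


# Example call for the layout in this issue:
# relocate_commonvoice("./data")
```

After this, the same COMMONVOICE call should find train.tsv at the path shown in the traceback.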

harsh244 commented 3 years ago

yeah, that's what i did and it works that way, but I was hoping that the code would automatically take care of creating the required directories.

mthrok commented 3 years ago

> yeah, that's what i did and it works that way, but I was hoping that the code would automatically take care of creating the required directories.

Yeah, that's the expectation, but there was an oversight. Sorry about that. There is another development on CommonVoice: for legal reasons, we are removing the download feature, so unfortunately we do not have an opportunity to make this right in the way originally expected. Once #1082 is merged, we will ask users to download and extract the archive manually, then provide the directory where the dataset is located.
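
For the manual route, something along these lines may be enough. It is a rough sketch, not the torchaudio implementation: the helper name `extract_commonvoice` is hypothetical, it assumes a gzip-compressed tar archive like the en.tar.gz in the report, and it simply looks for train.tsv to locate the dataset directory. The exact loader arguments after #1082 are not shown here, so the sketch stops at returning that directory.

```python
import tarfile
from pathlib import Path


def extract_commonvoice(archive, dest):
    """Extract a manually downloaded archive and locate the dataset directory.

    Hypothetical helper, not part of torchaudio. Returns the directory
    under `dest` that contains train.tsv.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)

    # Only extract archives obtained from a source you trust.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)

    # The archive layout can vary, so search for train.tsv
    # instead of assuming a fixed sub-path.
    matches = sorted(dest.rglob("train.tsv"))
    if not matches:
        raise FileNotFoundError(f"train.tsv not found under {dest}")
    return matches[0].parent


# Example call (paths are placeholders):
# dataset_dir = extract_commonvoice("en.tar.gz", "./data")
# dataset_dir is then the directory to provide to the dataset class.
```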

mthrok commented 3 years ago

Closing as the work-around is confirmed and we cannot fix it anymore.

harsh244 commented 3 years ago

thanks for the info

pytorch / audio