lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Common Voice download fails with a 403 error #1324

Open daniel-dona opened 5 months ago

daniel-dona commented 5 months ago

Found while testing the icefall egs for commonvoice/ASR, when running ./prepare.sh.

Running lhotse download commonvoice [...] results in an HTTP error:

2024-04-17 21:54:04,887 INFO [commonvoice.py:84] Language: fr
Downloading CommonVoice languages:   0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/dani/.local/bin/lhotse", line 8, in <module>
    sys.exit(cli())
  File "/home/dani/.local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/dani/.local/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/dani/.local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/dani/.local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/dani/.local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/dani/.local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/dani/.local/lib/python3.10/site-packages/lhotse/bin/modes/recipes/commonvoice.py", line 76, in commonvoice
    download_commonvoice(
  File "/home/dani/.local/lib/python3.10/site-packages/lhotse/recipes/commonvoice.py", line 105, in download_commonvoice
    resumable_download(
  File "/home/dani/.local/lib/python3.10/site-packages/lhotse/utils.py", line 543, in resumable_download
    raise e
  File "/home/dani/.local/lib/python3.10/site-packages/lhotse/utils.py", line 517, in resumable_download
    _download(req, file_size)
  File "/home/dani/.local/lib/python3.10/site-packages/lhotse/utils.py", line 499, in _download
    with urllib.request.urlopen(rq) as response:
  File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/usr/lib/python3.10/urllib/request.py", line 634, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Looks like S3 downloads need to be signed...
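
To make this kind of failure easier to diagnose, a download wrapper could catch the `HTTPError` and report the 403 explicitly. This is only a minimal sketch using the standard library: `explain_http_error` and `fetch` are hypothetical names, not part of lhotse, and the hint about signed links is an assumption based on the guess above.

```python
import urllib.error
import urllib.request


def explain_http_error(url: str, err: urllib.error.HTTPError) -> str:
    """Turn an HTTPError into a short, actionable message (hypothetical helper)."""
    if err.code == 403:
        # Assumption: a 403 from a storage backend often means the link
        # must be signed or has expired.
        return (
            f"403 Forbidden for {url}: the server refused the request; "
            "the link may need to be signed or may have expired."
        )
    return f"HTTP {err.code} ({err.reason}) for {url}"


def fetch(url: str, opener=urllib.request.urlopen) -> bytes:
    """Download url, re-raising HTTP failures with a clearer message."""
    try:
        with opener(url) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        raise RuntimeError(explain_http_error(url, err)) from err
```

The `opener` parameter exists only so the wrapper can be exercised without network access; by default it calls `urllib.request.urlopen`, the same call that appears in the traceback.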

pzelasko commented 5 months ago

Something might have changed on the CommonVoice side. If the issue persists, it may be best to download directly from their site.

daniel-dona commented 5 months ago

> Something might have changed on the CommonVoice side. If the issue persists, it may be best to download directly from their site.

We can use the same method their download page uses: they have an (undocumented) API.

https://gist.github.com/daniel-dona/e1bce1d8ab01284d019d087664127cba