Open nikregrado opened 4 years ago
@adarob, do you know about this error?
Sounds like it could be a transient issue with the Common Crawl download server being overloaded. Have you tried rerunning it?
Yes, I have been trying to download this dataset for a week, and I keep getting this error.
Is it always that exact error, or does it happen for different URLs? You could also try using --register_checksums, which will force it to download everything in one thread. It will be very slow but may get around this issue, which I suspect is throttling from the server.
This did not help. I get the same error, but now on different URLs.
I am using the default script to download c4 with config en.realnewslike, and I get this error:

RuntimeError: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='commoncrawl.s3.amazonaws.com', port=443): Max retries exceeded with url: /crawl-data/CC-MAIN-2019-18/segments/1555578530176.6/wet/CC-MAIN-20190421040427-20190421062427-00300.warc.wet.gz (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1dcd61ba90>: Failed to establish a new connection: [Errno 110] Connection timed out')) [while running 'Map(download_url)']

After 10 errors like this, the job on Dataflow fails. I am using tensorflow-datasets==3.1.0 and Python 3.7.
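For anyone hitting the same timeouts: if the cause really is throttling or a transient overload, as suspected above, one general way to cope is to retry each download with exponential backoff. The snippet below is only a minimal sketch of that idea, not how tensorflow-datasets or the c4 build script handles downloads. The helper name `retry_with_backoff` and all of its parameters are hypothetical, and wiring it into the pipeline's `download_url` step is left as an assumption.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts calls have failed.

    Hypothetical helper for illustration only. The default retry_on is
    OSError because requests.exceptions.ConnectionError and the built-in
    ConnectionError both derive from it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: let the last error propagate.
                raise
            # Double the wait after each failure, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


# Usage sketch, assuming a user-supplied download(url) function:
#   data = retry_with_backoff(lambda: download(url), max_attempts=8)
```

With many parallel workers the jitter matters, because otherwise every worker that failed at the same moment would also retry at the same moment and hit the server again together.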