Connection Error while trying download c4/en.realnewslike

nikregrado commented 4 years ago

I am using the default script to download c4 with the en.realnewslike config, and I get this error:

RuntimeError: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='commoncrawl.s3.amazonaws.com', port=443): Max retries exceeded with url: /crawl-data/CC-MAIN-2019-18/segments/1555578530176.6/wet/CC-MAIN-20190421040427-20190421062427-00300.warc.wet.gz (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1dcd61ba90>: Failed to establish a new connection: [Errno 110] Connection timed out')) [while running 'Map(download_url)']

After 10 errors like this, the Dataflow job fails. I am using tensorflow-datasets==3.1.0 and Python 3.7.

Conchylicultor commented 4 years ago

@adarob, do you know about this error?

adarob commented 4 years ago

Sounds like it could be a transient issue with the Common Crawl download server being overloaded. Have you tried rerunning it?

nikregrado commented 4 years ago

Yes, I have been trying to download this dataset for a week, and I keep getting this error.

adarob commented 4 years ago

Is it always that exact error or is it for different URLs? You could also try using --register_checksums which will force it to download everything in one thread. It will be very slow but may get around this issue, which I suspect is throttling from the server.
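If the failures really are throttling, the usual client-side mitigation is to retry each request with exponential backoff instead of failing right away. Below is a minimal standalone sketch of that idea. It is not the TFDS or Beam download code, and `retry_with_backoff` and its parameters are hypothetical names chosen for illustration.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    Hypothetical helper, not part of tensorflow-datasets.
    OSError covers the built-in ConnectionError and TimeoutError, and
    requests' exceptions also derive from it (RequestException subclasses
    IOError, an alias of OSError).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            # Double the wait on each failure, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 50% random jitter so parallel workers do not
            # all retry at the same moment.
            sleep(delay + random.uniform(0, delay / 2))
```

With the `requests` library, a call might look like `retry_with_backoff(lambda: requests.get(url, timeout=30))`, where `url` is one of the failing `.warc.wet.gz` links. Backoff only helps if the server eventually accepts connections again; it will not fix a blocked network route.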

nikregrado commented 4 years ago

This did not help. I got the same error, but now on different URLs.

tensorflow / datasets

Connection Error while trying download c4/en.realnewslike #2187