mgalley / DSTC7-End-to-End-Conversation-Modeling

Grounded conversational dataset for end-to-end conversational AI (official DSTC7 data)
http://workshop.colips.org/dstc7/

Common Crawl error code 503/502 #5

Open henryhungle opened 6 years ago

henryhungle commented 6 years ago

Hi,

Thank you for releasing the code for data extraction. I am extracting the data with your scripts and noticed some errors in the log file. Most of them are Common Crawl error codes 502/503, and there seem to be up to 5 retry attempts.

Will this affect the quality of my dataset? Do I need to run the scripts again?

Sample logs are shown below:

Common Crawl error code 502, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2013-20-index?url=http%3A%2F%2Fwikipedia.org%2Fwiki%2FErnest_Hemingway%23Cuba_and_the_Nobel_Prize%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2015-22-index?url=http%3A%2F%2Fwww.dailymotion.com%2Fvideo%2Fxx2dlk_y2-2yyyyyy_lifestyle%23from%3Dembediframe%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2015-27-index?url=http%3A%2F%2Fwww.dailymotion.com%2Fvideo%2Fxx2dlk_y2-2yyyyyy_lifestyle%23from%3Dembediframe%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2016-40-index?url=http%3A%2F%2Fwww.dailymotion.com%2Fvideo%2Fxx2dlk_y2-2yyyyyy_lifestyle%23from%3Dembediframe%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2017-09-index?url=http%3A%2F%2Fwww.dailymotion.com%2Fvideo%2Fxx2dlk_y2-2yyyyyy_lifestyle%23from%3Dembediframe%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2017-13-index?url=http%3A%2F%2Fwww.dailymotion.com%2Fvideo%2Fxx2dlk_y2-2yyyyyy_lifestyle%23from%3Dembediframe%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2017-17-index?url=http%3A%2F%2Fwww.dailymotion.com%2Fvideo%2Fxx2dlk_y2-2yyyyyy_lifestyle%23from%3Dembediframe%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2017-51-index?url=http%3A%2F%2Fwww.dailymotion.com%2Fvideo%2Fxx2dlk_y2-2yyyyyy_lifestyle%23from%3Dembediframe%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2018-17-index?url=http%3A%2F%2Fwww.dailymotion.com%2Fvideo%2Fxx2dlk_y2-2yyyyyy_lifestyle%23from%3Dembediframe%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2014-10-index?url=http%3A%2F%2Fwww.dailymotion.com%2Fvideo%2Fxx2dlk_y2-2yyyyyy_lifestyle%23from%3Dembediframe%2F&output=json
Common Crawl error code 502, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2014-10-index?url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FHaggis%23Outside_Scotland%2F&output=json

iYUYUE commented 6 years ago

Same thing here!

Common Crawl error code 503, waiting 3 seconds... (retry attempt 2/5), url: http://index.commoncrawl.org/CC-MAIN-2014-49-index?url=http%3A%2F%2Fwww.ft.com%2Fcms%2Fs%2F0%2F548aabe2-2405-11e0-bef0-00144feab49a%2Cs01%3D1.html%23axzz1BaVTWxgB%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2015-14-index?url=http%3A%2F%2Fwww.ft.com%2Fcms%2Fs%2F0%2F548aabe2-2405-11e0-bef0-00144feab49a%2Cs01%3D1.html%23axzz1BaVTWxgB%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2015-48-index?url=http%3A%2F%2Fwww.ft.com%2Fcms%2Fs%2F0%2F548aabe2-2405-11e0-bef0-00144feab49a%2Cs01%3D1.html%23axzz1BaVTWxgB%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 2/5), url: http://index.commoncrawl.org/CC-MAIN-2015-48-index?url=http%3A%2F%2Fwww.ft.com%2Fcms%2Fs%2F0%2F548aabe2-2405-11e0-bef0-00144feab49a%2Cs01%3D1.html%23axzz1BaVTWxgB%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2016-30-index?url=http%3A%2F%2Fwww.ft.com%2Fcms%2Fs%2F0%2F548aabe2-2405-11e0-bef0-00144feab49a%2Cs01%3D1.html%23axzz1BaVTWxgB%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2016-40-index?url=http%3A%2F%2Fwww.ft.com%2Fcms%2Fs%2F0%2F548aabe2-2405-11e0-bef0-00144feab49a%2Cs01%3D1.html%23axzz1BaVTWxgB%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2017-04-index?url=http%3A%2F%2Fwww.ft.com%2Fcms%2Fs%2F0%2F548aabe2-2405-11e0-bef0-00144feab49a%2Cs01%3D1.html%23axzz1BaVTWxgB%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2017-26-index?url=http%3A%2F%2Fwww.ft.com%2Fcms%2Fs%2F0%2F548aabe2-2405-11e0-bef0-00144feab49a%2Cs01%3D1.html%23axzz1BaVTWxgB%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2017-51-index?url=http%3A%2F%2Fwww.ft.com%2Fcms%2Fs%2F0%2F548aabe2-2405-11e0-bef0-00144feab49a%2Cs01%3D1.html%23axzz1BaVTWxgB%2F&output=json
Common Crawl error code 503, waiting 3 seconds... (retry attempt 1/5), url: http://index.commoncrawl.org/CC-MAIN-2018-05-index?url=http%3A%2F%2Fwww.ft.com%2Fcms%2Fs%2F0%2F548aabe2-2405-11e0-bef0-00144feab49a%2Cs01%3D1.html%23axzz1BaVTWxgB%2F&output=json

PickHub commented 5 years ago

Did anyone solve this?

pelamx commented 1 year ago

Getting links from the latest 93 commoncrawl.org index collections (this can take a while for some do
[ 504 ] Error for https://index.commoncrawl.org/CC-MAIN-2018-30-index
[ 504 ] Error for https://index.commoncrawl.org/CC-MAIN-2014-23-index
[ 504 ] Error for https://index.commoncrawl.org/CC-MAIN-2021-17-index
[ 504 ] Error for https://index.commoncrawl.org/CC-MAIN-2022-27-index
[ 504 ] Error for https://index.commoncrawl.org/CC-MAIN-2016-22-index
[ 504 ] Error for https://index.commoncrawl.org/CC-MAIN-2020-40-index
[ 504 ] Error for https://index.commoncrawl.org/CC-MAIN-2013-20-index

I have the same issue as above, but with error code 504. Does anyone have a solution?
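
Not an official fix, but since 502, 503 and 504 are all server-side HTTP statuses (bad gateway, service unavailable, gateway timeout), one generic workaround is to retry with a growing wait instead of a fixed 3 seconds. Below is a minimal sketch, assuming Python 3 and only the standard library. `http_get` and `fetch_with_retry` are hypothetical names, not functions from this repository's scripts, and the printed message only mimics the log format quoted in this thread.

```python
import time
import urllib.error
import urllib.request

# Statuses treated as transient (assumption: worth retrying).
RETRYABLE = {502, 503, 504}


def http_get(url, timeout=30):
    """Return (status_code, body_text) for a GET request.

    Hypothetical helper; network-level failures (urllib.error.URLError
    that is not an HTTPError) are not handled here and will propagate.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, ""


def fetch_with_retry(url, fetch=http_get, max_retries=5, base_wait=3.0,
                     sleep=time.sleep):
    """Call fetch(url); on a retryable status, wait and try again.

    Returns the body on HTTP 200, or None if the URL could not be
    fetched (non-retryable status, or all retries used up).
    """
    for attempt in range(max_retries + 1):  # 1 initial try + max_retries
        status, body = fetch(url)
        if status == 200:
            return body
        if status not in RETRYABLE or attempt == max_retries:
            return None
        wait = base_wait * (2 ** attempt)  # exponential backoff: 3, 6, 12, ...
        print(f"Common Crawl error code {status}, waiting {wait:g} seconds... "
              f"(retry attempt {attempt + 1}/{max_retries}), url: {url}")
        sleep(wait)
    return None
```

In this sketch a URL that still fails after the last retry comes back as `None`, so collecting those URLs and re-running the extraction only for them later is one way to check whether anything was actually lost.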