soskek / bookcorpus

Crawl BookCorpus
MIT License

Add short sleep after successful download #6

Closed yoquankara closed 5 years ago

yoquankara commented 5 years ago

This helps reduce HTTP 503 errors, which are likely caused by rate limiting on the server side.
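For readers without the diff at hand, the idea can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `download_all`, the injectable `fetch` and `sleep` parameters, and the `SLEEP_SEC` value shown are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Iterable

# Hypothetical value; the thread does not state the one used in the PR.
SLEEP_SEC = 0.001


def download_all(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    sleep_sec: float = SLEEP_SEC,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bytes]:
    """Fetch each URL in turn, pausing briefly after every successful download."""
    results: Dict[str, bytes] = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except OSError:
            # urllib's HTTPError (e.g. a 503) is a subclass of OSError.
            # Failed downloads are simply skipped in this sketch.
            continue
        # Pause only after a success, to space out requests to the server.
        sleep(sleep_sec)
    return results
```

Passing `fetch` and `sleep` as parameters is only a convenience so the loop can be exercised without network access; a real script would more likely call `urllib.request.urlopen` and `time.sleep` directly.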

soskek commented 5 years ago

Thanks for the PR. Does such a short sleep really help?

yoquankara commented 5 years ago

Yes, it did help. Before this PR, my download was very intermittent with frequent HTTP error 503 (Service Temporarily Unavailable) and retries.

However, I think it depends on the specific network environment, so this is unlikely to be the single best value. I can change it to a larger value, such as 5 ms, if you prefer. But the larger the sleep, the longer the whole download takes.
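To put a rough number on that trade-off, the extra time is simply the number of successful downloads multiplied by the sleep length. The helper below is a small sketch; the name `added_seconds` and the file count of 10,000 are hypothetical and not taken from the repository.

```python
def added_seconds(num_files: int, sleep_sec: float) -> float:
    """Total extra wall-clock time spent sleeping, one pause per file."""
    return num_files * sleep_sec


# Hypothetical corpus size, for illustration only.
NUM_FILES = 10_000

for sleep_ms in (1, 5, 100):
    extra = added_seconds(NUM_FILES, sleep_ms / 1000)
    print(f"{sleep_ms} ms per file adds about {extra:.0f} s in total")
```

At millisecond scale the overhead stays small even for many files, which fits the observation that a short sleep has little downside.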

yoquankara commented 5 years ago

Btw, I didn't touch download_list.py. If we agree on the value, I will fix that file too.

soskek commented 5 years ago

OK! It has only a small side effect, so I'll merge it as a trial. Thank you!

In fact, I had already added a sleep inside the loop in download_list.py. https://github.com/soskek/bookcorpus/blob/973edec568f14e5eba2ea57a595d703708696ad9/download_list.py#L60

I only forgot to add it in download_files.py, even though I had already defined SLEEP_SEC there... https://github.com/soskek/bookcorpus/blob/973edec568f14e5eba2ea57a595d703708696ad9/download_files.py#L28

yoquankara commented 5 years ago

Ah, I see :-) Thank you for merging! I agree to treat it as a trial for later refactoring.