Closed — thudzj closed this issue 5 years ago
Thanks for using my code! Unfortunately, for copyright and other reasons, I cannot directly distribute the data. What kind of error happened?
Thanks! Something like 403 Forbidden.
Hmm, that looks tough, though I'm not familiar with connections from China. One possible fix is adding a User-Agent to the headers of the opener:
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
In the download_*.py you're using, change the setup like this:
try:
    # Python 2: urllib2 opener with a cookie jar and a browser User-Agent
    from cookielib import CookieJar
    cj = CookieJar()
    import urllib2
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    import urllib
    urlretrieve = urllib.urlretrieve
except ImportError:
    # Python 3: the same setup via urllib.request
    import http.cookiejar
    cj = http.cookiejar.CookieJar()
    import urllib.request
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    urlretrieve = urllib.request.urlretrieve
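For reference, here is a minimal Python 3 sketch of fetching one file through such a cookie-aware opener so that the Mozilla User-Agent is actually sent. The URL is the example Smashwords link from later in this thread, and the output filename is just for illustration, not something the script itself uses.

# Minimal sketch (Python 3 only): download one file through the opener.
import http.cookiejar
import urllib.request

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

url = ('https://www.smashwords.com/books/download/12640/6/latest/0/0/'
       'eliminate-your-debt-like-a-pro.txt')

# opener.open() sends the headers set above.
with opener.open(url) as response:
    data = response.read()
with open('eliminate-your-debt-like-a-pro.txt', 'wb') as f:
    f.write(data)

Note that urlretrieve itself goes through the globally installed opener, so if the script downloads via urlretrieve rather than opener.open, you would also need urllib.request.install_opener(opener) for the custom User-Agent to take effect there.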
If that changes nothing, I'm out of ideas!
Haha, I'll try. Thank you very much for the instant reply!
Hi there,
I'm also getting a 403 Forbidden error, though I'm able to download successfully via wget [URL].
An example URL is: https://www.smashwords.com/books/download/12640/6/latest/0/0/eliminate-your-debt-like-a-pro.txt
Here's a screenshot for reference:
Did you succeed with wget? I guess some kind of IP block has happened.
Yes, I was able to download using wget.
Actually, no, it fails with wget as well, and adding --user-agent=Lynx in wget or the above code for the Mozilla agent in Python doesn't help either.
I'm on a us-east AWS EC2 instance.
Thank you for the information. As #24 also reported, the crawling is getting difficult.
@thudzj By the way, as shown in my comment (https://github.com/soskek/bookcorpus/issues/24#issuecomment-556024973), you can try the unknown file on Google Drive (at your own risk).
Hi Sosuke,
Thanks a lot for the wonderful work! I hoped to obtain the BookCorpus dataset with your crawler, but I failed to crawl the books owing to some network errors, so I'm afraid I cannot get a complete dataset. Could you please share the dataset you have, e.g. the all.txt? My email address is dengzhijiethu@gmail.com. Thanks!
Zhijie