soskek / bookcorpus

Crawl BookCorpus
MIT License

Could you share the processed all.txt? #23

Closed thudzj closed 5 years ago

thudzj commented 5 years ago

Hi Sosuke,

Thanks a lot for the wonderful work! I hoped to obtain the BookCorpus dataset with your crawler, but I failed to crawl the articles owing to some network errors, so I'm afraid I cannot obtain a complete dataset. Could you please share the dataset you have, e.g. the all.txt file? My email address is dengzhijiethu@gmail.com. Thanks!

Zhijie

soskek commented 5 years ago

Thanks for using my code! Unfortunately, for copyright and related reasons, I cannot distribute the data directly. What kind of errors happened?

thudzj commented 5 years ago

Thanks! Something like a 403 Forbidden error.

soskek commented 5 years ago

Hmm, that looks tough, though I'm not familiar with network conditions in China. One possible fix is adding a User-Agent header to the opener:

opener.addheaders = [('User-agent', 'Mozilla/5.0')]

In whichever download_*.py you're using, change it like this:

try:
    # Python 2
    from cookielib import CookieJar
    cj = CookieJar()
    import urllib2
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    import urllib
    urlretrieve = urllib.urlretrieve
except ImportError:
    # Python 3
    import http.cookiejar
    cj = http.cookiejar.CookieJar()
    import urllib.request  # importing plain urllib is not enough here
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    # install the opener globally so urlretrieve also sends the User-Agent
    urllib.request.install_opener(opener)
    urlretrieve = urllib.request.urlretrieve

If that doesn't change anything, then I give up!
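As a standalone illustration of the same idea, here is a minimal Python 3 sketch (the helper names `make_request` and `fetch` are illustrative, not part of this repo):

```python
import urllib.request

def make_request(url, user_agent="Mozilla/5.0"):
    # Attach a browser-like User-Agent; the default "Python-urllib/x.y"
    # agent is what some servers reject with 403 Forbidden.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch(url):
    # Download the body of `url`; raises urllib.error.HTTPError (e.g. 403)
    # if the server still refuses the request.
    with urllib.request.urlopen(make_request(url)) as resp:
        return resp.read()
```

Whether this works depends on why the server returned 403 in the first place; it only helps if the block keys on the User-Agent header.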

thudzj commented 5 years ago

Haha, I'll try. Thank you very much for the instant reply!

tshrjn commented 5 years ago

Hi there,

I'm also getting a 403 Forbidden error, though I'm able to successfully download via wget [URL], an example URL being: https://www.smashwords.com/books/download/12640/6/latest/0/0/eliminate-your-debt-like-a-pro.txt

Here's a screenshot for reference:

[screenshot: 403 Forbidden error, 2019-11-12]

soskek commented 5 years ago

Did you succeed with wget? I guess some kind of IP block happened.

tshrjn commented 5 years ago

Yes, I was able to download using wget.

tshrjn commented 5 years ago

Actually, no, it fails with wget as well, and adding --user-agent=Lynx to wget or using the Mozilla-agent code above in Python doesn't help either.

I'm on an us-east AWS EC2 instance.
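If the server is rate-limiting rather than blocking outright, retrying slowly can sometimes help. A sketch of a retry loop with backoff (all names here are illustrative; note that a persistent 403 from an IP-level block will not be fixed by retrying):

```python
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, retries=3, delay=2.0, user_agent="Mozilla/5.0"):
    # Retry transient failures with exponential backoff between attempts.
    last_err = None
    for attempt in range(retries):
        try:
            req = urllib.request.Request(url, headers={"User-Agent": user_agent})
            with urllib.request.urlopen(req) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code == 403:
                raise  # blocked outright; retrying won't help
            last_err = e
        except urllib.error.URLError as e:
            last_err = e
        time.sleep(delay * (2 ** attempt))  # back off before the next try
    raise last_err
```

Adding a delay between downloads in the crawl loop itself may also reduce the chance of triggering a block in the first place.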

soskek commented 4 years ago

Thank you for the information. As #24 also reported, crawling is getting harder.

@thudzj By the way, as shown in my comment (https://github.com/soskek/bookcorpus/issues/24#issuecomment-556024973), you can try the unknown file on Google Drive (at your own risk).