soskek / bookcorpus

Crawl BookCorpus
MIT License

Can anyone download all the files in the url list file? #24

Open wxp16 opened 4 years ago

wxp16 commented 4 years ago

I tried to download the BookCorpus data, but so far I have only managed to get around 5,000 books. Has anyone been able to get all of them? I keep hitting HTTP Error: 403 Forbidden. How can I fix this? Or can I get the full BookCorpus data from somewhere else?

Thanks

RinaldsViksna commented 4 years ago

Same for me as well

RinaldsViksna commented 4 years ago

It seems this crawler is too aggressive and we got banned

soskek commented 4 years ago

Hmm. It seems that smashwords.com has made its blocking very strict. To avoid it, we would need either very patient (and maybe too slow) crawling, or proxy-based crawling with multiple proxy IPs, both of which are tough. I guess crawling BookCorpus has now become really difficult.
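
As a rough illustration of the "patient crawling" option (this is only a sketch, not the crawler in this repo; the URL-list file name and delay values are placeholders), the idea is to sleep between requests and back off hard whenever the server answers 403:

```python
import time
import requests

# Sketch of a "patient" download loop: a long fixed delay between requests,
# plus exponential backoff whenever smashwords.com answers 403.
# "file_list.txt" (one URL per line) and the delays are placeholders.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; bookcorpus-crawler)"}

def patient_download(url_file="file_list.txt", base_delay=30.0):
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    for url in urls:
        delay = base_delay
        while True:
            resp = requests.get(url, headers=HEADERS)
            if resp.status_code == 403:
                # Banned or throttled: wait much longer before retrying.
                delay = min(delay * 2, 3600)
                time.sleep(delay)
                continue
            resp.raise_for_status()
            yield url, resp.content
            break
        time.sleep(base_delay)  # stay polite even after a success
```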

soskek commented 4 years ago

I had hesitated to mention this until now, but it may help you and others in the NLP community.

I found a tweet by someone suggesting the existence of a copy of the original BookCorpus on Google Drive. The dataset seems to have 74004228 lines and 984846357 tokens, which matches the stats reported in the paper.

If you try, (of course!) please use it at your own risk.

prakharg24 commented 4 years ago

Hi.

Have you downloaded or used the bookcorpus mirror in the tweet that you just linked?

I tried downloading and working with that dataset, and I am running into an issue: I cannot tell where one book ends and the next begins in the combined txt file. It is just a continuous list of sentences. (I notice that in your code you mark boundaries by inserting four newlines after every book; that behavior is not present in this corpus.)

Do you have any suggestions?

Thank You

soskek commented 4 years ago

I have no idea. I don't even know what was distributed originally. Most language models (even document-level ones) and other methods, including skip-thought vectors, are not troubled by the lack of boundaries between books. So the original version might have been distributed the same way as the tweeted one. Wish you well.

prakharg24 commented 4 years ago

Oh, I see. I was actually worried about reproducing the 'Sentence-Order Prediction' task from the paper ALBERT: A Lite BERT.

They emphasize that the two sentences used during training come from the same document 50% of the time and from different documents the other 50%. I will try to read the fine print in their paper to see whether document separation is even an issue.

Thanks anyway. The tweet link was really helpful.

soskek commented 4 years ago

That does sound like an issue. One dirty (but maybe practically good) trick is to treat nearby lines (e.g., within 100 lines) as same-document and distant lines (e.g., more than 100,000 lines apart) as different-document. Of course, it would be good to ask the authors. Anyway, good luck!
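
If it helps, here is a minimal sketch of that heuristic (the 100 / 100,000 thresholds are just the example numbers above, and `lines` is assumed to be the corpus loaded as a list of sentences):

```python
import random

NEAR = 100       # example threshold: lines this close count as "same document"
FAR = 100_000    # example threshold: lines this far apart count as "different document"

def sample_pair(lines, same_document):
    """Return (sentence_a, sentence_b) under the line-distance heuristic.
    Assumes the corpus has far more than 2 * FAR lines (BookCorpus has ~74M)."""
    i = random.randrange(len(lines))
    if same_document:
        j = min(len(lines) - 1, i + random.randint(1, NEAR))
    else:
        offset = random.randint(FAR, len(lines) - FAR)
        j = (i + offset) % len(lines)  # wraps around but stays >= FAR lines away
    return lines[i], lines[j]
```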

jsc commented 4 years ago

I just tried to investigate further. For example, if you go to this URL -- https://www.smashwords.com/books/download/626006/8/latest/0/0/the-watchmakers-daughter.epub -- it says you have to have an account to read this book. Accounts are free, but it might take some work to get the crawler to use your login....
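
One way to sidestep scripting the login form (I haven't checked Smashwords' actual form fields, so this is only a sketch) is to log in once in a browser and reuse that session cookie with requests.Session; the cookie name and value below are placeholders:

```python
import requests

# Sketch: reuse a browser session instead of scripting the login form.
# Log in to smashwords.com in a browser, copy the session cookie from the
# developer tools, and attach it here. Cookie name and value are placeholders.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})
session.cookies.set("SESSION_COOKIE_NAME", "PASTE_VALUE_HERE",
                    domain="www.smashwords.com")

url = ("https://www.smashwords.com/books/download/626006/8/latest/0/0/"
       "the-watchmakers-daughter.epub")
resp = session.get(url)
resp.raise_for_status()
with open("the-watchmakers-daughter.epub", "wb") as f:
    f.write(resp.content)
```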

BillMK commented 4 years ago

As the paper says, the dataset contains 11038 books and 74004228 sentences in total, which matches the size of the tweeted dataset. So I split the corpus into pseudo-books of about 6700 sentences each (74004228 / 11038). Not sure whether this separation will affect the accuracy or not...
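
In case it is useful to anyone, a sketch of that split (input and output file names are placeholders, and the last chunk will be shorter than the rest):

```python
# Cut the single txt file into pseudo-books of ~6700 sentences each
# (74004228 / 11038 is roughly 6704; 6700 is the rounded figure above).
CHUNK = 6700

def split_into_pseudo_books(path="books_large.txt", out_pattern="book_{:05d}.txt"):
    book_id, buf = 0, []
    with open(path, encoding="utf-8") as f:
        for line in f:
            buf.append(line)
            if len(buf) == CHUNK:
                with open(out_pattern.format(book_id), "w", encoding="utf-8") as out:
                    out.writelines(buf)
                book_id, buf = book_id + 1, []
    if buf:  # write the remainder as the final pseudo-book
        with open(out_pattern.format(book_id), "w", encoding="utf-8") as out:
            out.writelines(buf)
```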

dgrahn commented 4 years ago

I can verify, as best as I can, that the link as of today is clean. It just extracts to two large text files.

richarddwang commented 4 years ago

Hi all, using the text mirror mentioned in the comment above, my PR that adds BookCorpus to HuggingFace/nlp has been merged. (The txt files have been copied to their own cloud storage.)

You should be able to download the dataset by book = nlp.load_dataset('bookcorpus')
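
For example (a minimal sketch, assuming the nlp package is installed):

```python
# Assumes `pip install nlp` (the library was later renamed to `datasets`).
import nlp

book = nlp.load_dataset('bookcorpus')
print(book)              # shows the available splits
print(book['train'][0])  # first sentence of the corpus
```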

soskek commented 4 years ago

Thank you @richarddwang! That is great work. I added a reference to nlp in the README of this repo.