wxp16 opened this issue 4 years ago
Same for me as well
It seems this crawler is too aggressive and we got banned
Hmm. It seems that smashwords.com has made its blocking very strict. To avoid it, we would need really patient (maybe too slow) crawling, or proxy-based crawling through multiple proxy IPs, both of which are tough. I guess crawling BookCorpus has become really difficult now.
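Just to illustrate what I mean by "patient" crawling, here is a rough sketch (the delay values, the back-off time, and the `urls` list are placeholders, not what this repo does):

```python
import random
import time

import requests


def polite_download(urls, min_delay=30, max_delay=90):
    """Fetch each URL with a long randomized pause in between,
    backing off hard whenever the server starts refusing requests."""
    session = requests.Session()
    session.headers["User-Agent"] = "research-crawler (contact: you@example.com)"
    for url in urls:
        resp = session.get(url)
        if resp.status_code in (403, 429, 503):
            # Looks like we are being blocked or rate-limited: back off for a long time.
            time.sleep(30 * 60)
            resp = session.get(url)
        if resp.ok:
            yield url, resp.content
        # Long randomized pause so the traffic pattern is less bot-like.
        time.sleep(random.uniform(min_delay, max_delay))
```

Even with something like this, crawling the whole corpus would take a very long time.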
I had hesitated to mention this until now, but it may help you and others in the NLP community.
I found a tweet suggesting the existence of a copy of the original BookCorpus on Google Drive. The dataset seems to have 74,004,228 lines and 984,846,357 tokens, which matches the stats reported in the paper.
If you try it, please (of course!) use it at your own risk.
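If you do download it, one quick sanity check is to count the lines and whitespace tokens yourself and compare against the numbers above. A minimal sketch (the file names are placeholders for whatever the archive actually extracts):

```python
# Placeholder file names -- use whatever the archive actually extracts to.
files = ["books_large_p1.txt", "books_large_p2.txt"]

n_lines = 0
n_tokens = 0
for path in files:
    with open(path, encoding="utf-8") as f:
        for line in f:
            n_lines += 1
            n_tokens += len(line.split())

print(n_lines)   # expect 74004228
print(n_tokens)  # roughly 984846357, depending on how tokens are counted
```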
Hi.
Have you downloaded or used the BookCorpus mirror from the tweet you just linked?
I tried downloading and working with that dataset, but I am running into an issue: I cannot tell where one book ends and the next begins in the combined txt file. It is just a continuous list of sentences. (I notice that in your code you mark book boundaries by inserting four newlines after every book; that behavior is not present in that corpus.)
Do you have any suggestions?
Thank you
I have no idea; I don't even know what was distributed originally. Most language models (even document-level ones) and other methods, including skip-thought vectors, are not troubled by the lack of boundaries between books. So the original version may well have been distributed in the same form as the tweeted one. Wish you well.
Ohh, I see. I was actually worried about reproducing the 'Sentence-Order Prediction' task from the ALBERT paper (ALBERT: A Lite BERT).
They emphasize that the two sentences taken during training come from the same document 50% of the time and from different documents the other 50%. I will try to read the fine print in their paper to see whether document separation is even an issue.
Thanks anyway. The tweet link was really helpful.
That does sound like an issue. One dirty (but maybe practically good) trick is to treat nearby lines (e.g. within 100 lines) as the same document and distant lines (e.g. more than 100,000 lines apart) as different documents. Of course, it would be best to ask the authors. Anyway, good luck!
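For what it's worth, a rough sketch of that trick (the 100 / 100,000 thresholds are just the arbitrary examples above, and `sentences` is assumed to be the corpus loaded as a list of lines):

```python
import random


def sample_pair(sentences, near=100, far=100_000):
    """Return (sentence_a, sentence_b, is_same_doc) using line distance as a
    stand-in for document membership. `sentences` is the corpus as a list of lines."""
    i = random.randrange(len(sentences) - near)
    if random.random() < 0.5:
        # Treat lines within `near` of each other as "same document".
        j = i + random.randint(1, near)
        return sentences[i], sentences[j], 1
    # Treat lines more than `far` apart as "different documents".
    j = random.randrange(len(sentences))
    while abs(i - j) < far:
        j = random.randrange(len(sentences))
    return sentences[i], sentences[j], 0
```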
I just tried to investigate further. For example, if you go to this URL -- https://www.smashwords.com/books/download/626006/8/latest/0/0/the-watchmakers-daughter.epub -- it says you have to have an account to read this book. Accounts are free, but it might take some work to get the crawler to use your login....
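If anyone wants to try, one simple approach is to reuse a logged-in browser session rather than scripting the login form. A minimal, untested sketch with `requests` (the cookie name below is a placeholder; copy the real session cookie from your browser's developer tools):

```python
import requests

# Placeholder cookie: log in with a browser, then copy the session cookie's
# name and value from the developer tools into the crawler's session.
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0"
session.cookies.set("SESSION_COOKIE_NAME", "value-from-your-browser",
                    domain="www.smashwords.com")

url = ("https://www.smashwords.com/books/download/626006/8/latest/0/0/"
       "the-watchmakers-daughter.epub")
resp = session.get(url)
if resp.ok:
    with open("the-watchmakers-daughter.epub", "wb") as f:
        f.write(resp.content)
```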
As the paper says, the dataset contains 11,038 books and 74,004,228 sentences in total, which matches the size of the tweeted dataset. So I split the text into pseudo-books of about 6,700 sentences each (74,004,228 / 11,038). Not sure whether this separation will affect accuracy or not...
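In case it helps anyone else, a minimal sketch of that split (file names are placeholders for the combined mirror file and the output):

```python
SENTENCES_PER_BOOK = 6700  # ~ 74004228 / 11038

# File names are placeholders -- point these at the combined mirror file.
with open("bookcorpus_all.txt", encoding="utf-8") as src, \
     open("bookcorpus_split.txt", "w", encoding="utf-8") as dst:
    for i, line in enumerate(src):
        if i > 0 and i % SENTENCES_PER_BOOK == 0:
            # Blank lines as a pseudo-book boundary (similar in spirit to the
            # newlines this repo writes after each real book).
            dst.write("\n\n\n")
        dst.write(line)
```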
I can verify, as best I can, that the link as of today is clean. It just extracts two large text files.
Hi all, using the text mirror mentioned in the comment above, my PR that adds BookCorpus to HuggingFace/nlp has been merged. (The txt files have been copied to their own cloud storage.)
You should be able to download the dataset with `book = nlp.load_dataset('bookcorpus')`.
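For example (assuming the `nlp` package is installed; in newer versions the package is called `datasets`, but the call looks the same):

```python
import nlp

dataset = nlp.load_dataset("bookcorpus", split="train")
print(dataset)              # row count and features
print(dataset[0]["text"])   # first line of the corpus
```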
Thank you @richarddwang! Great work. I added a reference to nlp in the README of this repo.
I tried to download the BookCorpus data. So far I have downloaded only around 5,000 books. Has anyone managed to get all the books? I hit a lot of

`HTTP Error: 403 Forbidden`

How can I fix this? Or can I get all of the BookCorpus data from somewhere else? Thanks