huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Bookcorpus data contains pretokenized text #486

Closed orsharir closed 1 year ago

orsharir commented 4 years ago

It seems that the BookCorpus data downloaded through the library was pretokenized with NLTK's Treebank tokenizer, which changes the text in ways that are incompatible with how, for instance, BERT's WordPiece tokenizer works. For example, "didn't" becomes "did" + "n't", and double quotes are changed to `` and '' for start and end quotes, respectively.

In my own projects, I just run the data through NLTK's TreebankWordDetokenizer to reverse the tokenization (as well as possible). I think it would be beneficial to apply this transformation directly to your remote cached copy of the dataset. If you choose to do so, I would also suggest using my fork of NLTK, which fixes several bugs in their detokenizer (I've opened a pull request, but they have yet to respond): https://github.com/nltk/nltk/pull/2575
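As a quick illustration (the raw line below is hypothetical, and the output in the comment is approximate):

from nltk.tokenize.treebank import TreebankWordDetokenizer

# a line as it might appear in the pretokenized dump
raw = "he said , `` i did n't do it . ''"
print(TreebankWordDetokenizer().detokenize(raw.split()))
# roughly: he said, "i didn't do it."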

lhoestq commented 4 years ago

Yes indeed, it looks like some apostrophes and spaces are missing (for example in "dont" or "didnt"). Do you know if there are copies without this issue? How exactly would you fix this issue on the current data? I can see that the data is raw text (not tokenized), so I'm not sure I understand how you would do it. Could you provide more details?

orsharir commented 4 years ago

I'm afraid that I don't know how to obtain the original BookCorpus data. I believe this version came from an anonymous Google Drive link posted in another issue.

Going through the raw text in this version, it's apparent that NLTK's TreebankWordTokenizer was applied to it (I gave some examples in my original post), followed by ' '.join(tokens). You can recover the tokenization by splitting on whitespace, and then "detokenize" it with NLTK's TreebankWordDetokenizer class (though, as I suggested, use the fixed version in my repo). This brings the text closer to its original form, but some steps of TreebankWordTokenizer are destructive, so the mapping isn't one-to-one. Something along the lines of the following should work:

import nltk
import nlp
treebank_detokenizer = nltk.tokenize.treebank.TreebankWordDetokenizer()
db = nlp.load_dataset('bookcorpus', split=nlp.Split.TRAIN)
# map expects a dict of column updates, so wrap the detokenized text
db = db.map(lambda x: {'text': treebank_detokenizer.detokenize(x['text'].split())})

Regarding other issues beyond the above, I'm afraid that I can't help with that.

lhoestq commented 4 years ago

Ok I get it, that would be very cool indeed.

What kinds of patterns can't the detokenizer recover?

orsharir commented 4 years ago

The TreebankWordTokenizer makes some assumptions about whitespace, parentheses, quotation marks, etc. For instance, tokenizing the following text:

Dwayne "The Rock" Johnson

will result in:

Dwayne `` The Rock '' Johnson

where the left and right quotation marks are turned into distinct symbols. Upon reconstruction, we can attach the left quotation mark to the token on its right, and the right quotation mark to the token on its left. However, the following texts would all be tokenized exactly the same way:

Dwayne " The Rock " Johnson
Dwayne " The Rock" Johnson
Dwayne     " The Rock" Johnson
...

In the above examples, the detokenizer would normalize all of these inputs to the canonical text:

Dwayne "The Rock" Johnson

However, there are cases where the correct form cannot easily be inferred (at least not without a true language model; this tokenizer is just a bunch of regexes), for instance when a fragment contains the end of a quote but not its beginning, plus an accidental space:

... and it sounds fantastic, " he said.

In the above case, the tokenizer treats the quotation mark as an opening quote belonging to the next token, so upon detokenization it produces the following mistake:

... and it sounds fantastic, "he said.

While these are all odd edge cases (the basic assumptions do make sense), they do occur in noisy data, which is why I said the detokenizer cannot restore the original text perfectly.
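To make this concrete, here is a minimal sketch of the round trip with stock NLTK (the outputs in the comments are approximate):

from nltk.tokenize.treebank import TreebankWordTokenizer, TreebankWordDetokenizer

tokenizer = TreebankWordTokenizer()
detokenizer = TreebankWordDetokenizer()

# the canonical case round-trips cleanly
tokens = tokenizer.tokenize('Dwayne "The Rock" Johnson')
print(tokens)                          # something like: ['Dwayne', '``', 'The', 'Rock', "''", 'Johnson']
print(detokenizer.detokenize(tokens))  # Dwayne "The Rock" Johnson

# the ambiguous fragment: the stray space makes the quote look like an opening quote
tokens = tokenizer.tokenize('... and it sounds fantastic, " he said.')
print(detokenizer.detokenize(tokens))  # roughly: ... and it sounds fantastic, "he said.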

arvieFrydenlund commented 4 years ago

To confirm: since this data is preprocessed, it is not the exact version of BookCorpus used to actually train the models described here (particularly DistilBERT)? https://huggingface.co/datasets/bookcorpus

Or does this preprocessing exactly match that of the papers?

orsharir commented 4 years ago

I believe these are just artifacts of this particular source. It might be better to crawl it again, or use another preprocessed source, as found here: https://github.com/soskek/bookcorpus

richarddwang commented 3 years ago

Yes, actually the BookCorpus on Hugging Face is based on this, and I kind of regret naming it "BookCorpus" instead of something like "BookCorpusLike".

But there is good news! @shawwn has replicated BookCorpus in his own way, and also provided a link to download the plain text files (see here). There is a chance we can have an "OpenBookCorpus"!

mariosasko commented 1 year ago

Resolved via #856