We resolved this internally by allowing `bert.tokenizer.internal.FullTokenizer` to use `tokenizedDocument` for tokenization in place of `bert.tokenizer.internal.BasicTokenizer`, which reimplements the original BERT tokenization.
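For anyone curious, here is a rough sketch of what that swap could look like: using `tokenizedDocument` from Text Analytics Toolbox for the word-level (basic) step before WordPiece splitting. The variable names are only illustrative; this is not the internal `FullTokenizer` code.

```matlab
% Sketch: word-level pre-tokenization via tokenizedDocument, in place of a
% hand-written basic tokenizer. Illustrative only, not the FullTokenizer API.
str = "The quick brown fox can't jump!";
doc = tokenizedDocument(lower(str));   % splits words and punctuation
details = tokenDetails(doc);           % table with one row per token
words = details.Token;                 % string array of basic tokens
% Each entry of words would then go through WordPiece subword splitting.
```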
That extension wasn't pushed to this repo; I'll create an issue to see whether there is any interest in it.
Information about which tokens are "normal" word tokens and which are the result of subword tokenization can be very useful downstream.
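For example, with standard WordPiece output the `##` continuation prefix already carries that information. Below is a minimal sketch; the `tokens` array is just a hypothetical example of tokenizer output, not a result from this repo's tokenizer.

```matlab
% Sketch: recovering "normal" vs. subword tokens from WordPiece-style output.
% The "##" continuation prefix is the standard BERT WordPiece convention.
tokens = ["the" "embed" "##ding" "layer"];   % hypothetical WordPiece output
isSubword = startsWith(tokens, "##");        % true for continuation pieces
isWordStart = ~isSubword;                    % tokens that begin a new word
% Downstream code can group each word-start token with its following
% continuation pieces, e.g. to pool subword embeddings back to word level.
```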