We resolved this internally by allowing `bert.tokenizer.internal.FullTokenizer` to use `tokenizedDocument` for tokenization in place of `bert.tokenizer.internal.BasicTokenizer`, which reimplements the original BERT tokenization.
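For anyone curious, here is a rough sketch of what that swap could look like: using `tokenizedDocument` from Text Analytics Toolbox for the word-level (basic) step before WordPiece splitting. The variable names are only illustrative; this is not the internal `FullTokenizer` code.

```matlab
% Sketch: word-level pre-tokenization via tokenizedDocument, in place of a
% hand-written basic tokenizer. Illustrative only, not the FullTokenizer API.
str = "The quick brown fox can't jump!";
doc = tokenizedDocument(lower(str));   % splits words and punctuation
details = tokenDetails(doc);           % table with one row per token
words = details.Token;                 % string array of basic tokens
% Each entry of words would then go through WordPiece subword splitting.
```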
That extension wasn't pushed to this repo; I'll create an issue to see whether there is any interest in it.
Information about which tokens are "normal" word tokens and which are the result of subword tokenization can be very useful downstream.
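For example, with standard WordPiece output the `##` continuation prefix already carries that information. Below is a minimal sketch; the `tokens` array is just a hypothetical example of tokenizer output, not a result from this repo's tokenizer.

```matlab
% Sketch: recovering "normal" vs. subword tokens from WordPiece-style output.
% The "##" continuation prefix is the standard BERT WordPiece convention.
tokens = ["the" "embed" "##ding" "layer"];   % hypothetical WordPiece output
isSubword = startsWith(tokens, "##");        % true for continuation pieces
isWordStart = ~isSubword;                    % tokens that begin a new word
% Downstream code can group each word-start token with its following
% continuation pieces, e.g. to pool subword embeddings back to word level.
```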