A few changes that speed up the tokenizer, roughly 2-3x in the cases I've checked:
- Remove redundant white-space tokenization in BasicTokenizer.
- Convert basic-tokenized tokens to UTF32 in one call in FullTokenizer, and modify WordPieceTokenizer to accept UTF32 as input.
- Only call sub.string() once in WordPieceTokenizer.
- Remove input validation in WhitespaceTokenizer, which may be called many times.
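To illustrate the "call sub.string() once" point: greedy longest-match WordPiece repeatedly tests candidate substrings against the vocabulary, so materializing each candidate string once (and reusing it for the `##`-prefixed lookup) avoids redundant conversions in the inner loop. This is a minimal sketch of that pattern, not the project's actual code; the function name, vocab type, and `[UNK]` handling are illustrative:

```python
def wordpiece(token, vocab, max_chars=100):
    """Greedy longest-match-first WordPiece (sketch).

    Each candidate substring is built once per (start, end) pair and
    reused for the vocabulary lookup, rather than being re-derived on
    every comparison.
    """
    if len(token) > max_chars:
        return ["[UNK]"]
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        cur_piece = None
        while start < end:
            substr = token[start:end]  # built once per candidate
            if start > 0:
                substr = "##" + substr  # continuation-piece prefix
            if substr in vocab:
                cur_piece = substr
                break
            end -= 1
        if cur_piece is None:
            return ["[UNK]"]  # no piece matched: whole token is unknown
        pieces.append(cur_piece)
        start = end
    return pieces
```

For example, with a toy vocab `{"un", "##aff", "##able"}`, `wordpiece("unaffable", ...)` yields `["un", "##aff", "##able"]`.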
I have not found any differences in the tokenizations I've tried. The changes in BasicTokenizer should be fine as long as WhitespaceTokenizer.tokenize commutes with the operations that follow it (lowercasing, NFD normalization, splitting on punctuation), since there is already a final call to WhitespaceTokenizer.tokenize. The change in WordPieceTokenizer/FullTokenizer is fine as long as WhitespaceTokenizer.tokenize(WhitespaceTokenizer.tokenize(x)) = WhitespaceTokenizer.tokenize(x) in every case (which seems reasonable).
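The idempotence assumption can be sanity-checked with a minimal stand-in for whitespace tokenization (the function here is an illustrative strip-and-split, not the project's actual implementation):

```python
def whitespace_tokenize(text):
    """Minimal whitespace tokenizer: strip, then split on whitespace runs."""
    text = text.strip()
    if not text:
        return []
    return text.split()

# Idempotence check: re-tokenizing already-split tokens changes nothing,
# i.e. tokenize(join(tokenize(x))) == tokenize(x).
for sample in ["hello  world", "  a\tb\nc ", "", "one"]:
    once = whitespace_tokenize(sample)
    twice = whitespace_tokenize(" ".join(once))
    assert once == twice, (sample, once, twice)
```

Under this model the property holds trivially, since splitting on whitespace runs and rejoining with single spaces is a fixed point of the tokenizer.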