matlab-deep-learning / transformer-models

Deep Learning Transformer models in MATLAB

Tokenizer optimizations #11

Closed · bwdGitHub closed 3 years ago

bwdGitHub commented 3 years ago

A few changes that speed up the tokenizer, roughly 2-3x in the cases I've checked:

  1. Remove redundant whitespace tokenization in BasicTokenizer.
  2. Convert the basic-tokenized tokens to UTF-32 in a single call in FullTokenizer, and modify WordPieceTokenizer to accept UTF-32 input (see the sketch after this list).
  3. Call sub.string() only once in WordPieceTokenizer.
  4. Remove input validation from WhitespaceTokenizer, which may be called many times.
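
The idea behind item 2 is just to hoist the conversion out of the per-token loop. A minimal sketch, using `double(char(...))` as a hypothetical stand-in for the repo's string-to-UTF-32 conversion (the actual FullTokenizer/WordPieceTokenizer internals may differ):

```matlab
% Sketch of item 2: do the string -> UTF-32 conversion once for the whole
% batch of basic-tokenized tokens instead of once per token.
% double(char(...)) is a hypothetical stand-in for the real conversion.

tokens = ["the" "quick" "brown" "fox"];   % output of BasicTokenizer (placeholder)

% Before: one conversion call per token, inside WordPieceTokenizer.
perToken = arrayfun(@(t) double(char(t)), tokens, 'UniformOutput', false);

% After: a single conversion in FullTokenizer; WordPieceTokenizer is then
% modified to accept the already-converted input directly.
batch = double(char(join(tokens, " ")));
```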

I have not found any differences in the tokenizations I've tried. The changes in BasicTokenizer should be safe provided WhitespaceTokenizer.tokenize commutes with the operations that follow it (lowercasing, NFD normalization, splitting on punctuation), since there is already a final call to WhitespaceTokenizer.tokenize. The change in WordPieceTokenizer/FullTokenizer is safe as long as WhitespaceTokenizer.tokenize is idempotent, i.e. WhitespaceTokenizer.tokenize(WhitespaceTokenizer.tokenize(x)) = WhitespaceTokenizer.tokenize(x) for every input, which seems reasonable.
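
As a quick sanity check of the idempotence assumption, here is a minimal sketch that uses strsplit as a stand-in for WhitespaceTokenizer.tokenize (the stand-in and the test string are hypothetical, not taken from the repo):

```matlab
% Idempotence check: tokenizing an already-tokenized result should give
% the same tokens. strsplit stands in for WhitespaceTokenizer.tokenize.
tokenize = @(str) strsplit(strtrim(str));

text  = "  BERT uses   WordPiece  tokenization ";
once  = tokenize(text);
twice = tokenize(join(once, " "));   % re-tokenize the joined tokens

assert(isequal(once, twice), "whitespace tokenization should be idempotent")
```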