mead-ml / mead-baseline

Deep-Learning Model Exploration and Development for NLP
Apache License 2.0

Feature/add offsets to spm #911

Closed dpressel closed 2 years ago

dpressel commented 2 years ago

Changes the base SPM vectorizer to support adding extra_tokens. This is needed when the SPM vocab builder didn't add tokens like <pad>, <offset>, <bos> and <eos>, which are supported by the SPM library, and it may also be useful for other models (like MLMs, where we might want to add a [MASK] and a [CLS] token). To make it simpler to add all offsets at once, this adds support for a magic extra_token called {{OFFSETS}} (case insensitive) that expands to Offsets.VALUES. Also, since SPM allows users to redefine the offset values to any integers, _REWIRE_GLOBAL_OFFSETS() now conditionally resets the integer values of Offsets.VALUES to the SPM-defined ones if they aren't contained in extra_tokens.
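As a rough illustration of the {{OFFSETS}} expansion described above, here is a minimal sketch. The function name `expand_extra_tokens` and the `offset_values` argument are hypothetical and are not taken from the PR; in the library the expansion target would be Offsets.VALUES, whose contents are not shown here.

```python
from typing import List


def expand_extra_tokens(extra_tokens: List[str], offset_values: List[str]) -> List[str]:
    """Expand the magic {{OFFSETS}} entry (matched case-insensitively).

    Hypothetical helper: `offset_values` stands in for Offsets.VALUES.
    """
    expanded: List[str] = []
    for token in extra_tokens:
        if token.lower() == "{{offsets}}":
            # Replace the magic token with every offset token, in order
            expanded.extend(offset_values)
        else:
            expanded.append(token)
    return expanded


# Placeholder offset tokens, chosen only for illustration
tokens = expand_extra_tokens(["{{Offsets}}", "[MASK]"], ["<PAD>", "<GO>", "<EOS>", "<UNK>"])
```

With the placeholder values above, the magic entry is replaced in position by the four offset tokens and `[MASK]` is kept as-is.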

Also changes the default option for GPT2-style tokenization to False, to be consistent with other parts of the lib.

Finally, adds a policy for handling extra_tokens for BPE and SPM models (including any sub-flavor of SPM):

  1. if a token already exists in the vocab, it should not be added again
  2. if it doesn't exist, it should be prepended to the vocab
  3. whether or not it exists, it should be added to the special tokens
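The three rules above can be sketched roughly as follows. This is not the PR's implementation: `apply_extra_tokens` is a hypothetical name, the vocab is modeled as a plain token-to-id dict, and "prepended" is assumed to mean that missing tokens take the lowest ids while existing ids shift up.

```python
from typing import Dict, List, Set, Tuple


def apply_extra_tokens(
    vocab: Dict[str, int], extra_tokens: List[str]
) -> Tuple[Dict[str, int], Set[str]]:
    """Hypothetical sketch of the extra_tokens policy.

    Assumes `vocab` maps tokens to contiguous ids starting at 0.
    """
    # Rules 1 and 2: only tokens missing from the vocab are added (deduplicated, order kept)
    missing = [t for t in dict.fromkeys(extra_tokens) if t not in vocab]

    # Rule 2 (assumed meaning of "prepended"): missing tokens get ids 0..k-1
    new_vocab = {token: idx for idx, token in enumerate(missing)}
    shift = len(missing)
    for token, idx in vocab.items():
        new_vocab[token] = idx + shift

    # Rule 3: every extra token is special, whether or not it already existed
    special_tokens = set(extra_tokens)
    return new_vocab, special_tokens


new_vocab, special = apply_extra_tokens({"<unk>": 0, "a": 1}, ["<unk>", "[MASK]"])
```

In this example `<unk>` is already present, so only `[MASK]` is prepended, but both end up in the special-token set.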