Changes the base SPM vectorizer to support adding `extra_tokens`. This is needed when the SPM vocab builder
didn't add tokens like `<pad>`, `<offset>`, `<bos>`, and `<eos>`, which are supported by the SPM library, and it may be
useful for other models (like MLMs, where we might want to add a `[MASK]` and `[CLS]` token). To make it simpler to add
all offsets at once, this adds support for a magic extra token called `{{OFFSETS}}` (case insensitive) that expands to `Offsets.VALUES`. Also, since SPM allows users to redefine the offset values to arbitrary integers, this makes `_REWIRE_GLOBAL_OFFSETS()` conditionally reset the integer values of `Offsets.VALUES` to the SPM-defined ones when they aren't contained in `extra_tokens`.
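A minimal sketch of how the `{{OFFSETS}}` expansion and the conditional rewire could look. `Offsets.VALUES` mirrors the description above, but the helper names, the placeholder token strings/ids, and the `SentencePieceProcessor` usage are illustrative assumptions rather than the library's actual implementation:

```python
import sentencepiece as spm  # only needed to load a real SPM model


class Offsets:
    # Placeholder global offset tokens and ids standing in for the library's own
    VALUES = ["<PAD>", "<GO>", "<EOS>", "<UNK>"]
    PAD, GO, EOS, UNK = 0, 1, 2, 3


def expand_extra_tokens(extra_tokens):
    """Expand the magic {{OFFSETS}} entry (case insensitive) into Offsets.VALUES."""
    expanded = []
    for tok in extra_tokens:
        if tok.upper() == "{{OFFSETS}}":
            expanded.extend(Offsets.VALUES)
        else:
            expanded.append(tok)
    return expanded


def rewire_global_offsets(sp: "spm.SentencePieceProcessor", extra_tokens):
    """Reset each offset token's integer id to the id the SPM model assigns it,
    unless the caller already manages that token via extra_tokens."""
    extra = set(extra_tokens)
    names = ["PAD", "GO", "EOS", "UNK"]  # attribute names matching VALUES above
    for name, value in zip(names, Offsets.VALUES):
        if value not in extra:
            setattr(Offsets, name, sp.piece_to_id(value))
```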
Also changes the default option for GPT2-style tokenization to `False`, to be consistent with other parts of the lib.
Finally, adds a policy for handling `extra_tokens` for BPE and SPM models (including any sub-flavor of SPM); a rough sketch follows the list:

- if a token already exists in the vocab, it should not be added again
- if it doesn't exist, it should be prepended
- whether or not it exists, it should be added to the special tokens
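A rough sketch of that policy, assuming a simple token-to-id dict for the vocab (the helper name and signature are illustrative, not the library's actual API):

```python
def apply_extra_tokens(vocab, extra_tokens):
    """Apply the extra_tokens policy: never re-add a token that is already in
    the vocab, prepend any that are missing, and report every extra token as a
    special token either way."""
    missing = [tok for tok in extra_tokens if tok not in vocab]
    if missing:
        # Prepend: give the missing tokens the lowest ids (in order) and shift
        # every existing id up to make room
        vocab = {tok: idx + len(missing) for tok, idx in vocab.items()}
        for idx, tok in enumerate(missing):
            vocab[tok] = idx
    special_tokens = set(extra_tokens)
    return vocab, special_tokens


# Example: "<pad>" already exists so only "[MASK]" is prepended, but both are
# reported as special tokens
vocab = {"<pad>": 0, "hello": 1, "world": 2}
vocab, specials = apply_extra_tokens(vocab, ["<pad>", "[MASK]"])
# vocab -> {"[MASK]": 0, "<pad>": 1, "hello": 2, "world": 3}
# specials -> {"<pad>", "[MASK]"}
```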