Additionally, is there a way to retrieve (and edit) the merge rules from "slow" & "fast" tokenizers respectively?
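For the fast tokenizer, the merge rules live in the serialized `tokenizer.json` under `model.merges`, so they can be read and edited with plain JSON tooling. A minimal sketch on an inline stand-in (a real `tokenizer.json` has many more fields, and newer `tokenizers` versions may serialize each merge as a two-element list rather than a space-separated string):

```python
import json

# minimal stand-in for the tokenizer.json a fast BPE tokenizer saves to disk
tokenizer_json = json.loads("""
{
  "model": {
    "type": "BPE",
    "vocab": {"\\u2581super": 0, "long": 1, "word": 2},
    "merges": ["\\u2581super long"]
  }
}
""")

# retrieve the merge rules
merges = tokenizer_json["model"]["merges"]
print(merges)  # ['▁super long']

# edit: append a new merge rule; the dict can then be dumped back to disk
merges.append("\u2581superlong word")
```

The slow (sentencepiece-based) tokenizer has no such merges list to edit, which is part of why the two behave differently here.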
Hey! A few things here. What you are trying to do is outside the scope of the supported features. Adding a token should be done using the tokenizer.add_tokens function.
The fast version is, to me, more correct than what you expect: if there are no merges, there is absolutely no reason for the BPE model to fuse '▁super', 'long', 'word' into superlongword. Thus the slow version seems more wrong, specifically because sentencepiece does not really allow adding tokens that way.
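The point about merges can be illustrated with a toy BPE loop (a sketch; the real model in the `tokenizers` library is more involved): even with the long token in the vocabulary, nothing ever fuses unless a merge rule says so.

```python
def bpe(pieces, merges):
    """Repeatedly apply the highest-priority merge until none applies.

    `merges` is an ordered list of (left, right) pairs; a lower index means
    a higher priority, as in a BPE merges file.
    """
    pieces = list(pieces)
    while True:
        best = None  # (rank, position) of the best applicable merge
        for i in range(len(pieces) - 1):
            pair = (pieces[i], pieces[i + 1])
            if pair in merges:
                rank = merges.index(pair)
                if best is None or rank < best[0]:
                    best = (rank, i)
        if best is None:
            return pieces
        _, i = best
        pieces[i:i + 2] = [pieces[i] + pieces[i + 1]]

pieces = ["\u2581super", "long", "word"]
# no merge rules: the pieces stay separate, whatever the vocab contains
print(bpe(pieces, []))  # ['▁super', 'long', 'word']
# with explicit merge rules, the pieces do fuse
merges = [("\u2581super", "long"), ("\u2581superlong", "word")]
print(bpe(pieces, merges))  # ['▁superlongword']
```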
System Info
transformers version: 4.35.2

Who can help?
@ArthurZucker

Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction

Expected behavior
I faced a similar issue as raised by a question in the HF forum, where the OP trained the tokenizer with user_defined_symbols, while in my case I added to the SentencePiece model file directly without training.
Note that I could just use the add_tokens method to achieve the same outcome, but because of another issue that I raised, #28218, I would like to avoid the add_tokens method if possible.
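For context, sentencepiece matches user_defined_symbols as whole pieces before any BPE segmentation of the surrounding text. A rough pure-Python sketch of that matching step (the function name and details are illustrative, not the actual sentencepiece implementation):

```python
import re

def split_on_user_defined(text, user_defined_symbols):
    """Split `text` so each user-defined symbol becomes its own piece,
    leaving the remaining spans for the BPE model to segment afterwards.
    Illustrative sketch only, not sentencepiece's real algorithm."""
    if not user_defined_symbols:
        return [text]
    # longest symbols first, so overlapping symbols match greedily
    pattern = "|".join(
        re.escape(s)
        for s in sorted(user_defined_symbols, key=len, reverse=True)
    )
    parts = re.split(f"({pattern})", text)
    return [p for p in parts if p]

print(split_on_user_defined("a superlongword b", ["superlongword"]))
# ['a ', 'superlongword', ' b']
```

This is why a symbol registered this way always surfaces as one piece, whereas a token spliced into the model file without matching merges need not.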