Acatsama0871 opened this issue 4 days ago
Hello,

First, I would like to thank you for open-sourcing such a well-designed, high-quality code base.

While reading the source code, I ran into a question about this part (`integrations/transformers.py`): why is the same token ID decoded twice here, and what does "start word" mean in this context? Thanks!

---

We decode the token ID twice: once on its own, and once after the representation of "0", to determine whether or not it is a word-start token. A word-start token yields an extra character when decoded after "0" compared to when it is decoded alone, and that difference is what we check for. Some tokenizers expose this information directly, but this method is tokenizer-agnostic.
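For concreteness, here is a minimal sketch of the check described above, written against the Hugging Face `transformers` tokenizer API. It is not the library's actual code: the function name `is_word_start_token`, the choice of `"0"` as the anchor string, and the `t5-small` tokenizer in the usage example are all illustrative assumptions.

```python
from transformers import AutoTokenizer


def is_word_start_token(tokenizer, token_id: int, anchor: str = "0") -> bool:
    """Heuristic check: does `token_id` begin a new word?

    Decode the token twice -- once alone, once after an anchor token --
    and compare lengths. A word-start token picks up an extra character
    (typically the leading space) when decoded after other text.
    """
    # Token ID of the anchor string; "0" is a convenient single-token anchor.
    anchor_id = tokenizer.encode(anchor, add_special_tokens=False)[0]

    alone = tokenizer.decode([token_id])
    anchored = tokenizer.decode([anchor_id, token_id])
    anchor_text = tokenizer.decode([anchor_id])

    # If decoding in context produced more than anchor + token alone,
    # the extra character is the word boundary.
    return len(anchored) > len(anchor_text) + len(alone)


if __name__ == "__main__":
    # Model name is illustrative; any SentencePiece-style tokenizer
    # shows the effect, since decoding strips a leading space.
    tok = AutoTokenizer.from_pretrained("t5-small")
    hello_id = tok.encode("hello", add_special_tokens=False)[0]
    print(is_word_start_token(tok, hello_id))  # True: "hello" starts a word
```

The intuition: a SentencePiece-style word-start token such as `▁hello` decodes to `"hello"` on its own (the leading space is stripped at the start of a string) but to `" hello"` after `"0"`, so the combined decode is one character longer than the two pieces decoded separately. A mid-word token like `llo` decodes identically in both positions, so the lengths match.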