helpmefindaname / transformer-smaller-training-vocab

Temporarily remove unused tokens during training to save RAM and speed up training.
https://helpmefindaname.github.io/transformer-smaller-training-vocab/
MIT License

fix handling of added special tokens in tokenizers #6

Closed · helpmefindaname closed this 1 year ago

helpmefindaname commented 1 year ago

Adding special tokens via `tokenizer.add_special_tokens({"additional_special_tokens": added_tokens})` appends the special tokens to the end of the vocabulary. For some tokenizers, the ids of these tokens were not updated after the vocabulary was reduced, which resulted in index errors when they were used.
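
For context, a minimal sketch of where the problem comes from, using `bert-base-uncased` as an example (the token strings here are purely illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
original_vocab_size = len(tokenizer)

# Added special tokens are appended to the end of the vocabulary,
# so their ids start at the original vocabulary size.
added_tokens = ["[NEW-1]", "[NEW-2]"]
tokenizer.add_special_tokens({"additional_special_tokens": added_tokens})

print(tokenizer.convert_tokens_to_ids(added_tokens))
# e.g. [30522, 30523] for bert-base-uncased (original vocab size 30522)

# If the vocabulary is then reduced to only the tokens seen in the training
# texts, these high ids point past the end of the new, smaller vocabulary
# unless they are remapped, which leads to index errors at lookup time.
```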

This PR correctly updates the ids of the added special tokens for all implemented tokenizers.
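
The idea behind the fix, as an illustrative sketch only (this is not the actual code in the PR; the function and parameter names are hypothetical): after the vocabulary has been reduced, the added special tokens need fresh ids relative to the new, smaller vocabulary instead of the ids they held in the full vocabulary.

```python
def remap_added_special_token_ids(
    reduced_vocab: dict[str, int], added_tokens: list[str]
) -> dict[str, int]:
    """Assign ids to added special tokens against the reduced vocabulary.

    Tokens that survived the reduction keep their new id; tokens that did not
    are appended at the end of the reduced vocabulary, mirroring how
    `add_special_tokens` appends them to the full vocabulary.
    """
    new_ids: dict[str, int] = {}
    next_id = len(reduced_vocab)
    for token in added_tokens:
        if token in reduced_vocab:
            new_ids[token] = reduced_vocab[token]
        else:
            new_ids[token] = next_id
            next_id += 1
    return new_ids
```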