The current implementation using AutoTokenizer differs in a couple of ways from the model-specific tokenizers we used before. Harsha and I looked at the tokenizers we are using for gpt2-xl.
One difference is that AutoTokenizer defaults to `use_fast=True`, which returns the Rust-backed "fast" tokenizer (e.g. `GPT2TokenizerFast`) instead of the Python one (`GPT2Tokenizer`).
The other difference is that AutoTokenizer defaults to `add_prefix_space=False`, meaning no space is prepended to the input. Because GPT-2's byte-level BPE vocabulary stores "word" and " word" as distinct tokens, the same word gets a different token id at the start of a sentence than in the middle of one.
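A quick sketch of the effect (using the small `gpt2` checkpoint, which shares the vocabulary family with gpt2-xl; the word "Hello" is just an arbitrary example):

```python
from transformers import AutoTokenizer

# Default: add_prefix_space=False, so sentence-initial words are encoded
# without a leading space.
tok_default = AutoTokenizer.from_pretrained("gpt2")
# Same tokenizer, but with a space prepended to the input automatically.
tok_prefix = AutoTokenizer.from_pretrained("gpt2", add_prefix_space=True)

word = "Hello"
ids_start = tok_default.encode(word)         # word at sentence start
ids_middle = tok_default.encode(" " + word)  # same word mid-sentence

# The two encodings use different token ids for the same surface word.
print(ids_start, ids_middle)

# With add_prefix_space=True, the sentence-initial word is encoded as if
# it followed a space, so it should line up with the mid-sentence ids.
print(tok_prefix.encode(word) == ids_middle)
```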
We don't think these two parameters make much difference in the encoding results, but for future embedding generation we probably want the tokenizer to have `add_prefix_space=False` and `use_fast=False`. Since `add_prefix_space=False` is already the default, that means passing `use_fast=False` explicitly when calling AutoTokenizer: https://github.com/hassonlab/247-pickling/blob/365f0e97dee41cbf5f70464b3c52aabbb6a05c85/scripts/tfsemb_download.py#L56
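For concreteness, a sketch of what the call could look like (the exact call site is the linked line in tfsemb_download.py; the `gpt2-xl` name follows the discussion above):

```python
from transformers import AutoTokenizer

# Pass use_fast=False explicitly so we get the Python GPT2Tokenizer
# rather than the default GPT2TokenizerFast. add_prefix_space=False is
# already the default, but spelling it out documents the intent.
tokenizer = AutoTokenizer.from_pretrained(
    "gpt2-xl",
    use_fast=False,
    add_prefix_space=False,
)
print(type(tokenizer).__name__)  # GPT2Tokenizer
```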
I'm not sure what AutoTokenizer defaults to for other models, but I will look into BlenderBot and BERT myself. @miahong if you have time, could you look into this issue for the other models? Thanks!
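One quick way to check the defaults per model is just to load each tokenizer and inspect what comes back (the checkpoint names below are examples I picked, not necessarily the ones our pipeline uses):

```python
from transformers import AutoTokenizer

# Example checkpoints for the models mentioned above; substitute the
# names our pipeline actually downloads.
for name in ("bert-base-uncased", "facebook/blenderbot-400M-distill"):
    tok = AutoTokenizer.from_pretrained(name)
    # .is_fast tells us whether the fast (Rust) implementation was loaded.
    print(name, type(tok).__name__, tok.is_fast)
```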