hassonlab / 247-pickling

Contains code to create pickles from raw/processed data

Tokenizer vs fast tokenizer #81

Closed VeritasJoker closed 2 years ago

VeritasJoker commented 2 years ago

The current implementation using AutoTokenizer has some differences compared to previous model-based tokenizers. Harsha and I looked at the tokenizers we are using for gpt2-xl.

One of the differences is that AutoTokenizer defaults to 'use_fast=True', meaning it returns the fast (Rust-backed) tokenizer class (e.g. GPT2TokenizerFast) rather than the pure-Python one (GPT2Tokenizer).
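A minimal sketch of that difference, assuming `transformers` is installed and the "gpt2" tokenizer files can be downloaded:

```python
from transformers import AutoTokenizer

# default: use_fast=True -> Rust-backed fast tokenizer
fast_tok = AutoTokenizer.from_pretrained("gpt2")
# explicit use_fast=False -> pure-Python tokenizer
slow_tok = AutoTokenizer.from_pretrained("gpt2", use_fast=False)

print(type(fast_tok).__name__)  # GPT2TokenizerFast
print(type(slow_tok).__name__)  # GPT2Tokenizer
print(fast_tok.is_fast, slow_tok.is_fast)  # True False
```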

The other difference is that for gpt2 the tokenizer defaults to 'add_prefix_space=False', meaning no space is prepended to the input, so the same word gets a different token id at the start of a sentence than in the middle of one (where it carries a leading space).
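The position-dependent ids can be seen directly by encoding a word with and without a leading space (again assuming `transformers` and network access to the "gpt2" tokenizer files):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

# same word, different position: sentence-initial (no leading space)
# vs. mid-sentence (with the leading space it would carry there)
start_ids = tok.encode("hello")
mid_ids = tok.encode(" hello")
print(start_ids, mid_ids)  # two different single-token encodings
```

Since GPT-2's BPE vocabulary treats the leading space as part of the token, "hello" and " hello" map to two distinct vocabulary entries.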

We don't think these two parameters will make much difference in the encoding results, but for future embedding generation we probably want the tokenizer to have 'add_prefix_space=False' and 'use_fast=False'. Since 'use_fast=True' is already the default, we will need to pass 'use_fast=False' explicitly when calling the AutoTokenizer: https://github.com/hassonlab/247-pickling/blob/365f0e97dee41cbf5f70464b3c52aabbb6a05c85/scripts/tfsemb_download.py#L56
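A sketch of what the explicit call could look like (the model name here is illustrative; the pipeline passes its own model identifier):

```python
from transformers import AutoTokenizer

# pass both parameters explicitly rather than relying on defaults;
# add_prefix_space=False matches the gpt2 default, use_fast=False does not
tokenizer = AutoTokenizer.from_pretrained(
    "gpt2", use_fast=False, add_prefix_space=False
)
print(tokenizer.is_fast)  # False
```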

I'm not sure what AutoTokenizer defaults to for other models but will look into blenderbot and bert myself. @miahong if you have time, could you look into this issue for other models? Thanks!
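One way to survey the defaults across models is a quick loop over model names; this is a hypothetical check script, and the model list below is illustrative (blenderbot and bert checkpoints would slot in the same way):

```python
from transformers import AutoTokenizer

# record (class name, is_fast, add_prefix_space) for each model's default tokenizer;
# tokenizers without an add_prefix_space attribute (e.g. BERT's) report None
defaults = {}
for name in ["gpt2", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)
    defaults[name] = (
        type(tok).__name__,
        tok.is_fast,
        getattr(tok, "add_prefix_space", None),
    )
print(defaults)
```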

hvgazula commented 2 years ago

@VeritasJoker Can you please share the notebook here? I tried to test it on my end and I did not see any difference in the tokenizer encoding.

VeritasJoker commented 2 years ago

https://github.com/VeritasJoker/247-pickling/blob/bert/misc/gpt2_test.ipynb