jalammar / ecco

Explain, analyze, and visualize NLP language models. Ecco creates interactive visualizations directly in Jupyter notebooks explaining the behavior of Transformer-based language models (like GPT2, BERT, RoBERTa, T5, and T0).
https://ecco.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Tokenizer has partial token suffix instead of prefix #65

Open guustfranssensEY opened 2 years ago

guustfranssensEY commented 2 years ago

Following your guide for identifying the model configuration:

MODEL_ID = "vinai/bertweet-base"

from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, normalization=True, use_fast=False)
ids = tokenizer('tokenization')
ids

returns:

{'input_ids': [0, 969, 6186, 6680, 2], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

Then

tokenizer.convert_ids_to_tokens(ids['input_ids'])

returns:

['<s>', 'to@@', 'ken@@', 'ization', '</s>']

Here I noticed that the tokenizer marks partial tokens with a suffix ('@@') rather than a prefix. Ecco's model configuration only allows specifying a partial-token prefix; there is no option to specify a suffix instead.
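
A possible workaround on my side (just a sketch, not part of Ecco's API; the '##' prefix marker below is only for illustration) would be to move the continuation marker from the end of each partial token to the start of the token that follows it, so the token list follows a prefix convention instead:

SUFFIX = "@@"   # BERTweet's partial-token marker
PREFIX = "##"   # illustrative prefix marker, not something Ecco defines

def suffix_to_prefix(tokens):
    # Move the continuation marker from the end of a token to the
    # start of the next token, e.g.
    # ['to@@', 'ken@@', 'ization'] -> ['to', '##ken', '##ization']
    out = []
    continue_next = False
    for tok in tokens:
        is_partial = tok.endswith(SUFFIX)
        if is_partial:
            tok = tok[:-len(SUFFIX)]
        out.append(PREFIX + tok if continue_next else tok)
        continue_next = is_partial
    return out

suffix_to_prefix(['<s>', 'to@@', 'ken@@', 'ization', '</s>'])
# ['<s>', 'to', '##ken', '##ization', '</s>']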

jalammar commented 2 years ago

Oh wow, I've never come across such a tokenizer. That's interesting...