language-brainscore / langbrainscore

[Marked for Deprecation. please visit https://github.com/brain-score/language for the migrated project] Benchmarking of Language Models using Human Neural and Behavioral experiment data
https://language-brainscore.github.io/langbrainscore/
MIT License

We cannot rely on out-of-context tokenization to calculate tokenized offset lengths #31

Closed aalok-sathe closed 2 years ago

aalok-sathe commented 2 years ago

The tokenized length of a word can differ depending on its context, as illustrated by the example below.

This is a source of possible inconsistency in our ANN encode method: the tokenization of a word is not uniform across contexts, so we cannot rely on the tokenized length of a word in isolation to calculate its offsets.

In [1]: from transformers import AutoModel, AutoTokenizer
In [2]: t = AutoTokenizer.from_pretrained('distilgpt2')

In [3]: t.decode([30119, 9015, 354, 5973])
Out[3]: 'past chickenchicken'

In [4]: t.decode(1)
Out[4]: '"'

In [5]: t.decode([354])
Out[5]: 'ch'

In [6]: t.decode([354, 5973])
Out[6]: 'chicken'

In [7]: [*map(t.decode, [354, 5973])]
Out[7]: ['ch', 'icken']

In [8]: t('chicken')
Out[8]: {'input_ids': [354, 5973], 'attention_mask': [1, 1]}

In [9]: t('tasty chicken')
Out[9]: {'input_ids': [83, 7833, 9015], 'attention_mask': [1, 1, 1]}

In [10]: t.decode([9015])
Out[10]: ' chicken'

We need to evaluate the situations in which this would be an issue, and whether it will affect the output of encode.
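
One way to sidestep the problem (a minimal sketch, not necessarily the approach adopted in #19) is to tokenize the full context once with a fast tokenizer and use its character-level offset mapping to recover which tokens belong to each word, rather than tokenizing words in isolation. This assumes the tokenizers-backed "fast" tokenizer is available for the model, and the whitespace-split word spans plus overlap test below are illustrative simplifications.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilgpt2")  # fast tokenizer by default

text = "tasty chicken"
enc = tok(text, return_offsets_mapping=True)  # requires a fast tokenizer

# Character span of each whitespace-delimited word in the original string.
words = text.split()
spans, pos = [], 0
for w in words:
    start = text.index(w, pos)
    spans.append((start, start + len(w)))
    pos = start + len(w)

# Assign each token to the word whose character span it overlaps.
groups = [[] for _ in words]
for tid, (s, e) in zip(enc["input_ids"], enc["offset_mapping"]):
    for i, (ws, we) in enumerate(spans):
        if s < we and e > ws:  # spans overlap
            groups[i].append(tid)
            break

print(list(zip(words, groups)))
# e.g. [('tasty', [83, 7833]), ('chicken', [9015])] -- ' chicken' stays a single
# in-context token, unlike the two-token out-of-context split [354, 5973].

Because the grouping is derived from the in-context encoding itself, the per-word token counts stay consistent with whatever the model actually sees, which is the property the out-of-context approach lacks.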

aalok-sathe commented 2 years ago

Fixed; see #19.