[Marked for deprecation. Please visit https://github.com/brain-score/language for the migrated project] Benchmarking of Language Models using Human Neural and Behavioral experiment data
Tokenized lengths of words can differ depending on the context, as illustrated by the example below. This is a possible source of inconsistency in our ANN `encode` method: the tokenization of a word is non-uniform across contexts, so we can't rely on the tokenized length of a word in isolation to calculate offsets.
```python
In [1]: from transformers import AutoModel, AutoTokenizer

In [2]: t = AutoTokenizer.from_pretrained('distilgpt2')

In [3]: t.decode([30119, 9015, 354, 5973])
Out[3]: 'past chickenchicken'

In [4]: t.decode(1)
Out[4]: '"'

In [5]: t.decode([354])
Out[5]: 'ch'

In [6]: t.decode([354, 5973])
Out[6]: 'chicken'

In [7]: [*map(t.decode, [354, 5973])]
Out[7]: ['ch', 'icken']

In [8]: t('chicken')
Out[8]: {'input_ids': [354, 5973], 'attention_mask': [1, 1]}

In [9]: t('tasty chicken')
Out[9]: {'input_ids': [83, 7833, 9015], 'attention_mask': [1, 1, 1]}

In [10]: t.decode([9015])
Out[10]: ' chicken'
```
We need to evaluate the situations in which this would be an issue, and whether it will affect the output of `encode`.
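One possible workaround is to align words to tokens by character offsets rather than by token counts, since "fast" Hugging Face tokenizers can return a per-token character span via `return_offsets_mapping=True`. The sketch below illustrates the idea; to keep it self-contained, the `input_ids` and character offsets for `"tasty chicken"` are hard-coded to match the transcript above rather than obtained from a live tokenizer call, and `tokens_for_word` is a hypothetical helper, not part of our codebase.

```python
# Sketch of offset-based word/token alignment. In practice the offsets
# would come from a fast tokenizer, e.g.:
#   enc = tokenizer("tasty chicken", return_offsets_mapping=True)
# Here they are hard-coded for illustration (assumed values).
text = "tasty chicken"
input_ids = [83, 7833, 9015]         # 't', 'asty', ' chicken'
offsets = [(0, 1), (1, 5), (5, 13)]  # character span covered by each token

def tokens_for_word(word, text, offsets):
    """Return indices of tokens whose character span overlaps the word."""
    start = text.index(word)
    end = start + len(word)
    return [i for i, (s, e) in enumerate(offsets) if s < end and e > start]

# The word's token span is recovered from character positions, so it does
# not matter that 'chicken' tokenizes differently in isolation.
print(tokens_for_word("chicken", text, offsets))  # -> [2]
print(tokens_for_word("tasty", text, offsets))    # -> [0, 1]
```

Because the alignment is computed against the tokenization of the full context, it sidesteps the problem that a word's standalone tokenized length (e.g. `'chicken'` → two tokens) disagrees with its in-context tokenization (`' chicken'` → one token).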