ivlcic opened this issue 9 months ago
Hey! Thanks for opening this issue. Would you like to dive into this and open a PR for a fix? It might be a known bug, and overflowing tokens are not supported on all slow tokenizers. The fast tokenizer probably has the right behaviour.
I don't know what the correct behaviour is. You can get the overflowing tokens from both tokenizers; it's just that the returned data structure needs to be more consistent. I prefer the fast tokenizer's behaviour, but the BatchEncoding returns None for overflowing_tokens, which is inconsistent with the API advertised in the reference documentation. I can try to fix this in late March, but I would appreciate your decision on which direction the API should go, since I'm not an expert on the transformers API.
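For reference, here is a condensed side-by-side sketch of the two access patterns being compared (the model name and text are taken from the snippets later in this thread; exact token ids may vary by version):

```python
from transformers import BertTokenizer, BertTokenizerFast

slow = BertTokenizer.from_pretrained("bert-base-uncased")
fast = BertTokenizerFast.from_pretrained("bert-base-uncased")
text = "hey this is jino, im just reading the api dont mind me"

kwargs = dict(max_length=6, truncation=True, padding="max_length",
              return_overflowing_tokens=True)
slow_out = slow(text, **kwargs)
fast_out = fast(text, **kwargs)

# Slow: overflow is a single flat list under the 'overflowing_tokens' key.
print(slow_out.get("overflowing_tokens"))

# Fast: overflow becomes extra rows in 'input_ids' and there is no
# 'overflowing_tokens' key, so .get() returns None; the mapping back to the
# original sample lives in 'overflow_to_sample_mapping'.
print(fast_out.get("overflowing_tokens"))          # None
print(fast_out.get("overflow_to_sample_mapping"))  # e.g. [0, 0, 0]
```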
@ArthurZucker @amyeroberts I'm interested in taking up this issue.
Just wanted to confirm something else as well: shouldn't the behavior of AutoTokenizer match the specific tokenizer? E.g. I tried this:
```python
from transformers import BertTokenizer, BertTokenizerFast, BatchEncoding, AutoTokenizer

n_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
f_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
text = "hey this is jino, im just reading the api dont mind me"
n_inputs: BatchEncoding = n_tok(text=text, add_special_tokens=True, max_length=6, truncation=True, padding='max_length', return_overflowing_tokens=True)
o = n_inputs.get("overflowing_tokens")
print(f'Overflowing {o}')
print(n_inputs)
```
Outputs (much different from using the BertTokenizer shown by nikola above):

```
Overflowing None
{
  "input_ids": [
    [101, 10930, 10930, 2023, 2003, 102],
    [101, 9743, 2080, 2054, 2015, 102],
    [101, 2039, 9152, 23033, 2015, 102]
  ],
  "token_type_ids": [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0]
  ],
  "attention_mask": [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1]
  ],
  "overflow_to_sample_mapping": [0, 0, 0]
}
```
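As a side note, with the fast-style output the overflow chunks can be regrouped by original sample via overflow_to_sample_mapping; a rough sketch, reusing n_inputs from the snippet above:

```python
from collections import defaultdict

# Group each overflow chunk under the index of the sample it came from.
# Here there is a single input text, so every chunk maps to sample 0.
chunks_by_sample = defaultdict(list)
for chunk_ids, sample_idx in zip(n_inputs["input_ids"],
                                 n_inputs["overflow_to_sample_mapping"]):
    chunks_by_sample[sample_idx].append(chunk_ids)

print(chunks_by_sample[0])  # all three max_length=6 chunks for the one input
```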
Yes, fast and slow tokenizers are supposed to give similar output (not the same format, but all the overflow etc. should match).
Hi @ArthurZucker / @amyeroberts, for the following code, when the slow tokenizer is used:
```python
from transformers import BertTokenizer, BertTokenizerFast, BatchEncoding, AutoTokenizer

n_tok = BertTokenizer.from_pretrained("bert-base-uncased")
f_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
text = "hey this is jino, im just reading the api dont mind me"
n_inputs: BatchEncoding = n_tok(text=text, add_special_tokens=True, max_length=6, truncation=True, padding='max_length', return_overflowing_tokens=True)
o = n_inputs.get("overflowing_tokens")
print(f'Overflowing {o}')
print(n_inputs)
```
The following is the output:
```
Overflowing [2080, 1010, 10047, 2074, 3752, 1996, 17928, 2123, 2102, 2568, 2033]
{'overflowing_tokens': [2080, 1010, 10047, 2074, 3752, 1996, 17928, 2123, 2102, 2568, 2033], 'num_truncated_tokens': 11, 'input_ids': [101, 4931, 2023, 2003, 9743, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}
```
What we notice is that this is different from the output of the fast tokenizer, where the overflowing tokens are split into multiple chunks of the maximum sequence length and appended to input_ids. Do we want the slow tokenizer to behave similarly to the fast one, or is this the expected behavior?
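For illustration only, a rough sketch of what making the slow output match the fast one could look like: cut the flat overflowing_tokens list into max_length-sized windows and re-add the special tokens. This is not the library's implementation, just an approximation of the fast tokenizer's chunking, reusing n_tok and n_inputs from the snippet above:

```python
# Illustrative re-chunking of the slow tokenizer's flat overflow list.
max_length = 6
content_len = max_length - 2  # space left after [CLS] and [SEP]

overflow = n_inputs["overflowing_tokens"]
chunks = [
    n_tok.build_inputs_with_special_tokens(overflow[i:i + content_len])
    for i in range(0, len(overflow), content_len)
]
print(chunks)
# Each chunk is now a [CLS] ... [SEP] sequence of at most max_length tokens,
# comparable to the extra rows the fast tokenizer appends to input_ids.
```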
In an optimal world, we want the slow to match the fast! I am not certain, in this specific case, which behaviour is "expected" or not 😅
Hi @ArthurZucker / @amyeroberts, this is the commit which changed the behavior, i.e. the overflowing tokens are not returned in the dictionary under "overflowing_tokens" key anymore. As people were already asking, what is the preferred approach? Options:
System Info
transformers version: 4.37.2
Who can help?
@ArthurZucker
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
For the `n_inputs['input_ids']` we get `[101, 7592, 2026, 2171, 2003, 102]`, and for the `f_inputs['input_ids']` we get `[[101, 7592, 2026, 2171, 2003, 102], [101, 24794, 1998, 1045, 2139, 102], [101, 8569, 2290, 19081, 2085, 102]]`. Outputs should be the same.
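A minimal sketch of the kind of comparison being described (the original report's input text is not reproduced above, so the text below is a stand-in; n_tok/f_tok mirror the snippets earlier in the thread):

```python
from transformers import BertTokenizer, BertTokenizerFast

# Stand-in text: the original report's input string is not shown in this thread.
text = "your long input text here"

n_tok = BertTokenizer.from_pretrained("bert-base-uncased")      # slow
f_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")  # fast

common = dict(max_length=6, truncation=True, padding="max_length",
              return_overflowing_tokens=True)
n_inputs = n_tok(text, **common)
f_inputs = f_tok(text, **common)

# Slow: a single truncated sequence, with overflow under 'overflowing_tokens'.
print(n_inputs["input_ids"])
# Fast: the overflow is chunked into additional max_length-sized sequences.
print(f_inputs["input_ids"])
```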