huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

BertTokenizer and BertTokenizerFast have different behavior when "return_overflowing_tokens" is requested #28900

Open ivlcic opened 8 months ago

ivlcic commented 8 months ago

System Info

Who can help?

@ArthurZucker

Information

Tasks

Reproduction

from transformers import BertTokenizer, BertTokenizerFast, BatchEncoding
n_tok = BertTokenizer.from_pretrained("bert-base-uncased")
f_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")

text = "hello my name is nikola and i debug transformers now"

n_inputs: BatchEncoding = n_tok(text=text, add_special_tokens=True, max_length=6, truncation=True, padding='max_length', return_overflowing_tokens=True)
o = n_inputs.get("overflowing_tokens")
print(f'Overflowing {o}')
print(n_inputs['input_ids'])

f_inputs: BatchEncoding = f_tok(text=text, add_special_tokens=True, max_length=6, truncation=True, padding='max_length', return_overflowing_tokens=True)
o = f_inputs.get("overflowing_tokens")
print(f'Overflowing {o}')
print(f_inputs['input_ids'])

Expected behavior

For n_inputs['input_ids'] we get [101, 7592, 2026, 2171, 2003, 102], while for f_inputs['input_ids'] we get [[101, 7592, 2026, 2171, 2003, 102], [101, 24794, 1998, 1045, 2139, 102], [101, 8569, 2290, 19081, 2085, 102]]. The two outputs should be the same.
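
For context, a minimal sketch (reusing n_tok, f_tok and the encodings from the reproduction above) that makes the structural difference concrete; the commented outputs are what I would expect, not something verified on every transformers version:

# Slow tokenizer: one truncated sequence plus a flat 'overflowing_tokens' list.
# Fast tokenizer: several max_length chunks plus 'overflow_to_sample_mapping'.
print(sorted(n_inputs.keys()))   # e.g. ['attention_mask', 'input_ids', 'num_truncated_tokens', 'overflowing_tokens', 'token_type_ids']
print(sorted(f_inputs.keys()))   # e.g. ['attention_mask', 'input_ids', 'overflow_to_sample_mapping', 'token_type_ids']

print(n_tok.decode(n_inputs['input_ids']))           # the single truncated window
print(n_tok.decode(n_inputs['overflowing_tokens']))  # everything that was cut off, as one flat list
for chunk in f_inputs['input_ids']:                  # fast: one row per max_length window
    print(f_tok.decode(chunk))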

ArthurZucker commented 8 months ago

Hey! Thanks for opening this issue. Would you like to dive into this and open a PR for a fix? It might be a known bug, and overflowing tokens are not supported on all slow tokenizers. The fast tokenizer's behaviour is probably the right one.

ivlcic commented 8 months ago

I don't know what the correct behaviour is. You can get the overflowing tokens from both tokenizers; it's just that the returned data structure needs to be more consistent. I prefer the fast tokenizer's behaviour, but its BatchEncoding returns None for overflowing_tokens, which is inconsistent with the advertised API in the reference documentation. I can try to fix this late in March, but I would appreciate your decision on which direction the API should go, since I'm not an expert on the transformers API.

JINO-ROHIT commented 6 months ago

@ArthurZucker @amyeroberts I'm interested in taking up this issue.

Just wanted to confirm something else as well: shouldn't the behavior of AutoTokenizer match the specific tokenizer? E.g., I tried this

from transformers import BertTokenizer, BertTokenizerFast, BatchEncoding, AutoTokenizer
n_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
f_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")

text = "hey this is jino, im just reading the api dont mind me"

n_inputs: BatchEncoding = n_tok(text=text, add_special_tokens=True, max_length=6, truncation=True, padding='max_length', return_overflowing_tokens=True)
o = n_inputs.get("overflowing_tokens")
print(f'Overflowing {o}')
print(n_inputs)

Outputs (much different from using the BertTokenizer shown by nikola above):

Overflowing None
{
  "input_ids": [
    [101, 10930, 10930, 2023, 2003, 102],
    [101, 9743, 2080, 2054, 2015, 102],
    [101, 2039, 9152, 23033, 2015, 102]
  ],
  "token_type_ids": [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0]
  ],
  "attention_mask": [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1]
  ],
  "overflow_to_sample_mapping": [0, 0, 0]
}
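
Worth noting: AutoTokenizer.from_pretrained returns the fast (Rust-backed) tokenizer by default when one is available (use_fast defaults to True), which is why the output above matches BertTokenizerFast rather than BertTokenizer. A quick sketch to confirm which class you actually got:

from transformers import AutoTokenizer

auto_default = AutoTokenizer.from_pretrained("bert-base-uncased")               # use_fast=True by default
auto_slow = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)  # force the slow Python tokenizer

print(type(auto_default).__name__)  # expected: BertTokenizerFast
print(type(auto_slow).__name__)     # expected: BertTokenizer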

ArthurZucker commented 5 months ago

Yes, fast and slow tokenizers are supposed to give a similar output (not necessarily the same format, but all the overflow information should be there).
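
As one concrete reading of "similar output", here is a small sketch (reusing n_inputs and f_inputs from the first reproduction, assuming no stride) that strips special tokens from both formats and checks that they cover the same token ids:

# Both formats should account for the same tokens, just packaged differently.
special = set(n_tok.all_special_ids)

slow_ids = [i for i in n_inputs["input_ids"] if i not in special]
slow_ids += [i for i in n_inputs.get("overflowing_tokens", []) if i not in special]

fast_ids = [i for chunk in f_inputs["input_ids"] for i in chunk if i not in special]

print(slow_ids == fast_ids)  # expected True if both tokenizers preserve all overflow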

bayllama commented 3 months ago

Hi @ArthurZucker / @amyeroberts, for the following code, when the slow tokenizer is used:

from transformers import BertTokenizer, BertTokenizerFast, BatchEncoding, AutoTokenizer
n_tok = BertTokenizer.from_pretrained("bert-base-uncased")
f_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")

text = "hey this is jino, im just reading the api dont mind me"

n_inputs: BatchEncoding = n_tok(text=text, add_special_tokens=True, max_length=6, truncation=True, padding='max_length', return_overflowing_tokens=True)
o = n_inputs.get("overflowing_tokens")
print(f'Overflowing {o}')
print(n_inputs)

The following is the output:

Overflowing [2080, 1010, 10047, 2074, 3752, 1996, 17928, 2123, 2102, 2568, 2033]
{'overflowing_tokens': [2080, 1010, 10047, 2074, 3752, 1996, 17928, 2123, 2102, 2568, 2033], 'num_truncated_tokens': 11, 'input_ids': [101, 4931, 2023, 2003, 9743, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}

What we notice is that this differs from the output of the fast tokenizer, where the overflowing tokens are split into multiple batches of max sequence length and appended to input_ids. Do we want the slow tokenizer to behave like the fast one as well, or is this the expected behavior?
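
Until the intended behavior is settled, one rough user-side workaround (an illustration only, not a proposed fix) is to re-chunk the slow tokenizer's flat overflow list into max_length windows so it resembles the fast layout. This assumes a single sequence, no stride, and the usual [CLS] ... [SEP] [PAD] pattern:

def rechunk_slow_output(tok, enc, max_length):
    # Re-pack the slow tokenizer's flat 'overflowing_tokens' into windows that
    # mimic the fast tokenizer's batched input_ids. Illustrative only.
    body = max_length - 2                              # room left after [CLS] and [SEP]
    chunks = [list(enc["input_ids"])]                  # first window as the slow tokenizer returned it
    overflow = enc.get("overflowing_tokens", [])
    for start in range(0, len(overflow), body):
        ids = [tok.cls_token_id] + overflow[start:start + body] + [tok.sep_token_id]
        ids += [tok.pad_token_id] * (max_length - len(ids))   # pad the final, shorter window
        chunks.append(ids)
    return chunks

for ids in rechunk_slow_output(n_tok, n_inputs, max_length=6):
    print(ids)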

ArthurZucker commented 2 months ago

In an optimal world, we want the slow to match the fast! I am not certain, in this specific case, which behavior is "expected" or not 😅