Closed scruel closed 2 months ago
We shall copy https://github.com/huggingface/transformers/blob/ee2a3400f2a7038a23b83a39c5d0e24f7f699561/src/transformers/tokenization_utils_base.py#L742-L745 above https://github.com/huggingface/transformers/blob/ee2a3400f2a7038a23b83a39c5d0e24f7f699561/src/transformers/tokenization_utils_base.py#L693-L694
Not sure about the use case, but that is indeed a bug. Would you like to open a PR to make sure that:
from transformers import BatchEncoding
BatchEncoding({'input_ids': [1]}, prepend_batch_axis=True)
{'input_ids': [[1]]}
is respected?
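For context, the suggested reordering amounts to applying the batch-axis wrapping before the early return that fires when no tensor type is requested. A rough standalone sketch (simplified illustrative function, not the actual transformers source; the name and signature are made up for this example):

```python
# Illustrative sketch of the proposed fix: wrap each value in a batch
# dimension *before* checking tensor_type, so plain-list outputs also
# get the prepended axis. Not the real transformers implementation.
def convert_to_tensors_fixed(data, tensor_type=None, prepend_batch_axis=False):
    if prepend_batch_axis:
        data = {key: [value] for key, value in data.items()}
    if tensor_type is None:
        # Lists now carry the batch axis too, instead of being returned untouched.
        return data
    # Tensor conversion would happen here in the real implementation.
    return data

print(convert_to_tensors_fixed({"input_ids": [1]}, prepend_batch_axis=True))
# -> {'input_ids': [[1]]}
```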
Hi! Can I take this?
@mig-mfreitas Sure!
No problem! In that case, I'd love to help, as it is my first contribution. I appreciate your help on reviewing it.
Hi! Can I take this?
Oh, sorry, I've been a little busy recently, so this still hasn't been fixed. I do plan to create a PR for this this weekend, but yes, you can take it if you wish.
@scruel @mig-mfreitas For reference, we attribute work based on PRs opened rather than claims on issues, i.e. the first PR opened gets priority. This prevents issues from going stale without a PR.
@amyeroberts OK, I expect we can resolve this issue within the week, so if @mig-mfreitas hasn't created a PR by this Saturday, I will take it back and create one on Sunday.
No problem, will start working right away.
Using @scruel's solution the bug is fixed, but consequently a lot of tests fail because they expect an output such as [1, 2, 3] but get [[1, 2, 3]] instead. This seems related to the PreTrainedTokenizer class (and its subclasses) setting prepend_batch_axis to True by default in the encode_plus method: scenarios where neither tensor_type nor prepend_batch_axis is explicitly defined now return lists with a prepended batch axis when they previously didn't (since tensor_type defaults to None while prepend_batch_axis is set to True). Given that dozens of tests expect plain Python lists with no batch axis in these scenarios, is it viable to implement this update? I will continue to explore this issue and would be grateful for your insight!
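To make the regression concrete, here is a minimal, purely illustrative sketch (hypothetical function, not the actual encode_plus code) of why honoring the batch axis for plain lists changes single-sequence outputs:

```python
# Hypothetical sketch: encode_plus effectively passes prepend_batch_axis=True,
# so if BatchEncoding started honoring it for plain lists, every unbatched
# call would suddenly return nested lists.
def encode_plus_ids(ids, batch_axis_applies_to_lists):
    prepend_batch_axis = True  # what encode_plus passes by default
    if batch_axis_applies_to_lists and prepend_batch_axis:
        return [ids]
    return ids

print(encode_plus_ids([1, 2, 3], False))  # current behavior: [1, 2, 3]
print(encode_plus_ids([1, 2, 3], True))   # after the change: [[1, 2, 3]]
```

This is exactly the shape mismatch the failing tests are reporting.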
Hi! Is this still open? I'd also like to help on this.
Just linked the PR, sorry. I don't think it's going to be an easy fix given the number of failing tests.
I think we should replace the current part of the code with this one:
Hi, is this issue still open? It's my first contribution and I would love to help.
It's still open, but as I mentioned, the fix is very involved.
Hi!! Can I fix it?
Feel free to open a PR and link this issue!
Hey @ArthurZucker and everyone else, I don't think this is a bug or an issue; it looks like the expected output to me.
I dug into the code a bit, both the tokenizer __call__ and the BatchEncoding class. The BatchEncoding docstring states that prepend_batch_axis is applicable when the return type is a tensor; otherwise this parameter is not applicable. Here is the doc, in the __init__:
https://github.com/huggingface/transformers/blob/e28784f821b14e84b96ce94c5d8a23b72741cf2d/src/transformers/tokenization_utils_base.py#L227
When I jump to this method: https://github.com/huggingface/transformers/blob/e28784f821b14e84b96ce94c5d8a23b72741cf2d/src/transformers/tokenization_utils_base.py#L684-L698
So, clearly, if there is no tensor return type, this parameter is not supposed to do anything.
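The control flow can be sketched with a tiny standalone class (a simplified stand-in, not the real BatchEncoding; MiniBatchEncoding and its internals are illustrative only). The early return when tensor_type is None means prepend_batch_axis is never consulted for plain lists:

```python
# Simplified sketch of how BatchEncoding.convert_to_tensors short-circuits
# when tensor_type is None, so prepend_batch_axis has no effect on lists.
# This mirrors the control flow only; the real class converts to framework
# tensors, which we skip here to stay dependency-free.
class MiniBatchEncoding(dict):
    def __init__(self, data, tensor_type=None, prepend_batch_axis=False):
        super().__init__(data)
        self.convert_to_tensors(tensor_type, prepend_batch_axis)

    def convert_to_tensors(self, tensor_type, prepend_batch_axis):
        if tensor_type is None:
            # Early return: prepend_batch_axis is ignored for plain lists.
            return self
        for key, value in self.items():
            if prepend_batch_axis:
                value = [value]  # add the batch dimension
            # The real class would call e.g. torch.tensor(value) here.
            self[key] = value
        return self

print(MiniBatchEncoding({"input_ids": [1]}, prepend_batch_axis=True))
# -> {'input_ids': [1]}   (no tensor_type, axis ignored)
print(MiniBatchEncoding({"input_ids": [1]}, tensor_type="pt", prepend_batch_axis=True))
# -> {'input_ids': [[1]]} (tensor path, axis applied)
```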
When I test the code:
>>> from transformers import AutoTokenizer, BatchEncoding
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # checkpoint assumed from the token ids below
>>> sentence = "I go there"
>>> encoded = tokenizer(sentence)
>>> encoded
{'input_ids': [101, 1045, 2175, 2045, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}
>>> b_encoded = BatchEncoding(encoded, prepend_batch_axis=True)
>>> b_encoded
{'input_ids': [101, 1045, 2175, 2045, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]} # No batch axis, expected
>>> b_encoded = BatchEncoding(encoded, prepend_batch_axis=True, tensor_type="pt")
>>> b_encoded
{'input_ids': tensor([[ 101, 1045, 2175, 2045,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])} # Now batch axis!
Also, for a single-token sequence:
>>> sentence = "I"
>>> encoded = tokenizer(sentence, add_special_tokens=False)
>>> encoded
{'input_ids': [1045], 'token_type_ids': [0], 'attention_mask': [1]}
>>> b_encoded = BatchEncoding(encoded, prepend_batch_axis=True)
>>> b_encoded
{'input_ids': [1045], 'token_type_ids': [0], 'attention_mask': [1]} # As expected, no batch axis
>>> b_encoded = BatchEncoding(encoded, prepend_batch_axis=True, tensor_type="pt")
>>> b_encoded
{'input_ids': tensor([[1045]]), 'token_type_ids': tensor([[0]]), 'attention_mask': tensor([[1]])} # As expected, batch axis is here
As I said, this looks like the expected behavior to me. What do you think? :relaxed: Am I getting it wrong?
Yeah, sounds good to me! In that case, it would make sense to update the doc to put this in bold. Thanks for digging!
Purrrfect, I'll make a small doc PR here soon!
System Info
transformers version: 4.35.2

Who can help?
@ArthurZucker

Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction

Expected behavior
Output should be:
{'input_ids': [[1]]}