huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

`BatchEncoding` won't prepend batch axis to python list, even when `prepend_batch_axis` is `True` #28882

Closed scruel closed 2 months ago

scruel commented 8 months ago

System Info

Who can help?

@ArthurZucker

Information

Tasks

Reproduction

>>> from transformers import BatchEncoding
>>> BatchEncoding({'input_ids': [1]}, prepend_batch_axis=True)
{'input_ids': [1]}

Expected behavior

Output should be:

{'input_ids': [[1]]}
scruel commented 8 months ago

We could copy https://github.com/huggingface/transformers/blob/ee2a3400f2a7038a23b83a39c5d0e24f7f699561/src/transformers/tokenization_utils_base.py#L742-L745 to just above https://github.com/huggingface/transformers/blob/ee2a3400f2a7038a23b83a39c5d0e24f7f699561/src/transformers/tokenization_utils_base.py#L693-L694, as sketched below.
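Roughly, the idea would be something like this (a hypothetical, simplified sketch of `BatchEncoding.convert_to_tensors`, not the actual diff; the real method contains more logic and `self.data` refers to the BatchEncoding internals):

def convert_to_tensors(self, tensor_type=None, prepend_batch_axis=False):
    # Honour prepend_batch_axis even when no tensor conversion is requested,
    # so plain Python lists also get wrapped in a batch dimension.
    if tensor_type is None:
        if prepend_batch_axis:
            self.data = {key: [value] for key, value in self.data.items()}
        return self
    # ... existing tensor-conversion logic (which already prepends the batch
    # axis in the tensor case) continues here ...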

ArthurZucker commented 8 months ago

Not sure about the usecase, but that is indeed a bug. Would you like to open a PR to make sure that:

>>> from transformers import BatchEncoding
>>> BatchEncoding({'input_ids': [1]}, prepend_batch_axis=True)
{'input_ids': [[1]]}

is respected?

mig-mfreitas commented 7 months ago

Hi! Can I take this?

amyeroberts commented 7 months ago

@mig-mfreitas Sure!

mig-mfreitas commented 7 months ago

No problem! In that case, I'd love to help, as it would be my first contribution. I'd appreciate your help reviewing it.

scruel commented 7 months ago

> Hi! Can I take this?

Oh, sorry, I have been a little busy recently, so this still hasn't been fixed. I was planning to create a PR for this this weekend, but yes, you can take it if you wish.

amyeroberts commented 7 months ago

@scruel @mig-mfreitas For reference, we attribute work based on PRs opened rather than claims on issues, i.e. the first PR opened gets priority. This prevents issues from going stale without a PR.

scruel commented 7 months ago

@amyeroberts OK, I expect we can resolve this issue within the week, so if @mig-mfreitas hasn't created a PR by this Saturday, I'll take it back and create a PR on Sunday. 🤗

mig-mfreitas commented 7 months ago

No problem, will start working right away. 👍

mig-mfreitas commented 7 months ago

Using @scruel's solution fixes the bug, but as a consequence a lot of tests fail because they expect an output such as [1, 2, 3] but get [[1, 2, 3]] instead. This seems related to the PreTrainedTokenizer class (and its subclasses) setting prepend_batch_axis to True by default in the encode_plus method: scenarios where neither tensor_type nor prepend_batch_axis is explicitly set will now return lists with a prepended batch axis when they previously didn't (since tensor_type defaults to None and prepend_batch_axis is set to True). If dozens of tests expect plain Python lists with no batch axis in these scenarios, is it viable to implement this update? I will continue to explore this issue and would be grateful for your insight! See the illustration below.
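To make the shape change concrete, a hypothetical before/after illustration (assuming a bert-base-uncased tokenizer loaded via AutoTokenizer.from_pretrained; the token IDs match the example further down this thread):

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> tokenizer.encode_plus("I go there")["input_ids"]   # current behaviour, what the existing tests expect
[101, 1045, 2175, 2045, 102]
>>> tokenizer.encode_plus("I go there")["input_ids"]   # with the proposed change applied
[[101, 1045, 2175, 2045, 102]]  # batch axis now prepended, so those tests would fail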

noobjam commented 7 months ago

Hi! Is this still open? I'd also like to help with this.

ArthurZucker commented 7 months ago

Just linked the PR, sorry. I don't think it's going to be an easy fix given the number of failing tests. 😅

pratyakshagarwal commented 6 months ago

[Screenshot: 2024-04-06 002024] I think we should replace the current part of the code with the one shown in this screenshot.

LavGupta01 commented 6 months ago

Hi, is this issue still open? It's my first contribution and I would love to help.

ArthurZucker commented 6 months ago

It's still open, but as I mentioned, the fix is very involved.

escalmat commented 4 months ago

Hi! Can I fix it?

ArthurZucker commented 2 months ago

Feel free to open a PR and link this issue!

DuyguA commented 2 months ago

Hey @ArthurZucker and everyone else, I don't think this is a bug or an issue; it looks like the expected output to me.

I dug into the code a bit, both the tokenizer __call__ and the BatchEncoding class. The BatchEncoding docstring states that prepend_batch_axis is applicable when the return type is a tensor; otherwise this param is not applicable. Here is the doc:

https://github.com/huggingface/transformers/blob/e28784f821b14e84b96ce94c5d8a23b72741cf2d/src/transformers/tokenization_utils_base.py#L200-L201

In the __init__: https://github.com/huggingface/transformers/blob/e28784f821b14e84b96ce94c5d8a23b72741cf2d/src/transformers/tokenization_utils_base.py#L227

When I jump to this method: https://github.com/huggingface/transformers/blob/e28784f821b14e84b96ce94c5d8a23b72741cf2d/src/transformers/tokenization_utils_base.py#L684-L698

So, clearly, if no tensor type is requested, this parameter is not supposed to do anything.

When I test the code:

>>> from transformers import AutoTokenizer, BatchEncoding
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # tokenizer setup was omitted above; bert-base-uncased assumed from the IDs below
>>> sentence = "I go there"
>>> encoded = tokenizer(sentence)
>>> encoded
{'input_ids': [101, 1045, 2175, 2045, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}
>>> b_encoded = BatchEncoding(encoded, prepend_batch_axis=True)
>>> b_encoded
{'input_ids': [101, 1045, 2175, 2045, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}  # No batch axis, expected
>>> b_encoded = BatchEncoding(encoded, prepend_batch_axis=True, tensor_type="pt")
>>> b_encoded
{'input_ids': tensor([[ 101, 1045, 2175, 2045,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}  # Now batch axis!

also for only one token sequence:

>>> sentence = "I"
>>> encoded  = tokenizer(sentence, add_special_tokens=False)
>>> encoded
{'input_ids': [1045], 'token_type_ids': [0], 'attention_mask': [1]}
>>> b_encoded = BatchEncoding(encoded, prepend_batch_axis=True)
>>> b_encoded
{'input_ids': [1045], 'token_type_ids': [0], 'attention_mask': [1]}  # As expected, no batch axis
>>> b_encoded = BatchEncoding(encoded, prepend_batch_axis=True, tensor_type="pt")
>>> b_encoded
{'input_ids': tensor([[1045]]), 'token_type_ids': tensor([[0]]), 'attention_mask': tensor([[1]])}  # As expected, batch axis is here

As I said, this looks like the expected behavior to me. What do you think? :relaxed: Am I getting it wrong?

ArthurZucker commented 2 months ago

Yeah, sounds good to me! In that case, it would make sense to update the doc to make this bold. 🤗 Thanks for digging!

DuyguA commented 2 months ago

> Yeah, sounds good to me! In that case, it would make sense to update the doc to make this bold. 🤗 Thanks for digging!

Purrrfect, I'll make a small doc PR here soon!
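Something along these lines, perhaps (hypothetical wording, not the final PR text; it just adds the bold note to the existing prepend_batch_axis description):

prepend_batch_axis (`bool`, *optional*, defaults to `False`):
    Whether or not to add a batch axis when converting to tensors (see `tensor_type` above).
    **Note: this only has an effect when `tensor_type` is set; plain Python lists are left unchanged.**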