Tokenizer `encode/decode` methods are inconsistent, TypeError: argument 'ids': 'list' object cannot be interpreted as an integer

scruel commented 9 months ago

System Info

transformers version: 4.35.2
Platform: Linux-6.5.0-14-generic-x86_64-with-glibc2.35
Python version: 3.11.6
Huggingface_hub version: 0.20.2
Safetensors version: 0.4.1
Accelerate version: not installed
Accelerate config: not found
PyTorch version (GPU?): 2.0.1+cu117 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

Run the following code:

from transformers import AutoTokenizer

text = "test"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
encoded = tokenizer.encode(text, return_tensors='pt')
result_text = tokenizer.decode(encoded, skip_special_tokens=True)
print(text)

Will raise exception:

Traceback (most recent call last):
  File "main.py", line 8, in <module>
    tokenizer.decode(encoded, skip_special_tokens=True)
  File "/home/scruel/mambaforge/envs/vae/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3748, in decode
    return self._decode(
           ^^^^^^^^^^^^^
  File "/home/scruel/mambaforge/envs/vae/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 625, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: argument 'ids': 'list' object cannot be interpreted as an integer

Expected behavior

Should be able to print the original text "test", rather than raise an exception(TypeError).

scruel commented 9 months ago

https://github.com/huggingface/transformers/blob/83f9196cc44a612ef2bd5a0f721d08cb24885c1f/src/transformers/tokenization_utils_fast.py#L596-L605

Why only remove the leading batch axis while return tensor is None? I mean consider the annotation of text parameter of _encode_plus method, we won't need batch axis at all, so why not remove it for all return tensor types?

ArthurZucker commented 9 months ago

Fully agree with you. Encode can take batched, decode only a single. batch_decode is the one that support a batch of inputs. It has been that way for a WHILE now. I plan to deprecate some of this and simplify the API in favor of just encode decode that can support batches and singles.

Would you like to work on a fix for this? 🤗

scruel commented 9 months ago

Sure, already working on it, I will create a PR later today.

scruel commented 9 months ago

Hi @ArthurZucker,

Should we keep related APIs as the same as huggingface/tokenizers lib? Consider if we only have one (de)encode method which not just process single but batches, it will be impossible to distinguish the difference between List[TextInput] and PreTokenizedInput with its own power without changing the logic or adding extra parameters, consider they are just the same thing if we check their type, so I think:
- We add a batch parameter to the (de)encode methods, and keeping/adding the (de)encode_batch methods. This way won't "solve" this issue, cos decode won't be able to know what's the method generated the input, consider we won't add extra info fields to encode method, however, it should be acceptable if we well documented it, and such scenario will only happen if users misuse the methods (e.g., they use decode method to decode the encode_batch returns). (WIP on this without thinking about the backward compatibilities).
- We change the logic of the function, like if the method receives list[str] type input, we only treat it as batched strings, rather than pre-tokenized single string. This way will solve this issue, but we will lose the ability to process either list[TextInput] or PreTokenizedInput with encode method.
BTW, I wonder to ask why you decided to only remove list-object's leading batch axis in encode method? May it can provide some convince for users while using it?
I prefer to use Enums rather than use "magic strings" (e.g., utils/generic/infer_framework_XXX) everywhere in the project (for inner methods, fine for API to provide user convenience), during the refracting period, I found lots of the legacy code that used "magic strings", even for condition tests.
Can we try to use abc stdlib to define tokenizer base class? (Just a nit, for some cases we won't use it because it is not suitable for some classes, like the Dataset class in PyTorch project)
Consider we already have:
```
if is_flax_available():
import jax.numpy as jnp
```
Then why we have import tensorflow everywhere rather than do the same at the top of the source code via is_tf_available?
Consider replacing those monkey patches (e.g., logger) into subclasses/wrapper classes?
Wanna to rename the following classes (won't do now coz lots of files will be changed):
- BatchEncoding to EntryEncoding: name "Batch" is not suitable for the encode method, and the result will not always be batched.
- TensorType to TArrayType: added new Python SEQUENCE enum value.
move as_tensor/is_tensor to generic.py

ArthurZucker commented 8 months ago

Sorry for such a big change in the API I'd rather take care of it, I was more referring to a small PR that supports for now decoding a batch of inputs with decode!

About all the points you mentioned, for me the direction of the library is a bit different. Appart from removing calls to TF everywhere which indeed should be protected the same way as flax!

scruel commented 8 months ago

Sorry for such a big change in the API I'd rather take care of it, I was more referring to a small PR that supports for now decoding a batch of inputs with decode!

decode itself can't handle such tasks well, as I mentioned before:

it will be impossible to distinguish the difference between List[TextInput] and PreTokenizedInput with its own power without changing the logic or adding extra parameters

So yes, it becomes a "big change", since I already created a PR, you may take care of this based on it, no need to sorry 🤗

the direction of the library is a bit different

Can you explain this more? I think most of the points are just about the code style, so it won't affect the direction of anything, but may improve the maintainability :)

Appart from removing calls to TF everywhere which indeed should be protected the same way as flax!

Cool, I'm glad that you also think so, we'd better consider this more, consider having import statement for frequently used functions is definitely a bad idea coz the function has to look up those libraries every time when it gets called, even Python will only do one true import for one library (heavy operation, so we may also need to consider, when we should have the necessary true import).

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

huggingface / transformers