Open scruel opened 9 months ago
Why is the leading batch axis only removed when `return_tensors` is None? I mean, considering the annotation of the `text` parameter of the `_encode_plus` method, we won't need the batch axis at all, so why not remove it for all return tensor types?
Fully agree with you. `encode` can take a batch; `decode` takes only a single input. `batch_decode` is the one that supports a batch of inputs. It has been that way for a while now. I plan to deprecate some of this and simplify the API in favor of just `encode`/`decode` supporting both batches and single inputs.
Would you like to work on a fix for this? 🤗
Sure, already working on it, I will create a PR later today.
Hi @ArthurZucker,
Should we keep the related APIs the same as the huggingface/tokenizers lib? If we only have one `encode`/`decode` method that processes not just single inputs but also batches, it will be impossible to distinguish `List[TextInput]` from `PreTokenizedInput` on its own, without changing the logic or adding extra parameters, since they are exactly the same thing if we check their types. So I think there are two options:

1. Add a `batch` parameter to the `encode`/`decode` methods, and keep/add the `encode_batch`/`decode_batch` methods. This won't fully "solve" the issue, because `decode` still can't know which method generated its input (assuming we don't add extra info fields to the `encode` output); however, it should be acceptable if we document it well, and the scenario only arises if users misuse the methods (e.g., using `decode` on what `encode_batch` returns). (WIP on this, without considering backward compatibility.)
2. For `list[str]` input, only treat it as batched strings, never as a pre-tokenized single string. This would solve the issue, but we would lose the ability to process both `list[TextInput]` and `PreTokenizedInput` with the `encode` method.

BTW, I want to ask why you decided to only remove a list object's leading batch axis in the `encode` method. Does it provide some convenience for users?
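The ambiguity described above can be demonstrated in plain Python; a batch of texts and a single pre-tokenized text have exactly the same runtime type:

```python
# A batch of two texts vs. one pre-tokenized text: both are list[str],
# so a type check alone cannot tell which one the caller meant.
batch_of_texts = ["hello world", "goodbye world"]  # List[TextInput]
pre_tokenized = ["hello", "world"]                 # PreTokenizedInput

assert type(batch_of_texts) is type(pre_tokenized) is list
assert all(isinstance(x, str) for x in batch_of_texts + pre_tokenized)
print("indistinguishable by type alone")
```

This is why the comment argues a unified method would need either an extra parameter or a documented convention to resolve the two cases.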
Some other suggestions from working on this:

1. Use `Enum`s rather than "magic strings" (e.g., `utils/generic/infer_framework_XXX`) everywhere in the project, at least for inner methods (magic strings are fine in the public API for user convenience). During the refactoring I found lots of legacy code that uses "magic strings", even in condition tests.
2. Why not use the `abc` stdlib module to define the tokenizer base class? (Just a nit; in some cases it is not suitable, like the `Dataset` class in the PyTorch project.)
3. Flax is imported behind a guard:

   ```python
   if is_flax_available():
       import jax.numpy as jnp
   ```

   Then why do we have `import tensorflow` everywhere, rather than doing the same at the top of the source file via `is_tf_available`?
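The guarded-import pattern referred to here can be sketched with the stdlib; this stand-in `is_tf_available` is an assumption that only mirrors the rough behavior of transformers' availability helpers:

```python
import importlib.util

# Simplified stand-in for an availability helper: check whether the
# package could be imported, without actually importing it yet.
def is_tf_available() -> bool:
    return importlib.util.find_spec("tensorflow") is not None

# Done once at module load, instead of `import tensorflow` inside functions:
if is_tf_available():
    import tensorflow as tf
```

The guard keeps the heavy import out of environments where TensorFlow is absent, while still paying the import cost only once per process.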
4. Rename `BatchEncoding` to `EntryEncoding`: the name "Batch" is not suitable for the `encode` method, since the result will not always be batched.
5. Rename `TensorType` to `TArrayType`: added a new Python `SEQUENCE` enum value.
6. Move `as_tensor`/`is_tensor` to `generic.py`.

Sorry, for such a big change in the API I'd rather take care of it. I was more referring to a small PR that, for now, supports decoding a batch of inputs with `decode`!
About all the points you mentioned: for me the direction of the library is a bit different, apart from removing the calls to TF everywhere, which indeed should be protected the same way as Flax!
> Sorry for such a big change in the API I'd rather take care of it, I was more referring to a small PR that supports for now decoding a batch of inputs with decode!
`decode` itself can't handle such tasks well, as I mentioned before:

> it will be impossible to distinguish the difference between `List[TextInput]` and `PreTokenizedInput` with its own power without changing the logic or adding extra parameters

So yes, it becomes a "big change". Since I already created a PR, you may take care of this based on it, no need to be sorry 🤗
> the direction of the library is a bit different

Can you explain this more? I think most of the points are just about code style, so they won't affect the direction of anything, but they may improve maintainability :)
> Apart from removing calls to TF everywhere which indeed should be protected the same way as flax!

Cool, I'm glad you also think so. We should consider this carefully: having an `import` statement inside frequently used functions is a bad idea, because the function has to look the module up every time it gets called, even though Python performs only one true import per library (a heavy operation, so we also need to consider when that one true import should happen).
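The cost model described above can be checked empirically: a repeated `import` inside a function is served from the `sys.modules` cache, so it is not a full re-import, but it still runs the import machinery on every call, unlike a plain name lookup. A small sketch:

```python
import sys

import json as _json  # the one true import, done at module load

def inner_import():
    import json  # re-resolved through sys.modules on every call
    return json

def top_level():
    return _json  # plain name lookup, no import machinery

assert "json" in sys.modules          # cached after the first import
assert inner_import() is top_level()  # both see the same module object
```

So the suggestion is to hoist imports of frequently used modules to module level (behind availability guards where needed), rather than importing inside hot functions.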
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
`transformers` version: 4.35.2

Who can help?
@ArthurZucker
Information
Tasks
An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Run the following code:
Will raise exception:
Expected behavior
Should be able to print the original text `"test"`, rather than raising an exception (`TypeError`).
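The original reproduction snippet is not preserved in this thread, but the failure mode it describes (decoding an `encode` result that kept its leading batch axis) can be illustrated with a toy decoder. This is a sketch under that assumption, not the actual reproduction code:

```python
import numpy as np

def toy_decode(token_ids):
    # mirrors a decoder that converts each id with int(); a still-batched
    # (1, n) array makes int() receive a whole row and raise TypeError
    return " ".join(str(int(t)) for t in token_ids)

encoded = np.array([[101, 2774, 102]])  # as if return_tensors="np" kept the batch axis

try:
    toy_decode(encoded)
    raised = False
except TypeError:
    raised = True

assert raised                                     # decoding the batched array fails
assert toy_decode(encoded[0]) == "101 2774 102"   # works once the axis is stripped
```

Stripping the leading axis before decoding (or having `decode` accept batched input) is exactly the behavior the thread is debating.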