Closed: wmathor closed this issue 2 years ago
I'm also slightly confused by the warning. If I'm reading the source code correctly, this is really only a warning and is not triggered by an actual issue. It's slightly confusing because you get it many times when you tokenize a lot of text, and it feels like something is wrong. Maybe it could be emitted just once? https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_utils_base.py
I tried modifying the source code of tokenization_utils_base.py and deleting the warning code segment. It works!
Set the verbosity level as follows:
transformers.logging.set_verbosity_error()
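Spelled out as a runnable snippet (nothing beyond the one-liner above):

import transformers

# Suppress everything below ERROR from the transformers library,
# including this tokenizer warning.
transformers.logging.set_verbosity_error()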
thank you so much!
I'm using the Trainer class with a dataset that I stream (as it is too large) and perform on-the-fly tokenization (i.e., each mini-batch is passed to the tokenizer). Sadly this warning appears with every mini-batch, which is quite annoying.
Sadly, transformers.logging.set_verbosity_error() doesn't work in the multi-process setup (I use the Trainer with multiple GPUs). It also removes logs from the Trainer that are quite relevant and interesting.
It would be great if this warning could be removed, or changed so that it is only printed once.
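One workaround for the multi-process case (my own assumption, not something suggested above) is the TRANSFORMERS_VERBOSITY environment variable, which transformers reads at import time, so spawned worker processes inherit it:

import os

# Must be set before transformers is imported, e.g. at the very top of the
# training script, or in the shell: TRANSFORMERS_VERBOSITY=error python train.py
os.environ["TRANSFORMERS_VERBOSITY"] = "error"

import transformers  # picks up the environment variable at import time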
Indeed, it could be done by using the following variable:
You can see an example of it used here:
Feel free to open a PR to offer this change!
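For readers without the links above, the idea is the standard warn-once pattern; a minimal sketch with hypothetical names (not the actual transformers internals):

import logging

logger = logging.getLogger(__name__)
_seen_warnings = set()  # hypothetical module-level registry of emitted warnings

def warn_once(key: str, message: str) -> None:
    # Emit each distinct warning at most once per process.
    if key not in _seen_warnings:
        _seen_warnings.add(key)
        logger.warning(message)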
Thanks for the pointer @LysandreJik
Will create a PR on this
This warning can be pretty noisy when your batch size is low and the dataset is big. It would be nice to see this warning only once, as @nreimers mentioned.
For anyone coming from Google who cannot suppress the warning with eduOS's solution, the nuclear option is to disable all warnings in Python like this:
import logging
logging.disable(logging.WARNING)
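Note that this silences WARNING and below for every library in the process. If needed, logging can be restored later with the standard-library call:

import logging

logging.disable(logging.NOTSET)  # lift the global disable set above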
I'm just confused by this sentence in the warning:
"So the returned list will always be empty even if some tokens have been removed." Does it mean I will get an empty returned list?
I upvote the last message. I started getting this warning when using the tokenizer with the text_pair (context) argument.
After that I tried to decode the messages and... got about 2/3 of them empty. Why? How can I prevent that? It was fine without the text_pair argument.
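For reference, the call pattern being described looks roughly like the sketch below (hypothetical checkpoint and strings, not from the thread). Note that the "returned list" in the warning refers to the list of overflowing tokens, not to the encoded input itself.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
enc = tok(
    "What is the capital of France?",  # first sequence
    text_pair="Paris has been the capital of France for centuries.",  # context
    truncation="longest_first",
    max_length=12,  # deliberately small so truncation actually removes tokens
)
# Once truncation removes tokens from a sequence pair under 'longest_first',
# the tokenizer typically logs the warning discussed in this thread. The
# encoded ids in enc["input_ids"] are NOT empty; only the list of
# overflowing tokens is.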
Hi @nreimers @LysandreJik and others. This issue is still open, and I found no respective deprecation warning in main. One quick fix that doesn't affect multiprocessing or global logging (at least not permanently) is to set the logging level only before tokenization and restore it afterwards. Yes, it is frustrating, but it seems to work, e.g.:
import transformers
from transformers import BatchEncoding

# tok is assumed to be an already loaded tokenizer (e.g. via
# AutoTokenizer.from_pretrained(...)) and input_list a list of strings
# or text/text_pair tuples.
old_level = transformers.logging.get_verbosity()
transformers.logging.set_verbosity_error()
res: BatchEncoding = tok.batch_encode_plus(
    batch_text_or_text_pairs=input_list,
    padding='longest',
    # truncation='only_second',
    truncation='longest_first',
    return_tensors='pt',
)
transformers.logging.set_verbosity(old_level)
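A slightly tidier variant of the same save/restore pattern is a small context manager (a hypothetical helper, not part of transformers):

from contextlib import contextmanager
import transformers

@contextmanager
def quiet_transformers():
    # Temporarily raise the transformers log level to ERROR, then restore it,
    # even if tokenization raises an exception.
    old_level = transformers.logging.get_verbosity()
    transformers.logging.set_verbosity_error()
    try:
        yield
    finally:
        transformers.logging.set_verbosity(old_level)

# Usage:
# with quiet_transformers():
#     res = tok.batch_encode_plus(...)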