huggingface / transformers

How to prevent tokenizer from outputting certain information #14285

Closed: wmathor closed this issue 2 years ago

wmathor commented 2 years ago
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
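(For context, a minimal sketch of the kind of call that produces this warning; the model name and inputs are just illustrative, and it assumes a slow, i.e. pure-Python, tokenizer and a text pair long enough that truncation actually removes tokens.)

from transformers import AutoTokenizer

# Any checkpoint works here; use_fast=False selects the Python tokenizer, whose
# truncation code in tokenization_utils_base.py emits the warning quoted above.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)

question = "What is the warning about? " * 20
context = "Overflowing tokens are handled differently for sequence pairs. " * 50

# truncation=True maps to the 'longest_first' strategy, so truncating this
# sequence pair prints the warning on every such call.
encoded = tokenizer(question, context, truncation=True, max_length=128)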
MoritzLaurer commented 2 years ago

I'm also slightly confused by the warning. If I'm reading the source code correctly, this is really only a warning and is not tied to an actual issue. It's slightly confusing because you get it many times when you tokenize a lot of text, and it feels like there is something wrong. Maybe it could just be printed once? https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_utils_base.py

wmathor commented 2 years ago

I'm also slightly confused by the warning. If I'm reading the source code correctly, this is really only a warning and is not tied to an actual issue. It's slightly confusing because you get it many times when you tokenize a lot of text, and it feels like there is something wrong. Maybe it could just be printed once? https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_utils_base.py

I tried modifying the source code of tokenization_utils_base.py and deleting the warning code segment. It works!

eduOS commented 2 years ago

Set the verbosity level as follows:

transformers.logging.set_verbosity_error()
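(A self-contained version of that suggestion; the call belongs at the top of the script, before any tokenization happens.)

import transformers

# Show only errors from the transformers library; warnings and info messages,
# including the truncation warning above, are suppressed for this process.
transformers.logging.set_verbosity_error()

Setting the TRANSFORMERS_VERBOSITY=error environment variable before launching the script should have the same effect.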

wmathor commented 2 years ago

Set the verbosity level as follows:

transformers.logging.set_verbosity_error()

thank you so much!

nreimers commented 2 years ago

I'm using the Trainer class with a dataset that I stream (as it is too large) and perform on-the-fly tokenization (i.e. each mini-batch is passed to the tokenizer). Sadly, this warning appears with every mini-batch, which is quite annoying.

Sadly, transformers.logging.set_verbosity_error() doesn't work in the multi-process setup (I use the Trainer with multiple GPUs). It also removes logs from the Trainer that are quite relevant / interesting.

It would be great if this warning could be removed, or changed so that it is only printed once.

LysandreJik commented 2 years ago

Indeed, it could be done by using the following variable:

https://github.com/huggingface/transformers/blob/68810aa26c083fd97d976cef7ac65fdd9cc9b520/src/transformers/tokenization_utils_base.py#L1462-L1464

You can see an example of it used here:

https://github.com/huggingface/transformers/blob/68810aa26c083fd97d976cef7ac65fdd9cc9b520/src/transformers/tokenization_utils_base.py#L1486-L1490

Feel free to open a PR to offer this change!
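(For reference, a self-contained sketch of the kind of "log only once" pattern the linked lines use; the names WarnOnceMixin and warn_once below are illustrative, not from the library.)

import logging

logger = logging.getLogger(__name__)

class WarnOnceMixin:
    def __init__(self):
        # Remembers which warning keys have already been emitted on this instance.
        self.deprecation_warnings = {}

    def warn_once(self, key: str, message: str, verbose: bool = True):
        # Log the message the first time a key is seen, then stay silent for it.
        if verbose and not self.deprecation_warnings.get(key, False):
            logger.warning(message)
        self.deprecation_warnings[key] = True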

nreimers commented 2 years ago

Thanks for the pointer @LysandreJik

Will create a PR on this

lextoumbourou commented 2 years ago

This warning can be pretty noisy when your batch size is low, and the dataset is big. It would be nice only to see this warning once, as nreimers mentioned.


For anyone coming from Google who cannot suppress the warning with eduOS's solution: the nuclear option is to disable all warnings in Python like this:

import logging
logging.disable(logging.WARNING)
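(If the blanket disable is only wanted around the noisy tokenization step, it can be lifted again afterwards with the standard library:)

import logging

logging.disable(logging.WARNING)  # silence WARNING and below for every logger
# ... tokenize here ...
logging.disable(logging.NOTSET)   # restore normal logging afterwards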

heya5 commented 2 years ago

I'm just confused by this sentence in the warning:

So the returned list will always be empty even if some tokens have been removed.

Does it mean I will get an empty returned list?

Larleyt commented 2 years ago

I second the last message. I started getting this warning when using the tokenizer with the text_pair (context) argument. After that I tried to decode the messages and... about 2/3 of them came back empty. Why? How can I prevent that? It was fine without the text_pair arg.

searchivarius commented 1 year ago

Hi @nreimers @LysandreJik and others. This issue is still open and I found no respective deprecation warning in main. One quick fix that doesn't affect multiprocessing or global logging (at least not permanently) is to set the logging level only before tokenization and restore it afterwards. Yes, it is frustrating, but it seems to work, e.g.:

import transformers
from transformers import BatchEncoding

# `tok` is assumed to be any tokenizer instance and `input_list` the batch of
# (text, text_pair) tuples to encode; both are defined elsewhere.
old_level = transformers.logging.get_verbosity()
transformers.logging.set_verbosity_error()
res: BatchEncoding = tok.batch_encode_plus(batch_text_or_text_pairs=input_list,
                                           padding='longest',
                                           # truncation='only_second',
                                           truncation='longest_first',
                                           return_tensors='pt')
transformers.logging.set_verbosity(old_level)