Closed: JohnGiorgi closed this issue 2 years ago
Sorry for the delay in answering this, pinging @SaulLu so she can take a look when she has the time :)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@LysandreJik @SaulLu This still happens on the latest version of Transformers and with the latest version of DeBERTa-v3, so I am commenting to keep it open.
Thank you very much for the detailed issue!
Indeed, you have put your finger on an inconsistency: what is happening is that the slow and fast tokenizers of DeBERTa (original) do not behave in the same way:
from transformers import AutoTokenizer

# Setup (reconstructed from the outputs shown below)
special_tokens = ["<SPECIAL>"]
text = "some text with an additional special token <SPECIAL>"

# DeBERTa (original) slow
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/deberta-base",
    additional_special_tokens=special_tokens,
    use_fast=False,
)
print(f"Output with {type(tokenizer)}:\n", tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)))
# => Output with <class 'transformers.models.deberta.tokenization_deberta.DebertaTokenizer'>:
# some text with an additional special token<SPECIAL>
# DeBERTa (original) fast
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/deberta-base",
    additional_special_tokens=special_tokens,
    use_fast=True,
)
print(f"Output with {type(tokenizer)}:\n", tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)))
# => Output with <class 'transformers.models.deberta.tokenization_deberta_fast.DebertaTokenizerFast'>:
# some text with an additional special token <SPECIAL>
As a result, the issue seems more linked to the workflow of the slow tokenizers.
However, finding the right way to fix the problem is less obvious because:
- convert_tokens_to_string is used in _decode of PreTrainedTokenizer (the base class of all slow tokenizers)
- DebertaTokenizer (original) inherits from GPT2Tokenizer, where convert_tokens_to_string is defined
- DebertaV2Tokenizer uses a different strategy than GPT2Tokenizer to implement convert_tokens_to_string (see the short illustration below)
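To illustrate that last point, here is a quick comparison of the two underlying schemes (the exact tokens shown are what I would expect, not outputs taken from this thread): the original DeBERTa uses GPT-2-style byte-level BPE, which encodes a leading space inside the token itself, while DeBERTa v2/v3 uses SentencePiece, which marks word starts with "▁".
from transformers import AutoTokenizer

bpe_tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base", use_fast=False)
spm_tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")

print(bpe_tokenizer.tokenize("some text"))  # byte-level BPE: spaces live inside the tokens, e.g. ['some', 'Ġtext']
print(spm_tokenizer.tokenize("some text"))  # SentencePiece: word starts marked with "▁", e.g. ['▁some', '▁text']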
To get a broader view of the problem, could you share with us what your use case is for this command? (What do you want to see with it? Is it manual work? In production?)
Thanks for the detailed response @SaulLu!
I have a task where I need to add special tokens to the text to introduce some structure. A common use case of this is the "marker tokens" used in relation extraction. A simplified example is:
text = "<ORG> Apple </ORG> is looking at buying <GPE> U.K. </GPE> startup for <MONEY> $1 billion </MONEY>"
Ideally, we could add all these tokens as additional_special_tokens so they don't get split. Indeed, it works fine with BERT and the original DeBERTa, so I was curious as to why it doesn't work with DeBERTa V3.
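For reference, a minimal version of that setup might look like the sketch below (the model name is just for illustration; the point is only that the markers survive tokenization as single tokens):
from transformers import AutoTokenizer

special_tokens = ["<ORG>", "</ORG>", "<GPE>", "</GPE>", "<MONEY>", "</MONEY>"]
text = "<ORG> Apple </ORG> is looking at buying <GPE> U.K. </GPE> startup for <MONEY> $1 billion </MONEY>"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", additional_special_tokens=special_tokens)
print(tokenizer.tokenize(text))
# each marker comes back as a single, unsplit token, e.g. '<ORG>' and '</ORG>'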
Thank you very much for your answer! Very interesting use case!
And in particular, why do you need to use tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)) for this use case?
For DeBERTa (original and V3), I guess the tokenizer.decode(tokenizer.encode(text)) command should give the result you were expecting initially. :blush:
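For example, with the special_tokens and text from the first snippet above (a quick sketch; the output shown is what I would expect based on this thread, not a logged result):
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base", additional_special_tokens=special_tokens)
print(tokenizer.decode(tokenizer.encode(text, add_special_tokens=False)))
# => some text with an additional special token <SPECIAL>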
Ahhh, tokenizer.decode(tokenizer.encode(text)) does work! And it works for BERT as well.
There was no specific reason to use convert_tokens_to_string, I just thought that would be the correct method to use! Thanks for the tip with tokenizer.decode(tokenizer.encode(text)).
Actually, I now remember why I wanted to use convert_tokens_to_string. In the case of an autoregressive decoder that generates its output token by token, possibly including some special tokens, I would like to recover from that output a string that maintains the expected spaces around the special tokens. Here is a simplified example:
special_tokens = ["<DISEASE>", "</DISEASE>", "<DRUG>", "</DRUG>"]
text = "<DISEASE> Anaphylaxis </DISEASE> to <DRUG> cisplatin </DRUG> is an infrequent life-threatening complication"
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small", additional_special_tokens=special_tokens)
# Tokenize the text to mimic what a decoder would have generated, token-by-token
decoder_output = tokenizer.tokenize(text)
print(decoder_output)
# => ['<DISEASE>', '▁Ana', 'phyl', 'axis', '</DISEASE>', '▁to', '<DRUG>', '▁cisplatin', '</DRUG>', '▁is', '▁an', '▁infrequent', '▁life', '-', 'threatening', '▁complication']
# Try to go backwards
print(tokenizer.convert_tokens_to_string(decoder_output))
# => <DISEASE> Anaphylaxis</DISEASE> to<DRUG> cisplatin</DRUG> is an infrequent life-threatening complication
This doesn't produce the correct spacing. I can solve that using the decode(encode()) strategy:
print(tokenizer.decode(tokenizer.encode(tokenizer.convert_tokens_to_string(decoder_output), add_special_tokens=False)))
# => <DISEASE> Anaphylaxis </DISEASE> to <DRUG> cisplatin </DRUG> is an infrequent life-threatening complication
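As an aside, the missing spaces from convert_tokens_to_string above look consistent with a SentencePiece-style join, where the added special tokens carry no leading "▁". A rough sketch of that logic (not the actual library code) reproduces the bad spacing exactly:
# Illustrative only: a rough SentencePiece-style join of decoder_output
print("".join(decoder_output).replace("▁", " ").strip())
# => <DISEASE> Anaphylaxis</DISEASE> to<DRUG> cisplatin</DRUG> is an infrequent life-threatening complication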
I guess the only downside is that you have to call 3 (!) tokenizer methods to get the job done (decode, encode and convert_tokens_to_string).
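If this comes up often, a tiny helper along these lines keeps it to a single call at the call site (tokens_to_text is just my own wrapper name, not a library function):
def tokens_to_text(tokenizer, tokens):
    # Wraps the convert_tokens_to_string -> encode -> decode round trip from above
    string = tokenizer.convert_tokens_to_string(tokens)
    return tokenizer.decode(tokenizer.encode(string, add_special_tokens=False))

print(tokens_to_text(tokenizer, decoder_output))
# => <DISEASE> Anaphylaxis </DISEASE> to <DRUG> cisplatin </DRUG> is an infrequent life-threatening complication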
Environment info
transformers version: 4.12.5

Who can help
@LysandreJik

Information
Model I am using (Bert, XLNet ...): microsoft/deberta-v3-small
The problem arises when using:
The tasks I am working on is:
To reproduce
Steps to reproduce the behavior:
1. Load a tokenizer with one or more additional_special_tokens.
2. Call tokenize on some text that contains one or more of those special tokens.
3. Call convert_tokens_to_string on the resulting tokens.
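A minimal snippet along those lines (a sketch reconstructed from the examples discussed above):
from transformers import AutoTokenizer

special_tokens = ["<SPECIAL>"]
text = "some text with an additional special token <SPECIAL>"
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small", additional_special_tokens=special_tokens)
print(tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)))
# actual output loses the space before the special token:
# some text with an additional special token<SPECIAL>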
Expected behavior
I expect that spaces before/after any special tokens added with additional_special_tokens will be preserved when calling tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)).