huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers

DeBERTa-v3 does not preserve spaces before/after additional special tokens in convert_tokens_to_string output #14502

Closed JohnGiorgi closed 2 years ago

JohnGiorgi commented 2 years ago

Environment info

Who can help

@LysandreJik

Information

Model I am using (Bert, XLNet ...): microsoft/deberta-v3-small

The problem arises when using:

The tasks I am working on are:

To reproduce

Steps to reproduce the behavior:

  1. Initialize a DeBERTa-v3 tokenizer with additional_special_tokens.
  2. Tokenize some text containing one or more of those special tokens with tokenize.
  3. Attempt to convert the tokens back to a string with convert_tokens_to_string.
  4. Observe that DeBERTa-v3 does not include a space before/after the special token in the resulting string; BERT (and earlier versions of DeBERTa) do.

from transformers import AutoTokenizer

special_tokens = ["<SPECIAL>"]
text = "some text with an additional special token <SPECIAL>"

# BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", additional_special_tokens=special_tokens)
print(tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)))
# => some text with an additional special token <SPECIAL>

# DeBERTa (original)
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base", additional_special_tokens=special_tokens)
print(tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)))
# => some text with an additional special token <SPECIAL>

# DeBERTa (v3)
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small", additional_special_tokens=special_tokens)
print(tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)))
# => some text with an additional special token<SPECIAL>

Expected behavior

I expect that spaces before/after any special tokens added with additional_special_tokens will be preserved when calling tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)).
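A minimal sketch of that expectation as a check (reusing the tokenizer and text variables from the snippet above; with the current behavior this fails for microsoft/deberta-v3-small):

output = tokenizer.convert_tokens_to_string(tokenizer.tokenize(text))
# The space before the special token should survive the round trip
assert " <SPECIAL>" in output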

LysandreJik commented 2 years ago

Sorry for the delay in answering this, pinging @SaulLu so she can take a look when she has the time :)

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

JohnGiorgi commented 2 years ago

@LysandreJik @SaulLu This still happens on the latest version of Transformers and with the latest version of DeBERTa-v3, so I am commenting to keep it open.

SaulLu commented 2 years ago

Thank you very much for the detailed issue!

Indeed, you have put your finger on an inconsistency: the slow and fast tokenizers of DeBERTa (original) do not behave in the same way:

# DeBERTa (original) slow
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/deberta-base", 
    additional_special_tokens=special_tokens, 
    use_fast=False
)
print(f"Output with {type(tokenizer)}:\n", tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)))
# => Output with <class 'transformers.models.deberta.tokenization_deberta.DebertaTokenizer'>:  
# some text with an additional special token<SPECIAL>

# DeBERTa (original) fast
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/deberta-base", 
    additional_special_tokens=special_tokens, 
    use_fast=True
)
print(f"Output with {type(tokenizer)}:\n", tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)))
# => Output with <class 'transformers.models.deberta.tokenization_deberta_fast.DebertaTokenizerFast'>: 
# some text with an additional special token <SPECIAL>

As a result, the issue seems more linked to the workflow of the slow tokenizers.

However, finding the right way to fix the problem is less obvious because:

To get a broader view of the problem, could you share with us what your use case is for this command? (What do you want to see with it? Is it manual work? Is it in production?)

JohnGiorgi commented 2 years ago

Thanks for the detailed response @SaulLu!

I have a task where I need to add special tokens to the text to introduce some structure. A common use case of this is the "marker tokens" used in relation extraction. A simplified example is:

text = "<ORG> Apple </ORG> is looking at buying <GPE> U.K. </GPE> startup for <MONEY> $1 billion </MONEY>"

Ideally, we could add all these tokens as additional_special_tokens so they don't get split. Indeed, it works fine with BERT and the original DeBERTa, so I was curious as to why it doesn't work with DeBERTa V3.
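In case the setup details are useful, here is a minimal sketch of how I register the marker tokens (the token names mirror the example above; the embedding resize is only needed when the tokens are new to the vocabulary):

from transformers import AutoModel, AutoTokenizer

marker_tokens = ["<ORG>", "</ORG>", "<GPE>", "</GPE>", "<MONEY>", "</MONEY>"]

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")
# add_special_tokens returns the number of tokens actually added to the vocabulary
num_added = tokenizer.add_special_tokens({"additional_special_tokens": marker_tokens})

model = AutoModel.from_pretrained("microsoft/deberta-v3-small")
# Grow the embedding matrix so the new marker tokens get (randomly initialized) vectors
model.resize_token_embeddings(len(tokenizer))

# The marker tokens are kept intact rather than split into subwords
print(tokenizer.tokenize("<ORG> Apple </ORG> is looking at buying <GPE> U.K. </GPE> startup"))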

SaulLu commented 2 years ago

Thank you very much for your answer! Very interesting use case!

And in particular, why do you need to use tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)) for this use case?

For DeBERTa (original and V3), I guess the tokenizer.decode(tokenizer.encode(text)) command should give the result you were expecting initially. :blush:
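For reference, a quick sketch of that round trip, reusing the variables from the original report (add_special_tokens=False is only there to keep [CLS]/[SEP] out of the printed string):

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small", additional_special_tokens=special_tokens)
ids = tokenizer.encode(text, add_special_tokens=False)
print(tokenizer.decode(ids))
# => some text with an additional special token <SPECIAL>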

JohnGiorgi commented 2 years ago

Ahhh, tokenizer.decode(tokenizer.encode(text)) does work! And it works for BERT as well.

There was no specific reason to use convert_tokens_to_string; I just thought that would be the correct method to use! Thanks for the tip with tokenizer.decode(tokenizer.encode(text)).

JohnGiorgi commented 2 years ago

Actually, I now remember why I wanted to use convert_tokens_to_string. Consider an autoregressive decoder that generates its output token by token, and that output may include some special tokens. I would like to recover from that output a string that maintains the expected spaces around the special tokens. Here is a simplified example:

from transformers import AutoTokenizer

special_tokens = ["<DISEASE>", "</DISEASE>", "<DRUG>", "</DRUG>"]
text = "<DISEASE> Anaphylaxis </DISEASE> to <DRUG> cisplatin </DRUG> is an infrequent life-threatening complication"

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small", additional_special_tokens=special_tokens)

# Tokenize the text to mimic what a decoder would have generated, token-by-token
decoder_output = tokenizer.tokenize(text)
print(decoder_output)
# => ['<DISEASE>', '▁Ana', 'phyl', 'axis', '</DISEASE>', '▁to', '<DRUG>', '▁cisplatin', '</DRUG>', '▁is', '▁an', '▁infrequent', '▁life', '-', 'threatening', '▁complication']
# Try to go backwards
print(tokenizer.convert_tokens_to_string(decoder_output))
# => <DISEASE> Anaphylaxis</DISEASE> to<DRUG> cisplatin</DRUG> is an infrequent life-threatening complication

This doesn't produce the correct spacing. I can solve that using the decode(encode()) strategy:

print(tokenizer.decode(tokenizer.encode(tokenizer.convert_tokens_to_string(decoder_output), add_special_tokens=False)))
# => <DISEASE> Anaphylaxis </DISEASE> to <DRUG> cisplatin </DRUG> is an infrequent life-threatening complication

I guess the only downside is that you have to call 3 (!) tokenizer methods to get the job done (decode, encode and convert_tokens_to_string).
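If it helps, the three calls fold into a small helper (a sketch; tokens_to_string is a hypothetical name, not a transformers method):

def tokens_to_string(tokenizer, tokens):
    """Convert generated tokens back into text, preserving spaces around additional special tokens."""
    text = tokenizer.convert_tokens_to_string(tokens)
    # Round-tripping through encode/decode restores the spacing that
    # convert_tokens_to_string drops around added special tokens
    return tokenizer.decode(tokenizer.encode(text, add_special_tokens=False))

print(tokens_to_string(tokenizer, decoder_output))
# => <DISEASE> Anaphylaxis </DISEASE> to <DRUG> cisplatin </DRUG> is an infrequent life-threatening complication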

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.