Open JoaoLages opened 10 months ago
I just found out that `\n` and `\t` have the exact same token id 😐

```python
tokenizer.convert_tokens_to_ids(["\n", "\t"])
# Out[35]: [3, 3]
```

Edit: yes, they are both the UNK id:

```python
tokenizer.unk_token_id
# Out[39]: 3
```
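The behavior above can be illustrated with a minimal, self-contained sketch (not the real tokenizer): a vocabulary lookup that silently maps every out-of-vocabulary token to the same UNK id. The vocabulary contents and the UNK id of 3 are assumptions chosen to mirror the output above.

```python
UNK_ID = 3  # assumed UNK id, matching the issue output above

# Hypothetical tiny vocab with no entry for "\n" or "\t"
vocab = {"<unk>": UNK_ID, "#": 5, "this": 6, "is": 7}

def convert_tokens_to_ids(tokens):
    """Return the vocab id for each token, falling back to UNK_ID."""
    return [vocab.get(tok, UNK_ID) for tok in tokens]

print(convert_tokens_to_ids(["\n", "\t"]))  # -> [3, 3]
```

This is why two visually different tokens can come back with identical ids: neither is in the vocabulary, so both collapse to the UNK fallback.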
It seems that the problem is with `\n` and `\t` before the special tokens:

```python
aux
# Out[58]: '\t\n# this is a code comment\n\t<extra_id_0>'
tokenizer.decode(tokenizer(aux)["input_ids"], skip_special_tokens=False)
# Out[59]: '<s>\t\n# this is a code comment<extra_id_0></s>'

aux
# Out[62]: '\n# this is a code comment\n<extra_id_0>'
tokenizer.decode(tokenizer(aux)["input_ids"], skip_special_tokens=False)
# Out[63]: '<s>\n# this is a code comment<extra_id_0></s>'
```
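A rough sketch of what `lstrip=True` does here, assuming a simplified model of how added special tokens are matched: any whitespace immediately to the left of the special token is consumed along with the token itself, so it never reaches the encoder. The `split_with_lstrip` helper is hypothetical, purely for illustration.

```python
import re

SPECIAL = "<extra_id_0>"  # one of the added special tokens from the issue

def split_with_lstrip(text, lstrip=True):
    """Simulate special-token matching: with lstrip=True, whitespace
    directly before SPECIAL is swallowed by the match."""
    pattern = (r"\s*" if lstrip else "") + re.escape(SPECIAL)
    # re.split keeps the surrounding text; whatever the pattern matched
    # (including any leading whitespace when lstrip=True) is dropped.
    parts = re.split(pattern, text)
    return SPECIAL.join(parts)

aux = "\n# this is a code comment\n<extra_id_0>"
print(repr(split_with_lstrip(aux, lstrip=True)))   # the "\n" before the token is eaten
print(repr(split_with_lstrip(aux, lstrip=False)))  # the "\n" survives
```

This mirrors the decode outputs above: the newline (or tab) just before `<extra_id_0>` disappears only when the token is registered with `lstrip=True`.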
This is happening because all `<extra_id_*>` tokens have `lstrip` set to `True`. Any reason for this decision?
Indeed, this makes things work:

```python
tokenizer.add_special_tokens(
    {
        "additional_special_tokens": [
            AddedToken(at.content, rstrip=False, lstrip=False,
                       single_word=False, normalized=True)
            for at in tokenizer.special_tokens_map_extended["additional_special_tokens"]
        ]
    },
    replace_additional_special_tokens=True,
)
```
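To make clear what the workaround rebuilds, here is a minimal stand-in for `tokenizers.AddedToken` (a hypothetical dataclass, not the real class) showing the flags being reset on each `<extra_id_*>` token:

```python
from dataclasses import dataclass

@dataclass
class AddedToken:
    """Hypothetical stand-in for tokenizers.AddedToken, for illustration."""
    content: str
    lstrip: bool = False
    rstrip: bool = False
    single_word: bool = False
    normalized: bool = True

# Tokens as they ship in the issue: lstrip=True eats preceding whitespace
shipped = [AddedToken(f"<extra_id_{i}>", lstrip=True) for i in range(3)]

# Rebuilt with lstrip=False, mirroring the add_special_tokens call above
fixed = [AddedToken(t.content, lstrip=False, rstrip=False) for t in shipped]

print([t.content for t in fixed if not t.lstrip])
```

The only change is the `lstrip` flag; the token strings themselves are untouched, so existing input ids remain valid.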
Hi both, thanks for identifying the issue and providing the solution! We did not intentionally set `lstrip` to `True`.
It seems that `\t\n` is not being encoded (or decoded) properly :(