Open JoaoLages opened 10 months ago
I just found out that `\n` and `\t` have the exact same token id 😐

```python
tokenizer.convert_tokens_to_ids(["\n", "\t"])
# Out[35]: [3, 3]
```

Edit: yes, they are both the UNK id:

```python
tokenizer.unk_token_id
# Out[39]: 3
```
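The behavior above can be illustrated with a minimal, self-contained sketch (not the real tokenizer): a vocabulary lookup that silently maps every out-of-vocabulary token to the same UNK id. The vocabulary contents and the UNK id of 3 are assumptions chosen to mirror the output above.

```python
UNK_ID = 3  # assumed UNK id, matching the issue output above

# Hypothetical tiny vocab with no entry for "\n" or "\t"
vocab = {"<unk>": UNK_ID, "#": 5, "this": 6, "is": 7}

def convert_tokens_to_ids(tokens):
    """Return the vocab id for each token, falling back to UNK_ID."""
    return [vocab.get(tok, UNK_ID) for tok in tokens]

print(convert_tokens_to_ids(["\n", "\t"]))  # -> [3, 3]
```

This is why two visually different tokens can come back with identical ids: neither is in the vocabulary, so both collapse to the UNK fallback.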
It seems that the problem is with `\n` and `\t` before the special tokens:

```python
aux
# Out[58]: '\t\n# this is a code comment\n\t<extra_id_0>'
tokenizer.decode(tokenizer(aux)["input_ids"], skip_special_tokens=False)
# Out[59]: '<s>\t\n# this is a code comment<extra_id_0></s>'

aux
# Out[62]: '\n# this is a code comment\n<extra_id_0>'
tokenizer.decode(tokenizer(aux)["input_ids"], skip_special_tokens=False)
# Out[63]: '<s>\n# this is a code comment<extra_id_0></s>'
```
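A rough sketch of what `lstrip=True` does here, assuming a simplified model of how added special tokens are matched: any whitespace immediately to the left of the special token is consumed along with the token itself, so it never reaches the encoder. The `split_with_lstrip` helper is hypothetical, purely for illustration.

```python
import re

SPECIAL = "<extra_id_0>"  # one of the added special tokens from the issue

def split_with_lstrip(text, lstrip=True):
    """Simulate special-token matching: with lstrip=True, whitespace
    directly before SPECIAL is swallowed by the match."""
    pattern = (r"\s*" if lstrip else "") + re.escape(SPECIAL)
    # re.split keeps the surrounding text; whatever the pattern matched
    # (including any leading whitespace when lstrip=True) is dropped.
    parts = re.split(pattern, text)
    return SPECIAL.join(parts)

aux = "\n# this is a code comment\n<extra_id_0>"
print(repr(split_with_lstrip(aux, lstrip=True)))   # the "\n" before the token is eaten
print(repr(split_with_lstrip(aux, lstrip=False)))  # the "\n" survives
```

This mirrors the decode outputs above: the newline (or tab) just before `<extra_id_0>` disappears only when the token is registered with `lstrip=True`.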
This is happening because all `<extra_id_*>` tokens have `lstrip` set to `True`. Any reason for this decision?
Indeed, this makes things work:

```python
tokenizer.add_special_tokens(
    {
        "additional_special_tokens": [
            AddedToken(at.content, rstrip=False, lstrip=False,
                       single_word=False, normalized=True)
            for at in tokenizer.special_tokens_map_extended["additional_special_tokens"]
        ]
    },
    replace_additional_special_tokens=True,
)
```
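To make clear what the workaround rebuilds, here is a minimal stand-in for `tokenizers.AddedToken` (a hypothetical dataclass, not the real class) showing the flags being reset on each `<extra_id_*>` token:

```python
from dataclasses import dataclass

@dataclass
class AddedToken:
    """Hypothetical stand-in for tokenizers.AddedToken, for illustration."""
    content: str
    lstrip: bool = False
    rstrip: bool = False
    single_word: bool = False
    normalized: bool = True

# Tokens as they ship in the issue: lstrip=True eats preceding whitespace
shipped = [AddedToken(f"<extra_id_{i}>", lstrip=True) for i in range(3)]

# Rebuilt with lstrip=False, mirroring the add_special_tokens call above
fixed = [AddedToken(t.content, lstrip=False, rstrip=False) for t in shipped]

print([t.content for t in fixed if not t.lstrip])
```

The only change is the `lstrip` flag; the token strings themselves are untouched, so existing input ids remain valid.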
Hi both, thanks for identifying the issue and providing the solution! We did not intentionally set `lstrip` to `True`.
It seems that `\t\n` is not being encoded (or decoded) properly :(