wasiahmad / PLBART

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].
https://arxiv.org/abs/2103.06333

Missing "java" token in Hugging Face Tokenizer #46

Closed: Ahmadreza-SY closed this issue 1 year ago

Ahmadreza-SY commented 1 year ago

Hi,

I am trying to replicate the results of PLBART for the code refinement fine-tuning task using Hugging Face. When I tokenize methods that contain the "java" token and then decode them, the "java" token is strangely removed! Here is my code:

code = "public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
tokenizer = model_tokenizer_class.from_pretrained("uclanlp/plbart-base", language_codes="base")
model_inputs = tokenizer([code])
print(tokenizer.decode(model_inputs['input_ids'][0], skip_special_tokens=True, clean_up_tokenization_spaces=False))
# The code output is: "public void METHOD_1 ( TYPE_1 VAR_1 ) throws .lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"

Also, is there any Hugging Face implementation of the code refinement task using PLBART? My implementation does not reach the EM and BLEU reported for the test set: running the existing fairseq implementation gives EM 17.67, while my Hugging Face implementation only reaches EM 5.62. What important factors should I check?
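
For context, the generation side of my Hugging Face implementation looks roughly like the sketch below; the beam size and length limit are placeholder values I chose, so a mismatch with the fairseq decoding settings on exactly these knobs could explain part of the gap:

import torch
from transformers import PLBartForConditionalGeneration, PLBartTokenizer

# Simplified evaluation sketch; num_beams and max_length are assumptions
# and should be matched to whatever the fairseq evaluation script uses.
tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-base", language_codes="base")
model = PLBartForConditionalGeneration.from_pretrained("uclanlp/plbart-base")

code = "public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
inputs = tokenizer([code], return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        num_beams=5,          # beam size strongly affects EM/BLEU
        max_length=512,       # target length cap; should match the fairseq setup
        early_stopping=True,
    )
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))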

wasiahmad commented 1 year ago

Thank you for pointing this out. It is a bug, as we can see here; FAIRSEQ_LANGUAGE_CODES should instead be defined as:

FAIRSEQ_LANGUAGE_CODES = {
    "base": ["__java__", "__python__", "__en_XX__"],
    "multi": ["__java__", "__python__", "__en_XX__", "__javascript__", "__php__", "__ruby__", "__go__"],
}

Otherwise, the regular java token in the vocabulary is treated as a special token. Now, in the following:

print(tokenizer.decode(model_inputs['input_ids'][0], skip_special_tokens=True, clean_up_tokenization_spaces=False))

Since you pass skip_special_tokens=True, the java token is removed.
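
You can confirm this by decoding without skipping special tokens; the java token then survives, together with </s> and the language-code marker. A self-contained check:

from transformers import PLBartTokenizer

tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-base", language_codes="base")
code = "public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
model_inputs = tokenizer([code])
# Keeping special tokens preserves the (incorrectly special) "java" token:
print(tokenizer.decode(model_inputs['input_ids'][0], skip_special_tokens=False, clean_up_tokenization_spaces=False))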

To verify that tokenization works correctly, we can do:

from transformers import PLBartTokenizer
code = "public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-base", language_codes="base")
print(tokenizer.tokenize(code))

which outputs:

['▁public', '▁void', '▁METHOD', '_1', '▁(', '▁TYPE', '_1', '▁VAR', '_1', '▁)', '▁throws', 'java', '▁.', 'lang', '.', 'Exception', '▁{', '▁super', '▁.', '▁METHOD', '_1', '▁(', '▁VAR', '_1', '▁)', '▁;', '▁METHOD', '_2', '▁(', '▁VAR', '_1', '▁)', '▁;', '▁}']

And the tokenization itself looks fine: the java token is present in the output; it is only dropped at decode time because it is registered as a special token.
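
One more way to confirm the root cause is to inspect the tokenizer's special tokens; assuming PLBartTokenizer registers its language codes as additional special tokens, like other mBART-style tokenizers, the plain java token should show up there until the fix lands:

from transformers import PLBartTokenizer

tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-base", language_codes="base")
# With the buggy mapping this list should include the plain "java" token;
# after the fix it should list "__java__" instead.
print(tokenizer.all_special_tokens)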

wasiahmad commented 1 year ago

@gchhablani Can you help resolve the bug? FAIRSEQ_LANGUAGE_CODES should be defined as:

FAIRSEQ_LANGUAGE_CODES = {
    "base": ["__java__", "__python__", "__en_XX__"],
    "multi": ["__java__", "__python__", "__en_XX__", "__javascript__", "__php__", "__ruby__", "__go__"],
}

wasiahmad commented 1 year ago

Resolved with this PR (https://github.com/huggingface/transformers/pull/19980).
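
For anyone hitting this before the fix reaches a release, a minimal round-trip check after upgrading (installing transformers from the main branch is an assumption here, not something stated in the PR):

# e.g. pip install -U git+https://github.com/huggingface/transformers
from transformers import PLBartTokenizer

tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-base", language_codes="base")
code = "public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
decoded = tokenizer.decode(tokenizer([code])["input_ids"][0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
assert "java" in decoded  # the token is no longer dropped as "special"
print(decoded)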