salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Question about the translation task and failure to reproduce translation results #49

Closed · ghost closed this issue 2 years ago

ghost commented 2 years ago

hi,

For the C#/Java translation task, I see that CodeBLEU is not reported in the paper. Could you share the scores, or publish the translation results? The CodeBLEU score is important for this task.

Thanks

I downloaded the released model and ran inference on the Java/C# translation task. I got the results below, which do not match the paper:

cs to java translation

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration

# Load the released fine-tuned C#-to-Java translation checkpoint.
tokenizer = RobertaTokenizer.from_pretrained('path/to/codet5/cs_java')
model = T5ForConditionalGeneration.from_pretrained('path/to/codet5/cs_java')
model = model.to("cuda")

def predict(samples):
    """Greedily decode a translation for each input function."""
    results = []
    for sample in samples:
        input_ids = tokenizer(sample, return_tensors="pt").input_ids.to("cuda")
        generated_ids = model.generate(input_ids, max_length=510)
        results.append(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
    return results

# One C# function per line in the CodeXGLUE test split.
with open("./test.java-cs.txt.cs", "r") as f:
    cs = [line.strip() for line in f]
results = predict(cs)
```
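(As an aside, the loop above decodes one sample at a time. A batched variant, a sketch of mine rather than anything from the repo, reusing the `tokenizer` and `model` loaded above, would speed inference up considerably:

```python
import torch

def predict_batched(samples, batch_size=32):
    """Hypothetical batched variant of predict() above."""
    results = []
    for i in range(0, len(samples), batch_size):
        # Pad each batch to a common length and pass the attention mask
        # so padding tokens are ignored during generation.
        batch = tokenizer(samples[i:i + batch_size], padding=True,
                          truncation=True, max_length=510,
                          return_tensors="pt").to("cuda")
        with torch.no_grad():
            generated_ids = model.generate(**batch, max_length=510)
        results.extend(tokenizer.batch_decode(generated_ids,
                                              skip_special_tokens=True))
    return results
```

This should not change the scores, only the wall-clock time.)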

scores

BLEU: 77.79
ngram match: 0.7778875426766637
weighted ngram match: 0.7859241463045725
syntax_match: 0.9075318329182916
dataflow_match: 0.9004485422377274
CodeBLEU score: 0.8429480160343139
EM: 0.649 (= 649/1000)

java to cs translation

BLEU: 81.57
ngram match: 0.8157761914953569
weighted ngram match: 0.827130874395443
syntax_match: 0.8968348170128586
dataflow_match: 0.9094303577631122
CodeBLEU score: 0.8622930601666927
EM: 0.618 (= 618/1000)
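For what it's worth, both CodeBLEU scores above are just the uniform average of the four components (the default 0.25/0.25/0.25/0.25 weighting in the CodeXGLUE evaluator), which the reported numbers confirm:

```python
# Sanity check: CodeBLEU under the default uniform component weights
# (alpha = beta = gamma = theta = 0.25 in the CodeXGLUE evaluator).
components = {
    "cs_to_java": (0.7778875426766637, 0.7859241463045725,
                   0.9075318329182916, 0.9004485422377274),
    "java_to_cs": (0.8157761914953569, 0.827130874395443,
                   0.8968348170128586, 0.9094303577631122),
}
for direction, parts in components.items():
    print(f"{direction}: CodeBLEU = {0.25 * sum(parts):.4f}")
# cs_to_java: CodeBLEU = 0.8429
# java_to_cs: CodeBLEU = 0.8623
```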

ghost commented 2 years ago

@yuewang-cuhk Hi Yue, could you help with this? Thanks a lot.

yuewang-cuhk commented 2 years ago

Hi @runningmq, sorry for the late response. It seems your replicated inference script differs from the one we use. I would suggest trying our released script to reproduce the results; please refer to here for more details.
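For anyone hitting the same gap: one plausible source of such a discrepancy (my guess, not confirmed above) is the decoding strategy. The snippet earlier in the thread decodes greedily, while generation scripts typically use beam search, which alone can shift BLEU by a noticeable margin. A sketch, with the beam size as an assumed value:

```python
# Hypothetical: swap greedy decoding for beam search in predict().
# The beam size of 10 is an assumption for illustration, not a
# confirmed setting from the released CodeT5 script.
generated_ids = model.generate(
    input_ids,
    max_length=510,
    num_beams=10,         # keep 10 candidate sequences per step
    early_stopping=True,  # stop once all beams have produced EOS
)
```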