salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Defect detection model setup: unigram generation or embedding classification? #107

Closed MayankAgarwal closed 1 year ago

MayankAgarwal commented 1 year ago

Hi CodeT5 authors,

Thank you for providing the code and models for your papers. They have been extremely useful.

I am trying the defect detection pipeline and noticed a discrepancy: the paper says you generate a unigram sequence from the decoder (page 8, under Table 6), while in the code the DefectModel (Code Link) trains a classifier on top of the decoder hidden states, for CodeT5 as well as the other models. From what I understand, the paper suggests you formulated this problem as a seq2seq task, with code as input and a yes/no (true/false) unigram sequence as output.
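To make the seq2seq formulation concrete, here is a minimal PyTorch sketch of the unigram-generation setup as I understand it from the paper (not the repo's actual code): the decoder produces a single step of hidden state, a language-model head projects it onto the vocabulary, and the training target is the token id of "true" or "false". All sizes and token ids below are placeholders for illustration.

```python
import torch
import torch.nn as nn

# Illustrative sizes; CodeT5-base uses d_model=768 and a ~32k-token vocabulary,
# but any values work for this sketch.
vocab_size, d_model, batch = 32100, 768, 4

# Stand-in for the decoder's final hidden state at the single generation step.
decoder_hidden = torch.randn(batch, 1, d_model)

# Seq2seq formulation: project onto the vocabulary and train the model to
# emit the unigram "true" or "false" as its entire output sequence.
lm_head = nn.Linear(d_model, vocab_size, bias=False)
logits = lm_head(decoder_hidden)         # (batch, 1, vocab_size)

# Placeholder token ids for "true"/"false"; the real ids come from the tokenizer.
target_ids = torch.tensor([1, 0, 1, 1])  # one unigram label per example
loss = nn.CrossEntropyLoss()(logits.squeeze(1), target_ids)
print(logits.shape)
```

At inference time, the predicted label is simply `logits.argmax(-1)` restricted to (or compared against) the "true"/"false" token ids.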

Could you please clarify which approach you took for this task?

Thanks!

yuewang-cuhk commented 1 year ago

Hi there, in fact both approaches achieve good results on the defect detection task. For the original CodeT5 paper, we experimented with both and adopted unigram generation so the task fits a unified seq2seq format. For CodeT5+, we use embedding classification instead, since CodeT5+ has a flexible encoder-only mode that supports this task directly. Hope this clarifies your confusion.
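For contrast with the generation setup, the embedding-classification alternative can be sketched as follows. This is a hedged illustration, not the repo's DefectModel: pooled hidden states (from an encoder-only pass in the CodeT5+ case) feed a small two-class head, and the pooling choice and head architecture here are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sizes; the real hidden states would come from CodeT5/CodeT5+.
batch, seq_len, d_model = 4, 128, 768

# Stand-in for the model's hidden states over the source-code tokens.
hidden_states = torch.randn(batch, seq_len, d_model)

# Classification formulation: pool the sequence into one vector and score
# two classes (defective / not defective) with a small head.
pooled = hidden_states.mean(dim=1)  # simple mean pooling for the sketch
classifier = nn.Sequential(
    nn.Linear(d_model, d_model),
    nn.Tanh(),
    nn.Linear(d_model, 2),
)
logits = classifier(pooled)         # (batch, 2)
labels = torch.tensor([0, 1, 1, 0])
loss = nn.CrossEntropyLoss()(logits, labels)
print(logits.shape)
```

The practical difference is small: generation keeps one uniform text-to-text interface across tasks, while the classification head avoids running the decoder at all, which is why it pairs naturally with an encoder-only mode.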

MayankAgarwal commented 1 year ago

Thank you! This was really helpful