microsoft / CodeBERT

Potential Data Leakage in UniXcoder-base zero-shot setting for code search task #283

Closed by lazyhope 1 year ago

lazyhope commented 1 year ago

Hi, I've noticed that the three datasets used for code search fine-tuning (AdvTest, CoSQA, and CSN) all originate from CodeSearchNet. If so, isn't there a possibility of data overlap? My concern is that UniXcoder-base was also pre-trained on NL-PL pairs from the CodeSearchNet dataset. Could you please clarify this?

Thanks

guoday commented 1 year ago

UniXcoder-base is pre-trained only on the training set. The test sets of AdvTest, CoSQA, and CSN are excluded during the pre-training phase.
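
For anyone who wants to double-check this on their own copy of the data, here is a minimal sketch of an exact-match overlap check. It assumes both splits are already loaded as lists of code strings; the function names and the toy snippets are hypothetical and not part of the UniXcoder codebase. It only catches duplicates that differ in whitespace, so renamed identifiers or near-duplicates would need a fuzzier comparison.

```python
import hashlib
import re


def normalize(code: str) -> str:
    """Collapse all whitespace so formatting differences do not hide duplicates."""
    return re.sub(r"\s+", " ", code).strip()


def fingerprint(code: str) -> str:
    """Hash the normalized code so large corpora can be compared cheaply."""
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()


def find_overlap(pretrain_snippets, test_snippets):
    """Return test snippets whose normalized form also appears in the pre-training data."""
    seen = {fingerprint(s) for s in pretrain_snippets}
    return [s for s in test_snippets if fingerprint(s) in seen]


# Toy data standing in for the real corpora (hypothetical examples).
pretrain = ["def add(a, b):\n    return a + b", "def sub(a, b):\n    return a - b"]
test = ["def add(a, b):  return a + b", "def mul(a, b):\n    return a * b"]

leaked = find_overlap(pretrain, test)
print(f"{len(leaked)} of {len(test)} test snippets also appear in the pre-training data")
```

An empty result from `find_overlap` on the real splits would be consistent with the test sets having been held out, at least for exact copies.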

lazyhope commented 1 year ago

I see, thank you for the reply!