microsoft / CodeXGLUE


BigCloneBench is not a suitable dataset #93

Open jkrinke opened 3 years ago

jkrinke commented 3 years ago

By accident I stumbled upon CodeXGLUE and saw that BigCloneBench is included as a dataset. Unless a complete ground truth has been newly created, the existing ground truth cannot be used for training, as it is only a partial ground truth. The assumption that pairs of snippets from different functionalities cannot be clones is wrong, because BigCloneBench contains snippets that are present in multiple functionalities. The authors of BigCloneBench state clearly that it cannot be used for measuring precision.
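
To illustrate the problem with cross-functionality negatives, here is a minimal sketch on made-up data. The tag list, snippet IDs, functionality names, and helper functions are all hypothetical and are not taken from BigCloneBench or CodeXGLUE. It only shows how a snippet tagged under two functionalities can end up in a "non-clone" pair with a snippet it shares a functionality with.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: (snippet_id, functionality) tags.
# These values are invented for illustration, not real BigCloneBench entries.
TAGS = [
    ("s1", "copy_file"),
    ("s2", "copy_file"),
    ("s2", "download_file"),  # s2 is tagged under two functionalities
    ("s3", "download_file"),
]


def functionalities_by_snippet(tags):
    """Map each snippet to the set of functionalities it is tagged under."""
    by_snippet = defaultdict(set)
    for snippet, functionality in tags:
        by_snippet[snippet].add(functionality)
    return dict(by_snippet)


def naive_negative_pairs(tags):
    """Pairs drawn from different functionalities, which the naive
    assumption would label as non-clones."""
    pairs = set()
    for (a, fa), (b, fb) in combinations(tags, 2):
        if a != b and fa != fb:
            pairs.add(tuple(sorted((a, b))))
    return pairs


def contradicted_pairs(tags):
    """Naive negatives whose two snippets also share a functionality,
    so the non-clone label is not justified by the tags."""
    by_snippet = functionalities_by_snippet(tags)
    return {
        (a, b)
        for a, b in naive_negative_pairs(tags)
        if by_snippet[a] & by_snippet[b]
    }


if __name__ == "__main__":
    print(sorted(naive_negative_pairs(TAGS)))
    print(sorted(contradicted_pairs(TAGS)))
```

In this toy case, two of the three naive negatives involve `s2` and a snippet that shares one of its functionalities. Sharing a functionality does not prove the pair is a clone, but it does mean the negative label cannot be derived from the functionality tags alone.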