BigCloneBench is not a suitable dataset

I came across CodeXGLUE by accident and saw that it includes BigCloneBench as a dataset. Unless a complete ground truth has been newly created, the existing ground truth cannot be used for training, because it is only partial: a pair that is not labeled as a clone is not thereby a confirmed non-clone. The assumption that pairs of snippets from different functionalities cannot be clones is wrong, because BigCloneBench contains snippets that appear in multiple functionalities. The authors of BigCloneBench state clearly that it cannot be used to measure precision.
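To illustrate the problem with the cross-functionality assumption, here is a minimal sketch. The `tags` records, the snippet and functionality IDs, and the function names are all hypothetical and do not reflect the actual BigCloneBench schema or data. It only shows how a rule of the form "different functionality means non-clone" produces negative labels for pairs that also share a functionality.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (snippet_id, functionality_id) tags; NOT real BigCloneBench data.
tags = [
    (101, 2),
    (102, 2),
    (102, 7),  # snippet 102 is tagged under two functionalities
    (103, 7),
    (104, 9),
]


def functionalities_by_snippet(tags):
    """Map each snippet ID to the set of functionalities it is tagged under."""
    result = defaultdict(set)
    for snippet_id, functionality_id in tags:
        result[snippet_id].add(functionality_id)
    return dict(result)


def naive_negative_pairs(tags):
    """Pairs the flawed rule would label as non-clones.

    The rule (an assumption made here for illustration) marks two snippets
    as a negative pair whenever they occur under different functionalities.
    """
    negatives = set()
    for (s1, f1), (s2, f2) in combinations(tags, 2):
        if s1 != s2 and f1 != f2:
            negatives.add(tuple(sorted((s1, s2))))
    return negatives


def contradicted_pairs(tags):
    """Naive negatives whose snippets nevertheless share a functionality."""
    by_snippet = functionalities_by_snippet(tags)
    return {
        (a, b)
        for a, b in naive_negative_pairs(tags)
        if by_snippet[a] & by_snippet[b]
    }


multi_tagged = {s for s, fs in functionalities_by_snippet(tags).items() if len(fs) > 1}
print(sorted(multi_tagged))              # [102]
print(sorted(contradicted_pairs(tags)))  # [(101, 102), (102, 103)]
```

In this toy data, the single multi-tagged snippet is enough to make the naive rule emit negative labels for two pairs that share a functionality. Running a check like this against the real tagging data would show how many derived negatives are affected.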

microsoft/CodeXGLUE, issue #93