Dear authors,
I am interested in semantic code clones in general, and I recently found the CodeXGLUE version of BigCloneBench. I manually inspected pairs from train.txt, valid.txt, and test.txt to understand the characteristics of real-world code clones, but I found many pairs that do not look like clones even though they are labeled "1". I am listing a few examples below, and such pairs seem to be common in the dataset.
Could you please explain why these pairs are regarded as clones in this dataset, and how CodeXGLUE processes BigCloneBench to generate them? I also checked the part of the CodeXGLUE paper that covers BigCloneBench, but I could not find how the clone pairs are built or how they are labeled. Given that many deep learning (DL) models (e.g., CodeBERT, GraphCodeBERT, CodeT5) are evaluated on this dataset, it would be great to understand what these models actually learn from pairs that do not look like clones.
I would greatly appreciate any insight you can share!
# test.txt --> "984683\t411595\t1"
9846843
public byte[] getResponse() {
    final ByteArrayInputStream bais = new ByteArrayInputStream(request);
    final ByteArrayOutputStream baos = new ByteArrayOutputStream();
    List<String> lines = Collections.emptyList();
    try {
        @SuppressWarnings("unchecked") List<String> dl = IOUtils.readLines(bais);
        lines = dl;
    } catch (IOException ioex) {
        throw new AssertionError(ioex);
    }
    String resource = null;
    for (String line : lines) {
        if (line.startsWith("GET ")) {
            int endIndex = line.lastIndexOf(' ');
            resource = line.substring(4, endIndex);
        }
    }
    final PrintStream printStream = new PrintStream(baos);
    if (resource == null) {
        printStream.println("HTTP/1.1 400 Bad Request");
    } else {
        final InputStream inputStream = getClass().getResourceAsStream(resource);
        if (inputStream == null) {
            printStream.println("HTTP/1.1 404 Not Found");
            printStream.println();
        } else {
            printStream.println("HTTP/1.1 200 OK");
            printStream.println();
            try {
                IOUtils.copy(inputStream, printStream);
            } catch (IOException ioex) {
                throw new AssertionError(ioex);
            }
        }
    }
    printStream.flush();
    printStream.close();
    return baos.toByteArray();
}
-------------------------------
411595
private void displayDiffResults() throws IOException {
    File outFile = File.createTempFile("diff", ".htm");
    outFile.deleteOnExit();
    FileOutputStream outStream = new FileOutputStream(outFile);
    BufferedWriter out = new BufferedWriter(new OutputStreamWriter(outStream));
    out.write("<html><head><title>LOC Differences</title>\n" + SCRIPT + "</head>\n" + "<body bgcolor='#ffffff'>\n" + "<div onMouseOver=\"window.defaultStatus='Metrics'\">\n");
    if (addedTable.length() > 0) {
        out.write("<table border><tr><th>Files Added:</th>" + "<th>Add</th><th>Type</th></tr>");
        out.write(addedTable.toString());
        out.write("</table><br><br>");
    }
    if (modifiedTable.length() > 0) {
        out.write("<table border><tr><th>Files Modified:</th>" + "<th>Base</th><th>Del</th><th>Mod</th><th>Add</th>" + "<th>Total</th><th>Type</th></tr>");
        out.write(modifiedTable.toString());
        out.write("</table><br><br>");
    }
    if (deletedTable.length() > 0) {
        out.write("<table border><tr><th>Files Deleted:</th>" + "<th>Del</th><th>Type</th></tr>");
        out.write(deletedTable.toString());
        out.write("</table><br><br>");
    }
    out.write("<table name=METRICS BORDER>\n");
    if (modifiedTable.length() > 0 || deletedTable.length() > 0) {
        out.write("<tr><td>Base: </td><td>");
        out.write(Long.toString(base));
        out.write("</td></tr>\n<tr><td>Deleted: </td><td>");
        out.write(Long.toString(deleted));
        out.write("</td></tr>\n<tr><td>Modified: </td><td>");
        out.write(Long.toString(modified));
        out.write("</td></tr>\n<tr><td>Added: </td><td>");
        out.write(Long.toString(added));
        out.write("</td></tr>\n<tr><td>New & Changed: </td><td>");
        out.write(Long.toString(added + modified));
        out.write("</td></tr>\n");
    }
    out.write("<tr><td>Total: </td><td>");
    out.write(Long.toString(total));
    out.write("</td></tr>\n</table></div>");
    redlinesOut.close();
    out.flush();
    InputStream redlines = new FileInputStream(redlinesTempFile);
    byte[] buffer = new byte[4096];
    int bytesRead;
    while ((bytesRead = redlines.read(buffer)) != -1) outStream.write(buffer, 0, bytesRead);
    outStream.write("</BODY></HTML>".getBytes());
    outStream.close();
    Browser.launch(outFile.toURL().toString());
}
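For anyone who wants to repeat this check, here is a minimal sketch of how the positively labeled pairs could be pulled out of a split file. It assumes every line has the tab-separated form `id1\tid2\tlabel`, like the test.txt line quoted above. The class `PairLabels`, the nested type `Pair`, and both method names are hypothetical helpers written for illustration and are not part of CodeXGLUE; the second input line in `main` is a made-up negative pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PairLabels {
    /** One labeled pair from a split file (hypothetical helper type). */
    public static final class Pair {
        public final String firstId;
        public final String secondId;
        public final int label;

        Pair(String firstId, String secondId, int label) {
            this.firstId = firstId;
            this.secondId = secondId;
            this.label = label;
        }
    }

    /** Parses one line, assuming the form "id1<TAB>id2<TAB>label". */
    public static Pair parseLine(String line) {
        String[] fields = line.trim().split("\t");
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 tab-separated fields: " + line);
        }
        return new Pair(fields[0], fields[1], Integer.parseInt(fields[2]));
    }

    /** Keeps only the pairs labeled 1, i.e. the ones marked as clones. */
    public static List<Pair> positives(List<String> lines) {
        List<Pair> result = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            Pair pair = parseLine(line);
            if (pair.label == 1) {
                result.add(pair);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // First line is the pair quoted above; second is an invented negative.
        List<String> lines = Arrays.asList("984683\t411595\t1", "1\t2\t0");
        for (Pair pair : positives(lines)) {
            System.out.println(pair.firstId + " " + pair.secondId);
        }
    }
}
```

The IDs are kept as strings so they can be matched against whatever function index the dataset uses without any numeric conversion.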