sefcom / VarBERT


Improve deduplication for VarCorpus dataset splitting #1

Open ndrewh opened 7 months ago

ndrewh commented 7 months ago

Thanks for the really cool work and the open dataset!

I noticed the deduplication (when splitting into test/train) is missing some near/exact duplicates. I understand that near-duplicate detection is tricky, but I think some more aggressive (text-based) normalization could help.

For Ghidra, FUN_*, LAB_*, DAT_*, and PTR_* (among others) all leak offsets into the decompilation text. Similar labels exist in IDA.

For example, take these two cases from the Ghidra_O3 dataset (per-binary split):

They are considered distinct by the current deduplication algorithm. However, diffing them shows that the only difference is the offsets of callees within their respective binaries:

 {
   if ((@@var_0@@wine_dbch_xmllite@@ & 8) != 0) {
     if (3 < @@var_1@@property@@) {
-      FUN_0011c1b0();
+      FUN_0011c190();
     }
-    FUN_0011bd30();
+    FUN_0011bd10();
   }
   if (@@var_1@@property@@ == 2) {
     *(@@var_2@@iface@@ + 0x28) = @@var_3@@value@@ != 0;
@@ -47,7 +47,7 @@
   }
   if (@@var_1@@property@@ != 1) {
     if ((@@var_0@@wine_dbch_xmllite@@ & 1) != 0) {
-      FUN_0011bd30();
+      FUN_0011bd10();
     }
     return 0x80004001;
   }

These functions are derived from identical source code.

I understand that we can't say for certain the calls are the same from the text above (we would need to check the callee code, I guess), but I would argue that even if you normalize the callsites (i.e. replacing FUN_* with FUN), the false-positive rate will be very low. Most pairs of nontrivial functions would be expected to differ by far more than just the names of the called functions.
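
Roughly, the normalization I have in mind looks like this (a minimal sketch; it assumes the decompiled body is available as a plain string, and the label prefixes are the Ghidra defaults listed above):

import re

# Ghidra's auto-generated labels (FUN_/LAB_/DAT_/PTR_) embed binary-specific
# offsets; collapse them to the bare prefix so two copies of the same function
# compare equal as text.
GHIDRA_LABEL = re.compile(r'\b(FUN|LAB|DAT|PTR)_[0-9a-fA-F]+\b')

def normalize_body(body):
    body = GHIDRA_LABEL.sub(lambda m: m.group(1), body)
    # also collapse whitespace so formatting differences don't matter
    return ' '.join(body.split())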

~20% of the test set is duplicated in train, after normalizing for FUN_*, LAB_*, DAT_*, and PTR_*.

I would argue that deduplication should ignore all function names, not just the FUN_* names left by Ghidra. There are cases where identical library functions may use alternative function names (e.g. see --zprefix for zlib), and there are probably cases in the dataset where identical functions have names in some binaries and not in others.
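
For reference, a rough sketch of how this kind of overlap can be measured (purely illustrative, not the dataset's actual tooling; the inputs are just the decompiled bodies as strings):

import hashlib
import re

GHIDRA_LABEL = re.compile(r'\b(FUN|LAB|DAT|PTR)_[0-9a-fA-F]+\b')

def body_hash(body):
    # hash of the label-normalized, whitespace-collapsed function text
    normalized = ' '.join(GHIDRA_LABEL.sub(lambda m: m.group(1), body).split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def test_train_overlap(train_bodies, test_bodies):
    # fraction of test functions whose normalized text also appears in train
    train_hashes = {body_hash(b) for b in train_bodies}
    dups = sum(1 for b in test_bodies if body_hash(b) in train_hashes)
    return dups / len(test_bodies)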

Atipriya commented 5 months ago

Thank you for your interest in our work!

The above example seems to be from Ghidra-O2, so I looked at a few cases and followed the suggested approach of normalizing FUN_*, LAB_*, DAT_*, and PTR_*. While I found instances where this approach could be helpful, there are also cases where functions appear similar after normalization, but their source code and the decompiled code from binaries with debug information say otherwise.

For example:

void FUN_00112210(void)
{
     long @@var_0@@uVar1@@;
-    @@var_0@@uVar1@@ = FUN_00111d90();
+    @@var_0@@uVar1@@ = FUN_001124c0();
     if (@@var_0@@uVar1@@ != 0) {
       return;
     }
-    FUN_00111df0();
+    FUN_0010cab0();
     return;
}

void FUN_00109590(void *@@var_0@@list@@)
{
-    FUN_00109530();
+    FUN_001091c0();
     free(@@var_0@@list@@);
     return;
}

In the above examples, the functions in each pair come from different projects, have different call sites, and their source code differs. I think the suggested approach could be incorporated as future work, but it requires more detailed analysis to avoid removing valid function samples.

ndrewh commented 4 months ago

Thanks for your thorough response!

The examples you gave are interesting, and I am generally curious how often such cases occur. However, in your dataset we can also use the (debug) function name (func_name_dwarf) to disambiguate. The examples I gave are both from a function named xmlwriter_SetProperty (likely from projects which share a dependency, or from forks). By contrast, the examples you suggested do not have the same function names (get_acpi_rsdp vs. nkf_default_encoding, free_match_line_list vs. list_free), which suggests they are not duplicates and should not be removed from the dataset.

Counting only cases where normalized function bodies and function names are identical still shows a similar level of duplication (~19% of functions in test are also in train). Of course, this still may not catch all duplicate functions (and definitely will not match slight variations of the same function), but the functions it catches are almost certainly source-code duplicates.
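
In other words, the stricter check keys on both the DWARF name and the normalized body; a sketch of that key (func_name_dwarf is the dataset field mentioned above, the other field name is just a placeholder):

import hashlib
import re

GHIDRA_LABEL = re.compile(r'\b(FUN|LAB|DAT|PTR)_[0-9a-fA-F]+\b')

def strict_dedup_key(sample):
    # only treat two samples as duplicates when both the debug (DWARF) name
    # and the label-normalized decompiled body match exactly
    normalized = ' '.join(
        GHIDRA_LABEL.sub(lambda m: m.group(1), sample['decompiled_text']).split())
    return (sample['func_name_dwarf'], hashlib.sha256(normalized.encode()).hexdigest())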

Even if de-duplication is not changed, my hope is that future work at least reports accuracy on memorizable and non-memorizable functions separately, or compares against a memorizing baseline. The simplest such baseline would match functions against the training set (e.g. using FLIRT, binja sigkit, or just exact string matches after thorough normalization) and predict the same names. Clearly, VarBERT is doing more than memorization, but it would be nice if future work quantified this, since there are simpler tools for dealing with libraries of known functions.
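
To illustrate, a minimal sketch of what I mean by a memorizing baseline (an exact lookup on normalized text; the sample fields are placeholders, and this is of course the simplest possible variant):

import hashlib
import re

GHIDRA_LABEL = re.compile(r'\b(FUN|LAB|DAT|PTR)_[0-9a-fA-F]+\b')

def _key(body):
    normalized = ' '.join(GHIDRA_LABEL.sub(lambda m: m.group(1), body).split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def build_memorizer(train_samples):
    # map normalized decompiled text -> the variable names seen for it in training
    return {_key(s['decompiled_text']): s['var_names'] for s in train_samples}

def predict(memorizer, decompiled_text):
    # return the memorized names on an exact normalized match, otherwise None
    return memorizer.get(_key(decompiled_text))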