github / semantic

Parsing, analyzing, and comparing source code across many languages

RWS hashing & RNG are counterproductive for the similarity metric #25

Open robrix opened 5 years ago

robrix commented 5 years ago

The hashing & RNG used by RWS act counter to its intention of approximating the tree edit distance with the squared Euclidean distance between the d-dimensional feature vectors summarizing nodes. This is caused by two factors:

  1. Many hash functions are designed for the avalanche effect: changing one bit of the input changes approximately half the bits of the output. That means that similar values will have very different hashes. This is good for hash tables, but counter to our purposes, where we want similar values to have similar unit vectors. (By way of contrast, the continuity property would be preferable for our purposes; see the sketch after this list.)
  2. The RNG most likely has the same property, such that similar seeds result in entirely unrelated sequences of random values.
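
For concreteness, here's a minimal sketch of that discontinuity in Haskell (semantic's implementation language), assuming the `hashable` package; the function names are mine, not semantic's:

```haskell
import Data.Bits (popCount, xor)
import Data.Hashable (hash) -- from the 'hashable' package

-- Count how many bits differ between the hashes of two values.
hashBitDistance :: String -> String -> Int
hashBitDistance a b = popCount (hash a `xor` hash b)

main :: IO ()
main = do
  -- A one-character edit typically flips on the order of half the output
  -- bits, so near-identical inputs hash to maximally distant values. For
  -- our metric we'd want the opposite: a small edit should move the
  -- derived unit vector only a little.
  print (hashBitDistance "leaf \"x\"" "leaf \"y\"")
```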

These problems primarily affect the unit vectors, because only individual nodes’ p,q-grams are hashed to seed the RNG and compute the d-dimensional unit vector. So while two similar leaf nodes may have dramatically different unit vectors (and therefore feature vectors), two trees which are mostly the same will nevertheless have mostly similar feature vectors. That is, RWS works in aggregate because similarity is considered across (relatively large) subtrees; the discontinuity introduced by hashing & the RNG is effectively amortized.
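
To make the amortization argument concrete, here's a rough sketch of the mechanism as I understand it, not semantic's actual code; `d`, the choice of RNG, and all names are placeholders:

```haskell
import Data.Hashable (hash)
import System.Random (mkStdGen, randomRs)

-- Dimensionality of the feature vectors (a placeholder value).
d :: Int
d = 15

-- A unit vector whose direction is determined entirely by a p,q-gram's
-- hash: seed an RNG with the hash, draw d components, normalize. This is
-- where the discontinuity enters: similar p,q-grams hash to unrelated
-- seeds, and unrelated seeds yield unrelated directions.
unitVector :: Int -> [Double]
unitVector seed = map (/ magnitude) components
  where
    components = take d (randomRs (-1, 1) (mkStdGen seed))
    magnitude  = sqrt (sum (map (^ 2) components))

-- A subtree's feature vector sums its nodes' unit vectors, so one node's
-- wildly different unit vector is amortized across a large subtree, but
-- dominates a small one.
featureVector :: [String] -> [Double]
featureVector pqGrams =
  foldr (zipWith (+)) (replicate d 0) (map (unitVector . hash) pqGrams)
```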

While that works well at larger scales, it falls down fairly consistently at smaller ones. That's particularly problematic for our use case, where small changes tend to dominate, for human-factors reasons: smaller changes are easier to review. It's also a factor in the unpredictability we've noticed: reordering the syntax constructors changes the hash values assigned to each, which in turn can change the results of our test fixtures.

robrix commented 5 years ago

Instead, we want something more like locality-sensitive hashing, which has the continuity property. I've also been pointed towards principal component analysis, though I don't know much about it just yet.
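
As an example of what continuity buys us, here's a sketch of SimHash, one classical LSH scheme (again assuming `hashable`; the 64-bit signature width and the names are assumptions): each feature votes on every bit position through an ordinary hash, so bags of features that mostly overlap get signatures that differ in only a few bits.

```haskell
import Data.Bits (popCount, setBit, testBit, xor)
import Data.Hashable (hash)

-- SimHash: a locality-sensitive signature for a bag of features. Each
-- feature votes +1/-1 on every bit position via its ordinary hash; the
-- signature keeps the sign of each position's tally.
simHash :: [String] -> Int
simHash features = foldl vote 0 [0 .. 63]
  where
    tally i    = sum [ if testBit (hash f) i then 1 else -1 :: Int | f <- features ]
    vote acc i = if tally i > 0 then setBit acc i else acc

-- Hamming distance between signatures: small for mostly-overlapping bags,
-- which is the continuity property an ordinary hash lacks.
signatureDistance :: Int -> Int -> Int
signatureDistance a b = popCount (a `xor` b)
```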

robrix commented 5 years ago

Some references following a conversation with @kavgan:

robrix commented 5 years ago

The conclusion I arrived at from my conversation with @kavgan:

Reading further, I realize that I misunderstood the roles of LSH, MinHash, and shingling. We’ve got ~three different problems here:

  • A similarity metric for labels. Our current system just hashes them, so any similarity between the text of e.g. leaf nodes is effectively ignored; shingling and so forth would seem applicable (see the sketch after this list).
  • A similarity metric for subtrees. We currently use random-walk similarity, with the problems detailed above. The Jaccard similarity on bags of p,q-grams seems applicable, with MinHashing as an efficient approximation thereof (also sketched below).
  • A mechanism to select the most-similar pairs of subtrees from two lists. We currently use a k-d tree for nearest-neighbour lookups based on the random-walk similarity, with the problems that entails, plus some biases that have been hard to eliminate. LSH comes into play here, grouping candidate pairs together into bands.
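
A sketch of the first two pieces, with all names hypothetical and `hashWithSalt` standing in for a family of independent hash functions:

```haskell
import Data.Hashable (hashWithSalt)
import Data.List (tails)

-- k-shingles of a label: every contiguous substring of length k, as a
-- crude similarity-preserving decomposition of leaf text.
shingles :: Int -> String -> [String]
shingles k s = [ take k t | t <- tails s, length t >= k ]

-- MinHash signature: for each of n (salted) hash functions, keep the
-- minimum hash over the set. Assumes a non-empty input list.
minHashSignature :: Int -> [String] -> [Int]
minHashSignature n xs = [ minimum (map (hashWithSalt salt) xs) | salt <- [1 .. n] ]

-- The fraction of positions on which two signatures agree estimates the
-- Jaccard similarity of the underlying sets.
estimatedJaccard :: [Int] -> [Int] -> Double
estimatedJaccard a b =
  fromIntegral (length (filter id (zipWith (==) a b))) / fromIntegral (length a)
```

The same machinery applies to either problem: shingles of a label for the first bullet, or p,q-grams of a subtree for the second.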

So where I’ve been focusing on alternatives for computing the d-dimensional unit vectors RWS entails (specifically, how to compute them without hashing or the random number generator), LSH would eliminate the need for those unit vectors altogether.
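
A sketch of that banding step, assuming MinHash signatures like the ones above; any two keys landing in the same bucket become a candidate pair, and nothing else is ever compared:

```haskell
import Data.Hashable (hash)
import qualified Data.Map.Strict as Map

-- Split a signature into bands of r rows each (the last band may be short
-- if r doesn't divide the signature length).
bands :: Int -> [Int] -> [[Int]]
bands _ []  = []
bands r sig = take r sig : bands r (drop r sig)

-- Bucket items by (band index, hash of band). Items sharing any bucket are
-- candidate pairs; this replaces the k-d tree nearest-neighbour search over
-- unit vectors. More, shorter bands admit more candidates; fewer, longer
-- bands are stricter.
candidateBuckets :: Int -> [(k, [Int])] -> Map.Map (Int, Int) [k]
candidateBuckets r items = Map.fromListWith (++)
  [ ((i, hash band), [key])
  | (key, sig) <- items
  , (i, band)  <- zip [0 ..] (bands r sig)
  ]
```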

Notably, these sub-problems are orthogonal. I’d suggest we tackle MinHashing and LSH before shingling: improvements to leaf similarity will be most noticeable with improvements to small-branch similarity already in place.