trendmicro / tlsh

Other
726 stars 135 forks source link

TLSH distance metric appears to violate triangle inequality #130

Closed dmbaggett closed 6 months ago

dmbaggett commented 1 year ago

Using v4.11.2 of tlsh (last git commit 74efd09a47aecb12f4457b542ebc21d098f901b1) I'm seeing cases where three TLSH hashes appear to violate the triangle inequality, leading to inconsistent search results from metric tree data structure like VP trees. An example:

x = 'T11B735492E6411D2A067795F5D1236BBCE7E94041AB0245F0B0AE35379B8FCA14AB73FC' y = 'T12F836353E6421D2E493790F9C2329FA8E3D64156A71245E4B0AE35375B8FCA14AB33FC' z = 'T1879374A3A6416D2E493790B9C2729FA8E3D64116971241E4F4AE34371B8FCD14EB72FC'

d(x,z) (85) > d(x,y) (38) + d(y,z) (24)

Here is a Python script that reproduces this. I did verify that hamming distance between the hex digests for these does adhere to the triangle inequality. (I've attached it as a .txt file since Github doesn't seem to allow attachment of .py files.)

triangle_inequality_violation.txt

jonjoliver commented 6 months ago

Hi dmbaggett

Yes - the tlsh distance function does not obey the triangle inequality. Tlsh distance was defined back in 2013 before quite a lot of hardware support came for metrics which satisfy the triangle inequality. The discrepancies were introduced due to experiments in adversarial environments. These experiments are described in http://tlsh.org/papersDir/Design_TLSH_2021.pdf

We could create a new security LSH which satisfies the triangle inequality to make search more efficient, but that would be a new hash.

Cheers jono