trendmicro / tlsh

meaning of extra_constant value in search tlshCluster #113

Open TheNha opened 2 years ago

TheNha commented 2 years ago

Hi. I'm reading the tlshCluster code that you published recently. I don't understand the extra_constant value used in the function VPTSearch in the file hac_lib.py. Can you help explain this value? Thank you very much.

Querela commented 1 year ago

Might be related to #130? I was looking into vantage point trees and trying to understand how they work. [1] [2] [3] When testing, I found that the tree sometimes did not return the nearest object if I lowered extra_constant. If I increased it instead, the search performed more comparisons. In my understanding, it functions as a kind of error margin, and 20 might be an experimentally chosen value? It could be related to the text length difference penalty that is also included in the distance score.

[1] http://stevehanov.ca/blog/index.php?id=130
[2] https://fribbels.github.io/vptree/writeup
[3] http://pnylab.com/papers/vptree/main.html

Here is some example code: https://gist.github.com/Querela/d34d76bf090863418168527bc5aba3ff (NOTE: I cleaned it up, since it contained a lot of unrelated code, but I did not run it again afterwards. It might be missing some imports; just write me if so. You can simply try out different values if you run it in an interactive shell.)
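
To make the "error margin" idea concrete, here is a minimal, self-contained sketch of a vantage point tree search with a tunable margin. It is not the code from hac_lib.py or from the gist: the names `build_vp_tree`, `search`, `nearest` and the `slack` parameter are hypothetical, `slack` only plays the role I assume extra_constant plays, and a plain absolute difference stands in for the TLSH distance.

```python
def abs_diff(a, b):
    """Stand-in distance (a true metric); not the TLSH distance."""
    return abs(a - b)


def build_vp_tree(items, dist):
    """Build a vantage point tree as nested dicts (None for an empty subtree)."""
    if not items:
        return None
    vp, rest = items[0], items[1:]
    if not rest:
        return {"vp": vp, "mu": 0, "inner": None, "outer": None}
    distances = sorted(dist(vp, x) for x in rest)
    mu = distances[len(distances) // 2]  # median distance to the vantage point
    inner = [x for x in rest if dist(vp, x) <= mu]
    outer = [x for x in rest if dist(vp, x) > mu]
    return {"vp": vp, "mu": mu,
            "inner": build_vp_tree(inner, dist),
            "outer": build_vp_tree(outer, dist)}


def search(node, query, dist, slack, best):
    """Nearest-neighbour search; `slack` (hypothetical) widens the pruning test."""
    if node is None:
        return
    d = dist(query, node["vp"])
    best["comparisons"] += 1
    if d < best["dist"]:
        best["dist"], best["item"] = d, node["vp"]
    if d <= node["mu"]:
        search(node["inner"], query, dist, slack, best)
        # For a true metric, the outer side can only hold a closer item if
        # d + best >= mu. A positive slack makes this test more permissive.
        if d + best["dist"] + slack >= node["mu"]:
            search(node["outer"], query, dist, slack, best)
    else:
        search(node["outer"], query, dist, slack, best)
        # Mirror case: the inner side is only worth visiting if d - best <= mu.
        if d - best["dist"] - slack <= node["mu"]:
            search(node["inner"], query, dist, slack, best)


def nearest(tree, query, dist, slack=0):
    """Return the best match plus the number of distance computations."""
    best = {"dist": float("inf"), "item": None, "comparisons": 0}
    search(tree, query, dist, slack, best)
    return best


points = [3, 18, 27, 41, 56, 64, 79, 92]
tree = build_vp_tree(points, abs_diff)
for slack in (0, 20):
    result = nearest(tree, 60, abs_diff, slack)
    print(slack, result["item"], result["dist"], result["comparisons"])
```

With a distance that obeys the triangle inequality, `slack=0` is already exact, and a very large `slack` disables pruning so every item gets compared. If the distance does not strictly obey the triangle inequality, the pruning tests can discard the subtree that holds the true nearest item, and a positive margin trades extra comparisons for fewer such misses. That would match the behaviour described above, but it is only my reading of it.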