kevemueller / kTLSH

A fresh look at implementing TLSH in Java.
Apache License 2.0

Factors to consider in deciding which algorithm variant to use #1

Open rpygithub opened 4 years ago

rpygithub commented 4 years ago

Before I begin, I would like to say that this is a welcome and overdue implementation of the TLSH algorithm that respects Java's conventions. Thank you for putting work into writing an implementation that is more efficient and well-documented.

I have but one suggestion regarding the documentation: I think it would be worth describing in general terms what the benefits and drawbacks are of the different window sizes and digest lengths in the context of TLSH. Does a sliding window value larger than 5 offer greater accuracy when comparing hashes for similarity? Should the choice be influenced by the size of files in a dataset?

These questions sprang to mind as I reviewed the table. I am not fully familiar with the theory behind TLSH, so a paragraph about it would offer valuable insight.

kevemueller commented 4 years ago

Hi broindhash, this is merely an implementation of the algorithm, so I would like to refer you to the creators of the algorithm for details on its applicability and the use of the different variants. The best starting point is their paper on the algorithm itself: https://github.com/trendmicro/tlsh/blob/master/TLSH_CTC_final.pdf There you will see how the authors used statistical analysis to compare the efficiency of the variants. I have chosen to expose all the variants to make it simple to compare them in a similar fashion on your own data.

Summarizing: the variant should be chosen based on the actual properties of the messages to be hashed with TLSH. You should perform your own analysis on actual data to determine the best variant, or go with the authors' choice of 128-5-1 as a good generic fit.
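
As a rough illustration of what such an analysis could look like, here is a minimal sketch of a comparison harness. Everything in it is an assumption made for the example: the `Variant` interface, the `LabeledPair` class, and the byte-histogram digest are hypothetical and are not kTLSH API, and the toy digest is not TLSH. The idea is only that you wrap each variant behind the same small interface, feed it pairs of messages you already know to be similar or dissimilar, and see how well a distance threshold separates the two groups.

```java
import java.util.List;

class VariantComparison {

    /** Hypothetical adapter for one hash variant; wire it to the real library yourself. */
    interface Variant<D> {
        D digest(byte[] data);
        int distance(D a, D b);
    }

    /** Two messages plus the ground truth: should they count as similar? */
    static final class LabeledPair {
        final byte[] a;
        final byte[] b;
        final boolean similar;

        LabeledPair(byte[] a, byte[] b, boolean similar) {
            this.a = a;
            this.b = b;
            this.similar = similar;
        }
    }

    /** Fraction of pairs classified correctly when "distance <= threshold" means similar. */
    static <D> double accuracy(Variant<D> v, List<LabeledPair> pairs, int threshold) {
        int correct = 0;
        for (LabeledPair p : pairs) {
            int d = v.distance(v.digest(p.a), v.digest(p.b));
            if ((d <= threshold) == p.similar) {
                correct++;
            }
        }
        return (double) correct / pairs.size();
    }

    /** Toy stand-in digest (NOT TLSH): counts how many bytes fall into each bucket. */
    static int[] histogram(byte[] data, int buckets) {
        int[] h = new int[buckets];
        for (byte x : data) {
            h[(x & 0xFF) % buckets]++;
        }
        return h;
    }

    /** L1 distance between two histograms of equal length. */
    static int l1(int[] x, int[] y) {
        int d = 0;
        for (int i = 0; i < x.length; i++) {
            d += Math.abs(x[i] - y[i]);
        }
        return d;
    }

    /** Builds a toy variant whose only parameter is the bucket count. */
    static Variant<int[]> toyVariant(final int buckets) {
        return new Variant<int[]>() {
            @Override public int[] digest(byte[] data) { return histogram(data, buckets); }
            @Override public int distance(int[] a, int[] b) { return l1(a, b); }
        };
    }

    /** Tiny made-up data set; replace it with labeled pairs drawn from your own messages. */
    static List<LabeledPair> samplePairs() {
        return List.of(
                new LabeledPair(new byte[] {0, 1, 2, 3}, new byte[] {0, 1, 2, 2}, true),
                new LabeledPair(new byte[] {0, 0, 0, 0}, new byte[] {3, 3, 3, 3}, false));
    }

    public static void main(String[] args) {
        List<LabeledPair> pairs = samplePairs();
        // Sweep variants and thresholds, printing one accuracy figure per combination.
        for (int buckets : new int[] {4, 16}) {
            for (int threshold : new int[] {1, 4, 10}) {
                System.out.printf("buckets=%d threshold=%d accuracy=%.2f%n",
                        buckets, threshold, accuracy(toyVariant(buckets), pairs, threshold));
            }
        }
    }
}
```

To use this on real data, replace `toyVariant` with adapters around the variants you want to compare and `samplePairs` with labeled pairs from your own corpus. A single accuracy number per threshold is a simplification; the paper's analysis looks at detection and false positive rates separately, which you may prefer for imbalanced data.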