trendmicro / tlsh


Scanning big filesets, pre-pruning by filesize #82

Closed bleuge closed 3 years ago

bleuge commented 4 years ago

Hi, I have a question, maybe not entirely related to TLSH itself. When scanning really big filesets against a list of TLSH hashes, looking for small differences: I know the file size is encoded in the first 3 bytes of the hash, but is there any rule I could use to skip files whose sizes fall outside certain limits? I could store file sizes separately if needed. The idea is: if a file has size X and I already have its TLSH, and I want to compare it against another file, then when the size difference exceeds a certain ratio I skip computing the new TLSH, on the assumption that files with very different sizes cannot be close matches. Have you tested this? Is it worth the effort? I think this is not directly about TLSH itself but about how to use it.
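The pruning idea described above can be sketched in a few lines. This is a hypothetical illustration, not part of the TLSH library: the helper names and the `max_ratio` threshold are assumptions, and the expensive TLSH comparison (e.g. `tlsh.diff` from the py-tlsh package) would only run on pairs that survive the size filter.

```python
# Sketch of file-size pre-pruning before TLSH comparison.
# sizes_comparable and candidate_pairs are illustrative names, not TLSH API.

def sizes_comparable(size_a: int, size_b: int, max_ratio: float = 2.0) -> bool:
    """Return True if the larger size is within max_ratio of the smaller,
    i.e. the pair is plausibly similar enough to warrant a TLSH diff."""
    if size_a <= 0 or size_b <= 0:
        return False
    small, large = sorted((size_a, size_b))
    return large <= small * max_ratio

def candidate_pairs(records, max_ratio: float = 2.0):
    """records: list of (name, size) tuples.
    Return the pairs of names whose sizes pass the ratio filter;
    only these would go on to a full TLSH distance computation."""
    out = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if sizes_comparable(records[i][1], records[j][1], max_ratio):
                out.append((records[i][0], records[j][0]))
    return out
```

With `max_ratio = 2.0`, a 100 KB file would still be compared against a 180 KB file but not against a 500 KB one; tightening the ratio prunes more aggressively at the risk of missing matches between files that differ mainly by padding or appended data.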

jonjoliver commented 4 years ago

Good question. Yes - we should be coming out with more details later. In the meantime, I am happy to chat offline at jon_oliver@trendmicro.com

abgoldberg commented 3 years ago

Is there any further update on this, or other ways to prune the search space to find matches against a large data set?

jonjoliver commented 3 years ago

Hi @bleuge , @abgoldberg and others,

I have written up a technical overview of the issues involved in fast search, and then of how to use fast search to do scalable clustering: http://tlsh.org/papers.html The technical overview points to 2 conference papers that discuss the issues.

Cheers jono