malwaredb / malwaredb-rs

MalwareDB: bookkeeping for malware, goodware, and unknown files with relationship discovery
https://malwaredb.net/
Apache License 2.0

Performance: pg search is slow #165

Open ghost opened 12 months ago

ghost commented 12 months ago

Hi,

MalwareDB is great. However, when we tested file search with up to 10M files, a TLSH search took 10 s. I found that the TLSH project has already published an indexing algorithm (https://tlsh.org/papers.html).

Is there a milestone for a better search index? Thanks!

rjzak commented 12 months ago

Thanks for the feedback! I've mostly been focused on getting features in place and working on usability issues. But performance is definitely on my mind, and it's not yet as fast as I'd like it to be.

Are you using SQLite or Postgres? Postgres should be faster since it uses a C extension, whereas SQLite has to load all the data into memory and then search.

Edit: I missed the "pg" part of the title when I first looked. I'll investigate.

rjzak commented 12 months ago

@maxmeng-oss How was the performance with 10M files with the other Postgres extensions (lzjd, ssdeep)?

ghost commented 12 months ago

I haven't tested LZJD or SSDEEP yet. Since B-tree, Hash, and SP-GiST indexes all show the same linear growth with TLSH, my educated guess is that the choice of hash algorithm doesn't matter: the search will still be O(N) over all the data.
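
To illustrate the O(N) point, here is a minimal sketch of how a distance-aware index can skip most comparisons where a plain scan cannot. It is not MalwareDB's code and not the algorithm from the TLSH papers. It assumes Hamming distance over random 64-bit integers as a stand-in for TLSH distance, and it uses a BK-tree, which is my choice for illustration. The pruning relies on the triangle inequality, so whether it carries over to real TLSH scores would need to be checked.

```python
import random


def hamming(a: int, b: int) -> int:
    """Count differing bits. A true metric, used here as a stand-in for TLSH distance."""
    return bin(a ^ b).count("1")


class BKTree:
    """BK-tree over an integer-valued metric. Each node is [value, {edge_distance: child}]."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None
        self.calls = 0  # distance evaluations made by the last search

    def add(self, value):
        if self.root is None:
            self.root = [value, {}]
            return
        node = self.root
        while True:
            d = self.distance(value, node[0])
            if d == 0:
                return  # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = [value, {}]
                return
            node = child

    def search(self, query, radius):
        """Return every stored value within `radius` of `query`."""
        self.calls = 0
        hits = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = self.distance(query, value)
            self.calls += 1
            if d <= radius:
                hits.append(value)
            # Triangle inequality: only subtrees whose edge distance lies in
            # [d - radius, d + radius] can contain a match.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return hits


def linear_search(values, query, radius):
    """Baseline: compare the query against every stored value (O(N))."""
    return [v for v in values if hamming(query, v) <= radius]


# Hypothetical data: random 64-bit values standing in for file hashes.
rng = random.Random(0)
values = list({rng.getrandbits(64) for _ in range(5000)})

tree = BKTree(hamming)
for v in values:
    tree.add(v)

query = values[0] ^ 0b101  # two bits away from a stored value
bk_hits = tree.search(query, radius=2)
scan_hits = linear_search(values, query, radius=2)

print(f"linear scan: {len(values)} comparisons, BK-tree: {tree.calls} comparisons")
```

Both searches return the same matches, but the tree only evaluates the distance for nodes it cannot rule out. A B-tree or hash index cannot do this because it only answers equality and ordering questions, not "within distance r" questions.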

rjzak commented 11 months ago

This might help: https://github.com/jinyyu/tlsh_gist There's also this, but I don't understand it: https://zhuanlan.zhihu.com/p/497732848

Related Postgres docs: