trendmicro / tlsh


HAC-T clustering is very slow with larger data size, for 500K tlsh list it took ~6 hours #124

Open SrikanthPusarla opened 2 years ago

SrikanthPusarla commented 2 years ago

Hi,

HAC-T clustering of a list of 500K TLSH digests took about 6 hours, but the paper claims it took roughly 2 hours 10 minutes for 10 million samples ("HAC-T and Fast Search for Similarity in Security", https://tlsh.org/papersDir/COINS_2020_camera_ready.pdf).

Could you please explain how you achieved this faster clustering? Does HAC-T support multithreading?

My experiment:

Data: 500K TLSH input
Command: python hac-t.py -f -o -cdist 90 -showtime 1 -showcl 1
Machine: 16 cores, 122 GB RAM
Python: 3.8.8

Thanks