qiyunlab / binarena

BinaRena: Interactive Visualization and Binning of Metagenomic Contigs
BSD 3-Clause "New" or "Revised" License

Silhouette calculation on large dataset #84

Closed qiyunzhu closed 2 years ago

qiyunzhu commented 2 years ago

This solution calculates pairwise distances on the fly rather than pre-calculating the distance matrix and storing it in memory. It is therefore memory-efficient, although about twice as slow (because each pair is calculated twice). This trade-off is necessary for handling very large datasets.
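For illustration only, here is a minimal Python sketch of the general idea, not the code in this pull request. It assumes Euclidean distance, and the function name `silhouette_on_the_fly` is hypothetical. For each point it accumulates only per-cluster distance sums, so extra memory grows with the number of clusters rather than with the square of the number of points, and every pair ends up being computed twice.

```python
import math


def silhouette_on_the_fly(points, labels):
    """Return one silhouette coefficient per point.

    Distances are recomputed for every point instead of being read
    from a stored n x n matrix (assumption: Euclidean distance).
    """
    # Count the members of each cluster once, up front.
    sizes = {}
    for lab in labels:
        sizes[lab] = sizes.get(lab, 0) + 1
    if len(sizes) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, p in enumerate(points):
        own = labels[i]
        # By convention, a point alone in its cluster scores 0.
        if sizes[own] == 1:
            scores.append(0.0)
            continue

        # Sum of distances from point i to each cluster's members.
        # This dict is the only per-point working memory: O(#clusters).
        sums = dict.fromkeys(sizes, 0.0)
        for j, q in enumerate(points):
            if j != i:
                sums[labels[j]] += math.dist(p, q)

        # a: mean distance to the other members of the point's own cluster.
        a = sums[own] / (sizes[own] - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sums[lab] / sizes[lab] for lab in sizes if lab != own)

        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores
```

A call might look like `silhouette_on_the_fly([(0, 0), (0, 1), (10, 0), (10, 1)], [0, 0, 1, 1])`, which returns four coefficients, one per input point. The running time is quadratic in the number of points, which is consistent with the long run time reported for large inputs.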

In the following example, calculating Silhouette coefficients of ~250k data points assigned to ~250 clusters took about 20 min.

@pavia27 @nujinuji

[Image: Screenshot from 2022-06-02 10-58-27]