sararselitsky / FastPG

Fast phenograph, CyTOF

Replacing nmslibR and Python with RcppHNSW #9

Closed tom-b closed 3 years ago

tom-b commented 3 years ago

RcppHNSW is an R wrapper around the HNSW C++ library. This pull request replaces the first step of FastPG, so that the KNN index is built with RcppHNSW instead of the existing nmslibR wrapper, which required Python. Test results against the gold-standard data (Levine13) look good. Results for 5 iterations, each sampling 80,000 cells:

[1] "80000"

Precision: 0.919297971212903 Recall: 0.8797375

Precision: 0.924845246896492 Recall: 0.890525

Precision: 0.925546942611531 Recall: 0.931975

Precision: 0.916563951809679 Recall: 0.896375

Precision: 0.917757197750322 Recall: 0.871525
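
To make the five runs above easier to compare at a glance, here is a minimal Python sketch that averages the reported values. The numbers are copied from the output above; the variable names are illustrative and not part of FastPG, and the script does not recompute the metrics from the Levine13 data.

```python
from statistics import mean

# Precision and recall copied verbatim from the five runs reported above.
precision = [
    0.919297971212903,
    0.924845246896492,
    0.925546942611531,
    0.916563951809679,
    0.917757197750322,
]
recall = [0.8797375, 0.890525, 0.931975, 0.896375, 0.871525]

# Arithmetic mean over the five iterations.
mean_precision = mean(precision)
mean_recall = mean(recall)

print(f"Precision: mean {mean_precision:.4f}, "
      f"range {min(precision):.4f} to {max(precision):.4f}")
print(f"Recall:    mean {mean_recall:.4f}, "
      f"range {min(recall):.4f} to {max(recall):.4f}")
```

Across these runs the mean precision is about 0.921 and the mean recall about 0.894, with recall varying more between samples than precision.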

On a 10-core Intel machine with 64 GB of memory, clustering 1.1 million cells (datamatrix_LungCancer_multiATOM_N1113369.txt from https://data.mendeley.com/datasets/nnbfwjvmvw/draft?a=dae895d4-25cd-4bdf-b3e4-57dd31c11e37) took 1.5 minutes. Clustering a 10-million-cell oversample of that dataset on the same machine took 11.2 minutes.
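
As a rough check on how run time scales with input size, the sketch below turns the two timings above into throughput figures. It assumes the first run used exactly 1,113,369 cells (taken from the file name) and that "10 million" means exactly 10,000,000 cells; the labels are illustrative.

```python
# Timings reported above: label -> (number of cells, wall-clock minutes).
# Cell counts are assumptions: 1,113,369 comes from the file name,
# and 10,000,000 is a literal reading of "10 million".
runs = {
    "original": (1_113_369, 1.5),
    "oversampled": (10_000_000, 11.2),
}

# Cells clustered per minute for each run.
throughput = {name: cells / minutes for name, (cells, minutes) in runs.items()}

for name, rate in throughput.items():
    print(f"{name}: {rate:,.0f} cells per minute")

# Compare growth in run time with growth in input size.
# A value near 1 would suggest roughly linear scaling over this range.
size_ratio = runs["oversampled"][0] / runs["original"][0]
time_ratio = runs["oversampled"][1] / runs["original"][1]
scaling = time_ratio / size_ratio

print(f"size x{size_ratio:.2f}, time x{time_ratio:.2f}, ratio {scaling:.2f}")
```

On these two data points, run time grows somewhat more slowly than input size (a ratio of roughly 0.83), although two measurements are too few to characterize scaling behavior in general.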

I have also tried to update all the documentation files, with the exception of the Docker materials. Those may need closer review to make sure the RcppHNSW dependency is properly included in the fork.