soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0
515 stars 128 forks source link

Use std::sort instead of QSortInt #307

Open fuji8 opened 2 years ago

fuji8 commented 2 years ago

I profiled the T1050 with the parameters run by AlphaFold. (Only -cpu is changed). I used Score-P as the profiling tool and got the following results.

pr

From this image, we can see that QSortInt in mergeHitsToQuery is taking a long time. With fast enough storage, my hhblits for this condition is about 2100sec, and QSortInt accounts for about 40%.

Instead of this QsortInt, use std::sort.

I ran hhblits installed by conda and using std::sort under the same conditions as before. In order to avoid I/O effects, I analyze the difference in execution time between the logs that contain this change, instead of the overall execution time. (From https://github.com/soedinglab/hh-suite/blob/ac765987bd0daceb093a41a8e2850887dad5835f/src/hhblits.cpp#L1028-L1030 to https://github.com/soedinglab/hh-suite/blob/b100bb05fe6b920a7cbe58256eef72954d9088bf/src/hhhmm.cpp#L2337)

  conda use std::sort
iteration 1 1232(sec) 477(sec)
iteration 2 631(sec) 374(sec)
iteration 3 306(sec) 232(sec)

This reduced the execution time. I also ran it using the parallelization policy, but the results were not significantly different from std::sort.

This change is due to the different stability of sort, so the execution results may not truly match.

milot-mirdita commented 2 years ago

Cool, thank you!

We have implemented a similar fix in MMseqs2's version of the same code, but haven’t backported it: https://github.com/soedinglab/MMseqs2/blob/d89fcecf9911a99c45ed81c1c0e5054743debc64/src/alignment/MsaFilter.cpp#L212

Could you repeat the benchmark with a stable sort?

fuji8 commented 2 years ago

Thank you for the reply.

I changed the sort to stable_sort and ran it 3 times on hpc in the following environment.

  1 2 3
iteration 1 488(sec) 484(sec) 484(sec)
iteration 2 374(sec) 371(sec) 376(sec)
iteration 3 223(sec) 225(sec) 218(sec)

Because of the large memory, the computational complexity is probably Nlog(N).

martin-steinegger commented 2 years ago

@fuji8 This looks great! Thank you for the PR. Would it be possible to avoid the lambda expression in the sort?

fuji8 commented 2 years ago

I apologize for the delay in response.

I rewrote the code to be almost equivalent without using the lambda expression. I ran it only once, just to be sure.   no lambda
iteration 1 500(sec)
iteration 2 385(sec)
iteration 3 232(sec)