scharch / SONAR

Software for Ontogenic aNalysis of Antibody Repertoires
GNU General Public License v3.0

Long processing time on 1.5_single_cell_statistics.py #19

Open asjureka opened 1 year ago

asjureka commented 1 year ago

Hi,

I was recently running the SONAR workflow on our HPC cluster and noticed that 1.5_single_cell_statistics.py takes quite a long time to finish. Looking through the script, it doesn't appear to be threaded (or not obviously; please correct me if I'm wrong). Would it be possible to add threading so that the script uses the available processing power more efficiently?

Thank you!

scharch commented 1 year ago

Yes, this is a known issue. The main cause is large 10x datasets in which some droplets contain tens of (usually light) chains, which triggers an inordinate number of alignments when trying to collapse them. Despite its name, the function involved is actually extremely slow: it writes to disk, calls muscle, and then reads the output.

I have some ideas about how to fix this, which I hope to include as part of a major overhaul/refactor of module 1 planned for the next major version of SONAR, but I don't have any sort of timeline for release yet.

In the meantime, you are welcome to submit a pull request to add threading, though I'm not sure how much of a speedup you can get that way. A better approach would probably be to filter your rearrangements TSV before calling 1.5 to remove "cells" with more than 5-10 chains present; they are likely to be background noise or unsalvageable anyway.
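
For anyone who wants to try the filtering route, here is a minimal sketch of what such a pre-filter could look like. It is not part of SONAR. It assumes the rearrangements TSV has one row per chain and a `cell_id` column identifying the droplet (as in the AIRR rearrangement schema); the function name, the default threshold, and the file names in the usage note are hypothetical, so adjust them to your data.

```python
import csv
from collections import Counter


def filter_crowded_cells(in_path, out_path, max_chains=10, cell_column="cell_id"):
    """Copy a rearrangements TSV, dropping cells with more than max_chains rows.

    Assumes one row per chain and a cell identifier in `cell_column`.
    Rows with an empty cell identifier are passed through unchanged.
    Returns the number of cells that were dropped.
    """
    # First pass: count rows (chains, by assumption) for each cell.
    with open(in_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        counts = Counter(row[cell_column] for row in reader if row[cell_column])

    # Cells with more rows than the threshold are treated as noise.
    crowded = {cell for cell, n in counts.items() if n > max_chains}

    # Second pass: write every row whose cell is not in the crowded set.
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(
            dst, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in reader:
            if row[cell_column] not in crowded:
                writer.writerow(row)

    return len(crowded)
```

Calling `filter_crowded_cells("rearrangements.tsv", "rearrangements_filtered.tsv", max_chains=10)` would write the filtered copy and report how many cells were removed. The sketch counts rows per cell rather than distinct sequences, so a cell with many duplicate rows would also be dropped; count unique sequences instead if that matters for your data.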