psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0

SW alignment slow down in newer version #207

Closed krdav closed 8 years ago

krdav commented 8 years ago

First, huge thanks to Duncan for implementing VL in partis. This really takes the usefulness of partis to a new level. Bravo!

Second (now comes the complaining...), I was running a comparison between my old version of partis (installed back in the beginning of May) and the new partis that can handle VL data (installed three days ago). I am not using the docker version but compile partis myself, which is just as easy thanks to your build script - good job btw.

It turns out that the SW alignment step takes much longer now than it did before. I also notice that the bcrham step is much faster in the new version than in the old one, but not by enough to make up for the time lost in the SW step. Attached are my stdout logs from partis. Both runs were given the same resources (a full compute node with no interference from other users), and the input data is exactly the same, although each run used its own copy to keep the runs completely separate:

Here is the old version: partis_old.log.txt And here the new: partis_new.log.txt

Then I made another interesting observation, which is that the VL run is much, much faster than the VH run with more or less the same number of sequences. In this test I used the same settings as above, but as input I used VL fasta sequences and the --chain l flag: partis_new_VL.log.txt

So I wonder whether the VL run is faster because the VL genes are less complex, which makes the SW alignment easier, or whether there is another story behind it.

krdav commented 8 years ago

Also, I am curious to know if you have a unit test for partis? E.g. you could run partitioning on a small sample of 1000 sequences, which would test swalign, bcrham, and clustering. That of course requires that partitioning is deterministic, which I assume it is?

psathyrella commented 8 years ago

There was definitely a big (accidental) performance hit in sw. We plugged in a Java-to-C rewrite and I apparently didn't benchmark it enough beforehand. I would've emailed but I was hoping I could fix it before I left last week. I'll push a fix today, though.

There's a fair amount of testing code. It's mostly in test.py. This runs with the --quick option (i.e. just annotation performance) as part of the build process. The script ends up being so complicated mostly because we have to test how the various annotation and clustering algorithms perform on old reference simulation, then test that simulation creation is ok, then that parameter inference works ok on new simulation, and then that annotation and clustering are ok on this new stuff. The number of sequences we run on is a tradeoff between testing accuracy and making it fast enough that I actually run it regularly (every few commits, depending). It takes about ten minutes to run the full thing (i.e. no options, ./test/test.py) these days, and there are also options to only run on reference results and what-have-you.

I added a check of the elapsed time for each step a few months ago, but the trouble is that different steps take different fractions of the total run time depending on data set size (and a lot of other factors). In this case, s-w isn't a significant fraction on the testing data sets (and also, when each step is only taking 10-100 seconds, there's a lot of variance). Another factor is that the testing framework uses a reduced germline set (~30 total genes), which is critical for getting the hmm writing fast enough to make the tests useful -- but at least when I noticed the s-w slowdown last week, the ~factor of seven slowdown was only present with large/full germline sets.
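To make the variance point concrete, here is a minimal sketch of what a per-step timing check could look like. It is not the code in test.py: the function name `check_step_times`, the step names, the thresholds, and all timings below are hypothetical, chosen only for illustration.

```python
def check_step_times(reference, current, rel_tol=0.5, min_seconds=10.0):
    """Return the steps whose run time grew by more than rel_tol (as a fraction).

    reference, current: dicts mapping step name -> elapsed seconds (hypothetical layout).
    Steps whose reference time is below min_seconds are skipped, since very
    short steps are dominated by run-to-run variance.
    """
    slow_steps = []
    for step, ref_time in reference.items():
        cur_time = current.get(step)
        if cur_time is None or ref_time < min_seconds:
            continue  # step missing in the new run, or too short to judge
        if (cur_time - ref_time) / ref_time > rel_tol:
            slow_steps.append(step)
    return slow_steps


# Made-up timings, in seconds, for three hypothetical steps.
reference_times = {"sw": 20.0, "hmm": 100.0, "cache": 2.0}
current_times = {"sw": 140.0, "hmm": 60.0, "cache": 9.0}
print(check_step_times(reference_times, current_times))
```

A check like this only catches a slowdown if the test data actually exercises it, so a regression that appears only with a full germline set would still slip through on a reduced one.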

Now as to VL being faster -- I would expect this, because the hmm run time is roughly proportional to (number of v + d + j genes) * (k_v * k_d space volume over which we sum), where the k space volume is described in the paper, but it's basically how much we vary the position of the v-d and d-j boundaries. From looking at the imgt fastas, it seems like kappa and lambda have about half or a third as many total genes, and there's no k_d variation, so the k space volume is also much smaller.
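As a rough illustration of that proportionality (a sketch only, not how partis computes anything), the function below multiplies a gene count by the k-space volume. The name `relative_hmm_cost`, the gene counts, and the window widths are made-up placeholders, not values taken from imgt or from the logs above.

```python
def relative_hmm_cost(n_genes, k_v_width, k_d_width=1):
    """Relative hmm cost: (total v + d + j genes) * (k_v width * k_d width).

    k_d_width defaults to 1 to model a light chain, where there is no d gene
    and therefore no k_d variation.
    """
    return n_genes * k_v_width * k_d_width


# Placeholder numbers only: heavy chain with twice as many genes and a k_d window.
heavy = relative_hmm_cost(n_genes=300, k_v_width=10, k_d_width=10)
light = relative_hmm_cost(n_genes=150, k_v_width=10)
print(heavy / light)  # ratio of relative costs under these assumed inputs
```

Under these assumed inputs, most of the difference comes from the missing k_d dimension rather than from the smaller gene count.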