kmayerb / tcrdist3

flexible CDR based distance metrics
MIT License

Provide an alternative bkgd_cntl_nn function (bkgd_cntl_nn3), with reduced memory usage and increased speed #79

Open sgp79 opened 1 year ago

sgp79 commented 1 year ago

See issue #77

The new function, bkgd_cntl_nn3, greatly reduces memory usage and increases speed compared to bkgd_cntl_nn2. Memory usage is reduced by summing regex hits (and calculating weighted sums) on the fly, rather than after all hits have been found. Speed is increased by pre-compiling the regexes.

bkgd_cntl_nn3 calls the new _multi_regex_weighted function rather than _multi_regex. The new function takes a compiled regex, the target seqs, and optionally weights, and returns a tuple of (sum of hits, weighted sum); the weighted sum is None if no weights were supplied.
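For readers who have not opened the diff, here is a minimal sketch of the idea described above: count hits and accumulate the weighted sum in one pass, without storing a per-sequence hit list. It is not the code from this PR. The name `count_hits_weighted`, the use of `pattern.search`, and the toy sequences are assumptions made for illustration only.

```python
import re
from typing import Iterable, Optional, Pattern, Tuple


def count_hits_weighted(
    pattern: Pattern[str],
    seqs: Iterable[str],
    weights: Optional[Iterable[float]] = None,
) -> Tuple[int, Optional[float]]:
    """Hypothetical stand-in for the behaviour described for _multi_regex_weighted.

    Returns (number of matching seqs, weighted sum of matches), where the
    weighted sum is None when no weights are given.
    """
    hits = 0
    if weights is None:
        # Unweighted case: only a running integer count is kept.
        for seq in seqs:
            if pattern.search(seq):  # assumption: search() semantics
                hits += 1
        return hits, None

    weighted = 0.0
    # Weighted case: running count and running weighted sum, no hit list.
    for seq, weight in zip(seqs, weights):
        if pattern.search(seq):
            hits += 1
            weighted += weight
    return hits, weighted


# The regex is compiled once, outside any loop over background sequences.
example_pattern = re.compile(r"^CAS[ST]L")
example_seqs = ["CASSLGQ", "CASTLAA", "CAWSV"]
example_weights = [0.5, 2.0, 1.0]
example_result = count_hits_weighted(example_pattern, example_seqs, example_weights)
```

Because only two scalars are kept per regex, memory use in this sketch does not grow with the number of background sequences.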

Tests are provided to confirm that bkgd_cntl_nn2 and bkgd_cntl_nn3 produce the same results (in test_bkgd_cntl_nn3_vs_bkgd_cntl_nn2.py).
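The kind of equivalence check described here can be sketched on toy data as follows. This is not the contents of test_bkgd_cntl_nn3_vs_bkgd_cntl_nn2.py. Both helper functions, the patterns, and the sequences are hypothetical and only contrast a "collect all hits, then sum" strategy with a running-sum strategy.

```python
import math
import re


def hits_collect_then_sum(pattern, seqs, weights):
    # Toy "old" strategy: materialise a 0/1 flag per sequence, then sum.
    flags = [1 if pattern.search(seq) else 0 for seq in seqs]
    return sum(flags), sum(f * w for f, w in zip(flags, weights))


def hits_running_sum(pattern, seqs, weights):
    # Toy "new" strategy: keep only running totals.
    count, total = 0, 0.0
    for seq, weight in zip(seqs, weights):
        if pattern.search(seq):
            count += 1
            total += weight
    return count, total


def check_equivalent(patterns, seqs, weights):
    # Assert that both strategies agree for every pattern.
    for raw in patterns:
        compiled = re.compile(raw)
        old = hits_collect_then_sum(compiled, seqs, weights)
        new = hits_running_sum(compiled, seqs, weights)
        assert old[0] == new[0]
        assert math.isclose(old[1], new[1])
    return True


toy_seqs = ["CASSLGQ", "CASTLAA", "CAWSV", "CASSPGT"]
toy_weights = [0.5, 2.0, 1.0, 3.0]
toy_patterns = [r"^CAS[ST]L", r"^CASS", r"^XYZ"]
all_equal = check_equivalent(toy_patterns, toy_seqs, toy_weights)
```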

kmayerb commented 1 year ago

@sgp79 Thanks for the effort to improve the efficiency of tcrdist3. The test in test_bkgd_cntl_nn3_vs_bkgd_cntl_nn2.py failed. I will try to figure out why, but I have limited bandwidth to work on tcrdist3 this week, so any thoughts on why this test failed would be appreciated.

sgp79 commented 1 year ago

@kmayerb I'll take a look. It might be the test setup stage (i.e. getting the test data), as this bit may end up being system-specific (I wrote it to be portable, but didn't test that part extensively). Anyway, I'll let you know what I find.

sgp79 commented 1 year ago

@kmayerb Fixed (I hope): I'd forgotten that dill isn't part of the standard library, and test_bkgd_cntl_nn3_vs_bkgd_cntl_nn2.py was using it to pickle the tr and tr_background objects. There was no reason not to just hold them in memory, so I've rewritten the test to do this and removed the dill import. I removed some unnecessary csv reading/writing at the same time.