kmayerb / tcrdist3

flexible CDR based distance metrics
MIT License

Memory usage of regex testing #77

Open sgp79 opened 1 year ago

sgp79 commented 1 year ago

Testing regexes in bkgd_cntl_nn2 can end up using a lot of memory with large search sets, since for each regex a list recording hit/miss against every background sequence is created. As all that subsequent code needs is the sum (and weighted sum) of the hits, memory usage can be greatly reduced by calculating these sums inside the _multi_regex function and returning them rather than the lists.
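
For concreteness, here is a minimal sketch of the idea (hypothetical function and argument names; the real code in my fork differs):

```python
import re

# Memory-heavy pattern: build one hit/miss list per regex, each as long
# as the background set, and leave the summing to the caller.
def multi_regex_lists(regexes, background_seqs):
    return [[1 if re.search(rx, s) else 0 for s in background_seqs]
            for rx in regexes]

# Memory-light pattern: accumulate the plain and weighted sums directly,
# so only two numbers per regex are ever held at once.
def multi_regex_sums(regexes, background_seqs, weights):
    results = []
    for rx in regexes:
        hits, weighted_hits = 0, 0.0
        for s, w in zip(background_seqs, weights):
            if re.search(rx, s):
                hits += 1
                weighted_hits += w
        results.append((hits, weighted_hits))
    return results
```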

To address this, I've made a fork with altered versions of _multi_regex [_multi_regex_weighted] and bkgd_cntl_nn2 [bkgd_cntl_nn3], along with a test checking that bkgd_cntl_nn2 and bkgd_cntl_nn3 give the same result [tests.test_bkgd_cntl_nn3_vs_bkgd_cntl_nn2.py].
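
To illustrate the shape of that check using the sketch functions above (the real test exercises bkgd_cntl_nn2 and bkgd_cntl_nn3 end to end, with real clone and background data):

```python
# Verify the sum-returning version agrees with summing the per-sequence
# lists from the list-returning version (toy regexes and sequences).
def test_sums_match_lists():
    regexes = ["CASS.GET", "CAS{2}L"]
    seqs = ["CASSAGET", "CASSL", "CAIRTE"]
    weights = [0.5, 1.0, 2.0]
    lists = multi_regex_lists(regexes, seqs)
    sums = multi_regex_sums(regexes, seqs, weights)
    for hit_list, (hits, weighted) in zip(lists, sums):
        assert hits == sum(hit_list)
        assert weighted == sum(w for h, w in zip(hit_list, weights) if h)

test_sums_match_lists()
```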

In my testing (with filprofiler), this reduces peak memory usage by 40% (763 MB to 459 MB) with a 435-sequence test set and a 100,000-sequence background, and by 70% (1,240 MB to 376 MB) with a 1,038-sequence test set and a 100,000-sequence background. The runs with the larger test set used sparse matrices, which probably explains why bkgd_cntl_nn3's peak (376 MB) is lower there than with the smaller set (459 MB) despite the extra sequences.

The _multi_regex_weighted function takes a compiled regex rather than an uncompiled one, which seems to give a roughly 2-fold speed-up when bkgd_cntl_nn2 is run with test_regex=True (80 s to 38 s).
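
The ~2-fold figure is specific to this workload, but the general effect is easy to reproduce; a toy benchmark (made-up pattern and sequences):

```python
import re
import timeit

# re caches compiled patterns, but re.search(pattern, s) still pays a
# cache lookup on every call; compiling once up front avoids it.
pattern = "C[AS]S[LF]G.{2,5}QYF"
seqs = ["CASSLGQGAETQYF"] * 100_000

t_uncompiled = timeit.timeit(
    lambda: [re.search(pattern, s) for s in seqs], number=5)

rx = re.compile(pattern)
t_compiled = timeit.timeit(
    lambda: [rx.search(s) for s in seqs], number=5)

print(f"uncompiled: {t_uncompiled:.2f}s  compiled: {t_compiled:.2f}s")
```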

@kmayerb, does this sound useful to you? Would you like me to open a pull request?

kmayerb commented 1 year ago

Thank you for your detailed explanation. Yes, please open a PR and I will try to incorporate it. Best, K


sgp79 commented 1 year ago

Done!