PERF: Speed up distance computations

ElDeveloper commented 2 years ago

Use Pandas vectorized operations to speed up the validation and computation of first distances and distances to baseline.

In a dataset with ~6K samples runtime went from >1 minute to under a second.

All tests pass except for test_first_distances_ecam which fails because of replicate_handling='drop'. As far as I can tell when there's a repeated state the repeated state and the next state are dropped. Which seems odd, but wanted to check if this is expected or if I am misunderstanding something. Seems to me that only the repeated state should be dropped.

For example, when I go to ecam_map_maturity.txt, I noticed that sample 10249.C001.07SS is not in ecam-first-distances even though the state for that sample isn't repeated. Same is true for 10249.C001.17SS which should be included since state 10 is repeated but state 11 isn't. Any help would be great here.

nbokulich commented 2 years ago

thanks @ElDeveloper ! this sounds great.

Re: the test failure with drop, this is the intended behavior: https://github.com/qiime2/q2-longitudinal/blob/639ced89de62b3e3c38d898281f9aaef6c59c248/q2_longitudinal/plugin_setup.py#L91-L95

As the first-distances are calculated across each interval, missing timepoints (including if a timepoint is dropped) effectively cause two first distances to drop (at time t and time t+1). So the examples you gave are correct; these are cases where the first distance at time t+1 cannot be calculated because the preceding timepoint t is missing (because all samples were dropped). So the first distances cannot be calculated at either interval t-1 <--> t or t <--> t+1.

For these reasons, the "drop" option is rather specific and might only be valuable in some situations. "random" is the more suitable option in most real situations, as it will lead to a random sample being selected at those timepoints so that the first distances at both intervals can be calculated. But it sounds like the test is working as originally intended.

ElDeveloper commented 2 years ago

Fair enough, thanks for the explanation! I'll fix the code in that case. Would you also be OK if I changed the documentation in plugin_setup.py? Seems like "will discard all replicated samples" is ambiguous on the fact that t and t+1 are dropped.

nbokulich commented 2 years ago

Would you also be OK if I changed the documentation in plugin_setup.py? Seems like "will discard all replicated samples" is ambiguous on the fact that t and t+1 are dropped.

documentation updates are very welcome 😄

but not that line — a bunch of actions in this plugin share the same parameter descriptions. So this action should either get a unique parameter description, or better describe this behavior in the action description (which IMO would be best, since this is the behavior from any missing samples, not specifically replicate_handling='drop')

ElDeveloper commented 2 years ago

Cheers thanks so much.

Would you be able to help me diagnose one more case. I've updated my code, but now I am seeing that other samples are currently being excluded from the expected output. For example: 10249.C002.09SS, 10249.C002.15SS, 10249.C002.18SS, 10249.C005.07SS, 10249.C008.04SS, 10249.C008.12SS, 10249.C009.16SS, 10249.C011.04SS, 10249.C014.12SS, 10249.C018.03SS, 10249.C018.17SS, 10249.C021.04SS, 10249.C021.05SS, 10249.C021.07SS, 10249.C021.08SS, etc. When I look at the mapping file for a few of these, I don't see them being associated with any of the repeated sample pairs that are dropped. Any hints as to why these are dropped from the expected output?

I've highlighted the first few in the context of the mapping file here:

`10249.C002.09SS`, `10249.C002.15SS`, `10249.C002.18SS`

`10249.C005.07SS`

`10249.C008.04SS`, `10249.C008.12SS`

nbokulich commented 2 years ago

Hey @ElDeveloper , Those first distances are not being calculated because they are missing the immediately preceding timepoint (i.e., it's not being dropped, rather that timepoint is just missing to begin with).

The first distances intervals are meant to be measured across the same time intervals in each subject. So if you have a table like this (rows are samples, cols are months, cells indicate if a sample was collected at that timepoint): subject-id	0	1	2	3	4	5	10
A	+	+	+	+	+	+	+
B	+	+	+	+	+

A has all first distances calculated, but B only has distances starting at timepoint 4 (i.e., samples were collected at 0 and 3, but the immediately preceding timepoint is missing). On the other hand, one could drop those columns and all distances would be calculated for each sample in a table that looks like this: subject-id	0	3	4	5	10
A	+	+	+	+	+
B	+	+	+	+	+

This is the case above, where 10249.C002.09SS is a sample at month 8, but month 7 is missing. If month 7 were missing in all subjects, the distance interval would be from months 6->8. But that's not the case here, as the interval is based on all samples in the table.

Does that make sense?

Note: the intervals are evenly measuring across samples, but the intervals are not necessarily even across time. One big enhancement here could be to also expose an interval parameter, to allow auto (status quo, variable spacing possible), or at explicit intervals (e.g., to ensure that all distances represent 1 month intervals, to set an explicit lag window, or even to skip shorter intervals). I see this as an ENH not needed now, but maybe based on how your code is set up you already have this implemented and it is easier to expose this now instead of totally refactoring.

ElDeveloper commented 2 years ago

Thanks so much for the explanation. I didn't realize all this was going on under the hood. I like the idea of auto vs shared? or something like that. Let me rework this current version and I'll see where I get. 👍🏽

ElDeveloper commented 2 years ago

@nbokulich all tests are passing and should be ready for review now. I've left out the interval selection out for this pass but it is something we could naturally add in a follow-up PR. Looking forward to hearing your thoughts, I know this sort of code can sometimes be hard to read.

ElDeveloper commented 2 years ago

@nbokulich thanks for the detailed review and apologies for the delay getting back to you on these comments. I'll be doing that in the near future. Thanks so much for considering, and for your patience! ⏳ 🚤

lizgehret commented 2 years ago

Hey @ElDeveloper, thanks for all of your hard work here! We are currently doing some PR triage and review right now - are you still in the process of addressing @nbokulich's review comments?

ElDeveloper commented 2 years ago

Hi @lizgehret thanks for the heads up. I have almost replied to all the comments (although I haven't posted the review yet), but there's a few where I still need to do some additional work. I'll respond ASAP.

qiime2 / q2-longitudinal