Reconsidering filtered samples from the past N consecutive days

szhan commented 1 month ago

An inappropriate HMM cost threshold may filter out epidemiologically relevant variants, which is undesirable (e.g. see #188). But , in practice, it is not easy to figure out, or clear, what an appropriate threshold value should be.

To see if we can work around this issue, we are trying a strategy where samples filtered out by the HMM cost threshold from the past N days are reconsidered in a post-processing step at the end of a round of daily extension.

The procedure is as follows:

Given N, fetch the samples (and their already inferred paths and mutations) from the past N consecutive days' worth of pickled files. Days with no filtered samples do not contribute to this pool of reconsidered samples.
Group the reconsidered samples by their paths and immediate reversions as done in add_matching_results.
The samples in groups of size greater than k are attached to the ARG based on their inferred paths (i.e. without needing to rerun HMM matching).
After proceeding to the following day, the oldest filtered samples, which are not attached to the ARG, are excluded in a FIFO fashion.

A good place to do this post-processing step may after calling add_matching_results while calling extend.

If this works, then we should see the first Alpha samples added in Sep., 2020, even though they were filtered out by the HMM cost filter.

szhan commented 1 month ago

After grouping reconsidered samples by their paths and immediate reversions, for each group:

Insert a new node (set its time at some time arbitrarily slightly earlier than the oldest reconsidered sample).
Add new edges connecting the new node to the shared parent nodes of the grouped samples as per their shared path.
Add the shared mutations (including immediate reversions) above the new node.
Attach each of the reconsidered samples to the new node via new edges.

@jeromekelleher I think this sounds about right? I think we should keep the immediate reversions here.

jeromekelleher commented 1 month ago

I was imagining that we'd reuse the same local tree building technology as we have in the usual attachment case. I guess we'd have to change node times, but otherwise should work?

szhan commented 1 month ago

add_matching_results has been modified to take the argument min_group_size to exclude grouped matches having fewer than a specified number of samples.

szhan commented 1 month ago

We have decided to handle sampling node dates at a later issue. The focus now is to see whether this strategy adds the early Alpha samples.

szhan commented 1 month ago

I think it is working! Using min group size of 2 and looking at the past 5 days worth of filtered samples, these three Alpha samples collected in early Sep. 2020 got attached on 2020-09-25.

strain	date
ERR4659819	2020-09-20
ERR4682028	2020-09-21
ERR5217634	2020-09-23

jeromekelleher commented 1 month ago

Wahey! How much other stuff gets attached at the same time?

jeromekelleher commented 1 month ago

As in, how many other samples also get added back in besides the ones we think should be added in?

szhan commented 1 month ago

Hmm, it seems like quite a lot of samples are being added back in.

date	filtered samples added back
2020-09-24	1750
2020-09-25	2670
2020-09-26	2423
2020-09-27	1868
2020-09-28	1489
2020-09-29	1175
2020-09-30	42
2020-10-01	43

szhan commented 1 month ago

The numbers above are misleading, because they count excluded samples that may be added back in already. Still, it seems like samples (besides the ones we think should be added back in) are being added back in.

szhan commented 1 month ago

One thing clearly missing is to make sure that the same excluded samples don't get added back in multiple times.

szhan commented 1 month ago

There are five Alpha samples collected in Sep. 2020 in the Viridian dataset. The three above are added. I'm looking at the other two samples both collected on 2020-09-30. One got added in but the other didn't, and for some reason the other one is not in the excluded samples file. I think all five Alpha samples should be added by 2020-10-01, so I'm investigating this further before doing anything else.

szhan commented 1 month ago

Ah, the sample (ERR5071073) is listed in the metadata, but not in the input alignments because it got filtered out for having too many Ns in the consensus sequence. Okay, I think the four Alpha samples that should be there are added in.

szhan commented 1 month ago

Also, this function needs to be modified.

def last_date(ts):
    if ts.num_samples == 0:
        # Special case for the initial ts which contains the
        # reference but not as a sample
        u = ts.num_nodes - 1
    else:
        u = ts.samples()[-1]
    node = ts.node(u)
    assert node.time == 0
    return parse_date(node.metadata["date"])

It is getting the last date among the samples based on the last node added. In the previous pipeline where filtered samples are not reconsidered, then it is getting the latest date among the samples. But when we add samples from the past N days, then I don't think it is returning the latest date among the samples anymore.

jeromekelleher commented 1 month ago

Ah, good catch. I guess this is a motivation for getting the time of the extra added sample nodes correct. Then, we can define the last_date as

samples = ts.samples()
time_0 = samples[ts.nodes_time[samples] == 0]
node = ts.node(samples[0])  # Arbitrarily pick first

jeromekelleher commented 1 month ago

To get stuff working, what you could do is:

max(parse_date(node.metadata["date"] for node in ts.nodes())

Slow, but will work ok for prototyping.

szhan commented 1 month ago

I was thinking the former.

def last_date(ts):
    if ts.num_samples == 0:
        # Special case for the initial ts which contains the
        # reference but not as a sample
        u = ts.num_nodes - 1
    else:
        samples = ts.samples()
        u = samples[ts.nodes_time[samples] == 0][0]
    node = ts.node(u)
    assert node.time == 0
    return parse_date(node.metadata["date"])

jeromekelleher commented 1 month ago

Yeah, that works, but you'll have samples in there with time=0 but aren't from the most recent day.

szhan commented 1 month ago

Ah, okay. Gotta fix that later too.

szhan commented 1 month ago

It may be useful to promote collection date to an attribute in the Sample class rather than keeping it as an entry in the metadata, because we are accessing the collection date pretty often.

EDIT: Probably will do this later, since I don't feel like changing it and then needing to update other parts of the code.

szhan commented 1 month ago

Okay, the list of reconsidered samples are now being updated FIFO. The thing left to do is to make sure the times of the excluded samples added back in are not the time when they are added.

szhan commented 1 month ago

Should we tackle the node times of the added excluded samples in a separate issue?

szhan commented 1 month ago

Also, I think the CLI command daily-extend should take excluded_sample_dir as a separate argument from output_prefix.

jeromekelleher commented 1 month ago

I think we probably want to use a SQLite DB to store the excluded samples. This will make it easier to keep track of what's in there, and which ones have been inserted back in. I can do this if you have a working prototype based on pickle files.

jeromekelleher commented 1 month ago

Should we tackle the node times of the added excluded samples in a separate issue?

That's up to you - if the code is working well enough for evaluation purposes, I'm happy to kick the node dating can down the road.

szhan commented 1 month ago

I'll tackle the node times in a separate issue right after PR #194 gets merged. One reason I want to tackle it a bit later is because I want to get new 2020 trees for Yan's workshop ASAP and so want to run what we got now.

szhan commented 1 month ago

Also, we should probably log the number of excluded samples, reconsidered samples, and added samples per round of daily extension.

szhan commented 1 month ago

I think we probably want to use a SQLite DB to store the excluded samples. This will make it easier to keep track of what's in there, and which ones have been inserted back in. I can do this if you have a working prototype based on pickle files.

Like we store the Sample objects as BLOBs and query them by collection date like we do with sample metadata?

jeromekelleher commented 1 month ago

Pretty much.

szhan commented 1 month ago

I think these are the TODO items before closing this issue.

[x] Fix the node times of added-backed samples.
[ ] Store pickled Sample objects in a SQLite database.
[ ] Log the number of excluded, reconsidered, and added-back samples per iteration.

szhan commented 1 month ago

Sorry, I had to fix adding more reconsidered samples. #195

szhan commented 1 month ago

The trees built using the code with PR #196 have the four Alpha samples in Sep. 2020. Also, by Dec. 31, 2020, there are 22,136 Alpha samples. So, it is working, I think.

EDIT: Note also that these trees contain 186,951, or ~70%, of the 2020 samples.

szhan commented 1 month ago

Quick update. Trees built (using max HMM cost of 5, min group size of 2, and reconsider past 5 days of excluded samples) from the samples collected up to and including Feb. 19, 2021 contain XA as a recombinant. I'll take a closer look at these trees.

jeromekelleher / sc2ts

Reconsidering filtered samples from the past N consecutive days #193