Identify sensible max HMM cost thresholds

szhan commented 3 months ago

To explore how applying a max HMM cost threshold affects ARG building, I've run sc2ts on all the 2020 samples (with full precision collection dates) using varying thresholds (with k = 3). The threshold values are None (the baseline for comparison), 20, 15, 10, and 5.

First, a base ARG was built up to Jan. 31 2020 (inclusive) without a max threshold. Then, that base ARG is extended while applying a max threshold.

High-level comparison of the ARGs

The table below shows a high-level summary of the differences between the ARGs.

threshold	num_samples	num_nodes	num_edges	num_mutations	diff_samples	diff_nodes	diff_edges	diff_mutations
None	173216	195179	195252	141778	0	0	0	0
20	173191	195130	195200	141460	25	49	52	318
15	173108	195010	195074	140561	108	169	178	1217
10	172483	194067	194118	136062	733	1112	1134	5716
5	169428	189673	189703	123736	3788	5506	5549	18042

diff_* denotes the number of samples, nodes, edges, and mutations reduced compared to the baseline (None).

There are a couple of big jumps going from threshold 15 to 10 and then from 10 to 5. The threshold may be too high at 10 already. I'm not getting much more out of looking at this.

Time-travelling Alpha sample

The sample ERR6713815 seems to be a time-travelling case of Alpha (B.1.1.7) in the Viridian dataset. It is labelled as B.1.1.7 according to the column Viridian_pangolin in the metadata, and it has a collection date of March 01, 2020. The Alpha variant was first identified in Sep. 2020 (ref).

In the ARG built using a threshold of 5, the sample's path has a single parent, but it has 29 mutations, so its HMM cost is 29. Note that none of the mutations are flagged as (immediate) reversions. The time-travelling sample should be filtered out, but the thresholding only applies to newly added samples that are classified as singletons in terms of attachment path and set of immediate reversions. Here, the time traveller was not classified as a singleton because it shares the same parent with other newly added samples which also have no immediate reversions (actually, no mutations at all).

Maybe we have to change how we group newly added samples.

szhan commented 3 months ago

After some discussion, we decided that applying a HMM cost filter after grouping newly added samples isn't quite right, for example, it doesn't exclude the Alpha time traveller above.

For now, we will separate HMM cost filtering from match grouping. The next step is to see what samples get filtered out per day using a fairly strict threshold (max threshold of 5, which seems to be reducing the number of mutations by quite a bit) before improving the filtering strategy.

szhan commented 2 months ago

In the latest run, samples are attached to a base ARG (built up to Jan. 31, 2020) as described above using a max HMM cost threshold of 5. Samples excluded due to this threshold are stored in pickle files.

The time traveller Alpha sample (ERR6713815; pointed out above) is gone, but so are the other Alpha samples that are supposed to show up in Sep./Oct. In an ARG built up to Dec. 10, 2020, no Alpha samples have been attached. The threshold is just too strict, I think.

Below is a table showing the number of samples excluded on certain dates. It is pretty clear that an increasing number of samples are excluded over time.

date	number of excluded samples on the date
2020-02-29	0
2020-03-31	5
2020-04-30	1
2020-05-31	0
2020-06-30	34
2020-07-31	28
2020-08-31	249
2020-09-30	357
2020-10-31	336
2020-11-30	312
2020-12-10	1358

Also, the pickling step is taking quite some time, it seems, because the run is taking at least one and a half day longer than without pickling the excluded samples.

jeromekelleher commented 2 months ago

Great, I think this is working as expected. Now we need to think of a simple way of aggregating the "kicked out" samples over several days, and see if we get a sufficiently large grouping clustering under the same node with ~ the same mutations. That's the signal of a new lineage.

szhan commented 2 months ago

There are supposed to be three Jackson et al. recombinants sampled in Dec. 2020 (see Table 1).

recombinant	number of samples	sampling date
Group B	2	2020/12/23–2020/12/24
CAMC-CBA018	1	2020/12/18
MILK-103C712	1	2020/12/18

I don't think they are among the recombinants detected in an ARG (up to Dec. 30, 2020) built using the max HMM cost threshold of 5.

EDIT: The Jackson et al. recombinants have Alpha as a parent. Since Alpha was not added to this ARG, the recombinants are not present.

szhan commented 2 months ago

Quick update. In the latest test run using the code from PR #196, four of the Alpha samples collected in Sep. 2020 were added to the trees. Also, we identified two of the Jackson et al. recombinants (the two Group B sequences) as recombinants in our trees, but the two singletons (which have different breakpoints) were not added (likely due to our strict HMM cost).

Because the metadata file of the Viridian dataset does not have strain names or the EPI IDs, I had to manually search for the ENA accessions of the Jackson et al. recombinants. Thankfully, the strain names are in the description text of the ENA entries, and so I was able to get the corresponding ENA accessions. They are below:

ENA accession	strain name	group
ERR5058070	QEUH-CCCB30	Group B
ERR5058141	QEUH-CD0F1F	Group B
ERR5054123	CAMC-CBA018	singleton
ERR5054082	CAMC-CB7AB3	singleton

So, it looks like the new approach is working, even though singleton, evolutionary unsuccessful recombinants may not be added.

I'll do a bit more testing and also run sc2ts to Feb. 2021 when the XA recombinant emerges.

jeromekelleher commented 2 months ago

Awesome. Looks like it's working!

szhan commented 2 months ago

As mentioned in this comment, we can pick up XA in Feb. 2021. So, combined with reconsidering samples, this approach seems to be working.

szhan commented 1 month ago

Since we have been checking whether the new trees identify the Jackson et al. recombinants, it is probably useful to have their ENA run accessions (used as the unique sequence identifiers in our latest runs) and strain names here. I have extracted the ENA accessions and strain names from Table S2 of the Jackson et al. paper below.

ENA run accession	strain name	group	sampling date
ERR5308556	ALDP-11CF93B	Group A	2021-01-30
ERR5323237	ALDP-125C4D7	Group A	2021-02-06
ERR5414941	ALDP-130BB95	Group A	2021-02-21
ERR5404883	LIVE-DFCFFE	Group A	2021-02-14
ERR5058070	QEUH-CCCB30	Group B	2020-12-23
ERR5058141	QEUH-CD0F1F	Group B	2020-12-24
ERR5272107	MILK-1166F52	Group C	2021-01-24
ERR5406307	MILK-11C95A6	Group C	2021-01-30
ERR5232711	QEUH-109B25C	Group C	2021-01-18
ERR5349458	MILK-126FE1F	Group D	2021-02-07
ERR5335088	RAND-12671E1	Group D	2021-02-02
ERR5340986	RAND-128FA33	Group D	2021-02-02
ERR5054123	CAMC-CBA018	singleton	2020-12-18
ERR5054082	CAMC-CB7AB3	singleton	2020-12-18
ERR5304348	MILK-103C712	singleton	2021-01-12
ERR5238288	QEUH-1067DEF	singleton	2021-01-17

jeromekelleher / sc2ts

Identify sensible max HMM cost thresholds #188

High-level comparison of the ARGs

Time-travelling Alpha sample