Open szhan opened 3 months ago
After some discussion, we decided that applying a HMM cost filter after grouping newly added samples isn't quite right, for example, it doesn't exclude the Alpha time traveller above.
For now, we will separate HMM cost filtering from match grouping. The next step is to see what samples get filtered out per day using a fairly strict threshold (max threshold of 5, which seems to be reducing the number of mutations by quite a bit) before improving the filtering strategy.
In the latest run, samples are attached to a base ARG (built up to Jan. 31, 2020) as described above using a max HMM cost threshold of 5. Samples excluded due to this threshold are stored in pickle files.
The time traveller Alpha sample (ERR6713815; pointed out above) is gone, but so are the other Alpha samples that are supposed to show up in Sep./Oct. In an ARG built up to Dec. 10, 2020, no Alpha samples have been attached. The threshold is just too strict, I think.
Below is a table showing the number of samples excluded on certain dates. It is pretty clear that an increasing number of samples are excluded over time.
date | number of excluded samples on the date |
---|---|
2020-02-29 | 0 |
2020-03-31 | 5 |
2020-04-30 | 1 |
2020-05-31 | 0 |
2020-06-30 | 34 |
2020-07-31 | 28 |
2020-08-31 | 249 |
2020-09-30 | 357 |
2020-10-31 | 336 |
2020-11-30 | 312 |
2020-12-10 | 1358 |
Also, the pickling step is taking quite some time, it seems, because the run is taking at least one and a half day longer than without pickling the excluded samples.
Great, I think this is working as expected. Now we need to think of a simple way of aggregating the "kicked out" samples over several days, and see if we get a sufficiently large grouping clustering under the same node with ~ the same mutations. That's the signal of a new lineage.
There are supposed to be three Jackson et al. recombinants sampled in Dec. 2020 (see Table 1).
recombinant | number of samples | sampling date |
---|---|---|
Group B | 2 | 2020/12/23–2020/12/24 |
CAMC-CBA018 | 1 | 2020/12/18 |
MILK-103C712 | 1 | 2020/12/18 |
I don't think they are among the recombinants detected in an ARG (up to Dec. 30, 2020) built using the max HMM cost threshold of 5.
EDIT: The Jackson et al. recombinants have Alpha as a parent. Since Alpha was not added to this ARG, the recombinants are not present.
Quick update. In the latest test run using the code from PR #196, four of the Alpha samples collected in Sep. 2020 were added to the trees. Also, we identified two of the Jackson et al. recombinants (the two Group B sequences) as recombinants in our trees, but the two singletons (which have different breakpoints) were not added (likely due to our strict HMM cost).
Because the metadata file of the Viridian dataset does not have strain names or the EPI IDs, I had to manually search for the ENA accessions of the Jackson et al. recombinants. Thankfully, the strain names are in the description text of the ENA entries, and so I was able to get the corresponding ENA accessions. They are below:
ENA accession | strain name | group |
---|---|---|
ERR5058070 | QEUH-CCCB30 | Group B |
ERR5058141 | QEUH-CD0F1F | Group B |
ERR5054123 | CAMC-CBA018 | singleton |
ERR5054082 | CAMC-CB7AB3 | singleton |
So, it looks like the new approach is working, even though singleton, evolutionary unsuccessful recombinants may not be added.
I'll do a bit more testing and also run sc2ts to Feb. 2021 when the XA recombinant emerges.
Awesome. Looks like it's working!
As mentioned in this comment, we can pick up XA in Feb. 2021. So, combined with reconsidering samples, this approach seems to be working.
Since we have been checking whether the new trees identify the Jackson et al. recombinants, it is probably useful to have their ENA run accessions (used as the unique sequence identifiers in our latest runs) and strain names here. I have extracted the ENA accessions and strain names from Table S2 of the Jackson et al. paper below.
ENA run accession | strain name | group | sampling date |
---|---|---|---|
ERR5308556 | ALDP-11CF93B | Group A | 2021-01-30 |
ERR5323237 | ALDP-125C4D7 | Group A | 2021-02-06 |
ERR5414941 | ALDP-130BB95 | Group A | 2021-02-21 |
ERR5404883 | LIVE-DFCFFE | Group A | 2021-02-14 |
ERR5058070 | QEUH-CCCB30 | Group B | 2020-12-23 |
ERR5058141 | QEUH-CD0F1F | Group B | 2020-12-24 |
ERR5272107 | MILK-1166F52 | Group C | 2021-01-24 |
ERR5406307 | MILK-11C95A6 | Group C | 2021-01-30 |
ERR5232711 | QEUH-109B25C | Group C | 2021-01-18 |
ERR5349458 | MILK-126FE1F | Group D | 2021-02-07 |
ERR5335088 | RAND-12671E1 | Group D | 2021-02-02 |
ERR5340986 | RAND-128FA33 | Group D | 2021-02-02 |
ERR5054123 | CAMC-CBA018 | singleton | 2020-12-18 |
ERR5054082 | CAMC-CB7AB3 | singleton | 2020-12-18 |
ERR5304348 | MILK-103C712 | singleton | 2021-01-12 |
ERR5238288 | QEUH-1067DEF | singleton | 2021-01-17 |
To explore how applying a max HMM cost threshold affects ARG building, I've run sc2ts on all the 2020 samples (with full precision collection dates) using varying thresholds (with k = 3). The threshold values are None (the baseline for comparison), 20, 15, 10, and 5.
First, a base ARG was built up to Jan. 31 2020 (inclusive) without a max threshold. Then, that base ARG is extended while applying a max threshold.
High-level comparison of the ARGs
The table below shows a high-level summary of the differences between the ARGs.
diff_* denotes the number of samples, nodes, edges, and mutations reduced compared to the baseline (None).
There are a couple of big jumps going from threshold 15 to 10 and then from 10 to 5. The threshold may be too high at 10 already. I'm not getting much more out of looking at this.
Time-travelling Alpha sample
The sample ERR6713815 seems to be a time-travelling case of Alpha (B.1.1.7) in the Viridian dataset. It is labelled as B.1.1.7 according to the column Viridian_pangolin in the metadata, and it has a collection date of March 01, 2020. The Alpha variant was first identified in Sep. 2020 (ref).
In the ARG built using a threshold of 5, the sample's path has a single parent, but it has 29 mutations, so its HMM cost is 29. Note that none of the mutations are flagged as (immediate) reversions. The time-travelling sample should be filtered out, but the thresholding only applies to newly added samples that are classified as singletons in terms of attachment path and set of immediate reversions. Here, the time traveller was not classified as a singleton because it shares the same parent with other newly added samples which also have no immediate reversions (actually, no mutations at all).
Maybe we have to change how we group newly added samples.