jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License

Local tree around the emergence of Delta #313

Open szhan opened 1 month ago

szhan commented 1 month ago

While looking at the earliest HMM group of samples attached in long_arg_v7_clustloc-mrm_2-rw_10-mgs_10-2021-06-30.ts.tsz (md5sum: 6cde6e2c00624a505aa00063973368f2), I noticed that the samples have a suspiciously high number of ambiguous characters (specifically K).

{'-': 13, 'A': 8892, 'C': 5458, 'G': 5845, 'K': 1, 'N': 121, 'T': 9573}
{'-': 13, 'A': 8892, 'C': 5458, 'G': 5845, 'K': 1, 'N': 121, 'T': 9573}
{'-': 13, 'A': 8891, 'C': 5458, 'G': 5845, 'K': 1, 'N': 121, 'R': 1, 'T': 9573}
{'-': 1, 'A': 8894, 'C': 5468, 'G': 5842, 'K': 1, 'N': 121, 'T': 9576}
{'-': 13, 'A': 8837, 'C': 5442, 'G': 5815, 'K': 1, 'N': 268, 'T': 9527}
{'A': 8885, 'C': 5462, 'G': 5844, 'N': 130, 'T': 9582}
{'-': 1, 'A': 8793, 'C': 5412, 'G': 5796, 'N': 413, 'T': 9488}
{'-': 4, 'A': 8889, 'C': 5465, 'G': 5848, 'K': 1, 'N': 121, 'T': 9575}
{'A': 8231, 'C': 5081, 'G': 5439, 'N': 2342, 'T': 8810}
{'-': 4, 'A': 8791, 'C': 5412, 'G': 5795, 'N': 413, 'T': 9487, 'Y': 1}
{'A': 8893, 'C': 5471, 'G': 5849, 'N': 104, 'T': 9586}
{'-': 1, 'A': 8793, 'C': 5412, 'G': 5795, 'K': 1, 'N': 413, 'T': 9488}
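For reference, per-sample tallies like the ones above can be produced by counting characters in each aligned sequence. This is a minimal sketch, assuming each sequence is available as a plain string; `count_bases` and `ambiguity_count` are hypothetical helper names for illustration, not part of sc2ts.

```python
from collections import Counter

# IUPAC ambiguity codes other than N (illustrative constant, not from sc2ts).
AMBIGUOUS = set("RYSWKMBDHV")


def count_bases(sequence):
    """Return character counts sorted by character, like the dicts above."""
    return dict(sorted(Counter(sequence.upper()).items()))


def ambiguity_count(sequence):
    """Count positions holding an ambiguity code other than N."""
    counts = Counter(sequence.upper())
    return sum(n for base, n in counts.items() if base in AMBIGUOUS)


# Toy sequence, not real data.
toy = "ACGTKNN-ACGT"
print(count_bases(toy))
print(ambiguity_count(toy))
```

Counting N separately from the other ambiguity codes keeps missing data distinct from heterozygous-looking calls such as K, R and Y.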

Also, these samples have a mix of Viridian Pango labels: B.1.617.2 (n = 3), B.1.617 (n = 2), and B.1.617.1 (n = 7).

These samples may be making it harder to build a good local tree around the start of the Delta wave. By being strict on the number of ambiguous characters (Viridian_cons_het == 0, ignoring '.'), we may be able to do better here.

szhan commented 1 month ago

Here is the sampling frequency when looking only at samples with Viridian_cons_het == 0 and Viridian_cons_het != '.'. There are 2,040,650 such samples.
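The filter described here could be sketched roughly as follows, assuming the metadata is a tab-separated table in which `Viridian_cons_het` holds either an integer or `.` for missing. The `sample` column and the toy rows are hypothetical, for illustration only.

```python
import csv
import io


def has_no_hets(row):
    """True if Viridian_cons_het is present (not '.') and equal to 0."""
    value = row["Viridian_cons_het"]
    return value != "." and int(value) == 0


def filter_no_hets(handle):
    """Yield metadata rows that pass the het == 0 filter."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if has_no_hets(row):
            yield row


# Toy metadata, not real data; the 'sample' column name is an assumption.
toy_tsv = "sample\tViridian_cons_het\nS1\t0\nS2\t1\nS3\t.\nS4\t0\n"
kept = [row["sample"] for row in filter_no_hets(io.StringIO(toy_tsv))]
print(kept)
```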

[Image: viridian_md_no_hets]

szhan commented 1 month ago

Since the sampling is quite thin before March 1st, we can probably relax the filter on hets for that part of the ARG. We can impose the het = 0 filter on the samples from then onwards. The early Delta and closely related samples crop up in March/April.
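A date-dependent version of the filter might look like the sketch below. It assumes the cutoff is 1 March 2021 (the year is inferred from the rest of the thread, not stated in this comment) and that `Viridian_cons_het` is `.` when missing. `keep_sample` is a hypothetical helper, not an sc2ts function.

```python
from datetime import date

# Assumed cutoff: the comment says "March 1st"; the year 2021 is an inference.
CUTOFF = date(2021, 3, 1)


def keep_sample(sample_date, cons_het, cutoff=CUTOFF):
    """Keep every sample before the cutoff; from the cutoff onwards,
    keep only samples with a known Viridian_cons_het equal to 0."""
    if sample_date < cutoff:
        return True
    return cons_het != "." and int(cons_het) == 0


# Toy records (collection date, Viridian_cons_het), not real data.
records = [
    (date(2021, 2, 20), "3"),
    (date(2021, 3, 1), "1"),
    (date(2021, 3, 15), "0"),
    (date(2021, 4, 2), "."),
]
kept = [rec for rec in records if keep_sample(*rec)]
print(kept)
```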

szhan commented 1 month ago

Viridian_cons_het == 0 is too strict. Lots of samples have at least 1 het.

jeromekelleher commented 1 month ago

What do you suggest so? I might run this over the weekend.

szhan commented 1 month ago

Just noting that we don't have any good samples till March 2021 for Delta B.1.617.2.