jeromekelleher / sc2ts-paper


Nodes with larger polytomies in the Viridian ARG #215

Open hyanwong opened 2 weeks ago

hyanwong commented 2 weeks ago

From Slack:

I was worried that the new ARG construction procedure was creating larger polytomies than before (which could explain why we are tracing recombinants back to MRCAs at only 2 timepoints). This does indeed seem to be the case. Here are the arities (on the x axis: note this is a log scale) against time, with the intensity of point colour weighted by the number of children. You can see that the Viridian ARG (top) has a few very large polytomies that aren't present in the old long ARG

This is presumably because of the exclusion criteria: we are not breaking up the start of new outbreaks as much as we were previously doing, because we are excluding a few of the earliest nodes in an outbreak. It's only down to a few specific nodes, but I suspect it is a real pattern.

The code for producing this is very simple, involving just two calls to np.unique:

import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(10, 10))
max_arity = 1e3
for ax, tmp_ts, label in zip(axes, (ts, old_ts), ("Viridian", "old")):
    # Deduplicate (parent, child) pairs so a child spanning several edges counts once
    unique_pc = np.unique(np.array((tmp_ts.edges_parent, tmp_ts.edges_child)), axis=1)
    node_ids, arities = np.unique(unique_pc[0], return_counts=True)
    ar = arities / max_arity
    ax.scatter(arities, tmp_ts.nodes_time[node_ids], alpha=np.where(ar > 1, 1, ar))
    ax.set_xscale("log")
    ax.set_ylabel(f"time of {label} node")
    ax.set_xlim(3, 7000)

axes[1].set_xlabel("number of node children")

[image: arity vs node time, Viridian ARG (top) vs old ARG (bottom)]

hyanwong commented 2 weeks ago

Here are the plots for the "wide" equivalent. Sadly it does seem to be persisting even with greater sampling density.

[image: arity vs node time for the "wide" equivalents]

hyanwong commented 2 weeks ago

I can't for the moment think of any easy tweaks that would result in a better breaking-up of polytomies, other than re-running the matching over the time periods of interest with fewer excluded samples, but perhaps there are other things we could do? As there are only a few of these huge 1,000+ child polytomies, it may be worth zooming in on one or two and seeing what could have broken them up.

jeromekelleher commented 2 weeks ago

We could try widening the retrospective window to 60 days?

It would be good to identify a specific example here so I can look back through the log and see if we could have done better.

hyanwong commented 2 weeks ago

I updated the plot to show node numbers. The most egregious examples are these nodes in the long ARG:

70 203014 555496 583808 586814

Is that enough to identify the examples? I'm assuming you can check that these node IDs all have >3000 unique children in the new long ARG?
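For checking child counts like this, a minimal sketch of the same np.unique approach as the plot above, written as a standalone function over tskit-style edges_parent/edges_child arrays (the tiny synthetic arrays below are just for illustration):

```python
import numpy as np

def unique_child_counts(edges_parent, edges_child):
    """Count distinct children per parent node, collapsing a child that
    appears on several edges (different genome intervals) to one."""
    pc = np.unique(np.array((edges_parent, edges_child)), axis=1)
    parents, counts = np.unique(pc[0], return_counts=True)
    return dict(zip(parents.tolist(), counts.tolist()))

# Synthetic example: node 0 has children 2 and 3 (child 3 appears on two
# edge intervals, so it counts only once), node 1 has one child.
parent = np.array([0, 0, 0, 1])
child = np.array([2, 3, 3, 4])
print(unique_child_counts(parent, child))  # {0: 2, 1: 1}
```

Against the real ARG you'd pass `ts.edges_parent` and `ts.edges_child` and look up the node IDs listed above.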

hyanwong commented 2 weeks ago

> We could try widening the retrospective window to 60 days?

That could make a difference, although I expect that the nodes which could potentially break the large polytomies are quite close in time to the polytomies themselves, so I wouldn't be surprised if we still ended up with substantial polytomies.

jeromekelleher commented 2 weeks ago

I think we're getting bigger polytomies because the data quality is better. Using this code I've had a look at the distributions of branch lengths and numbers of mutations on those nodes (I'll add it to the TreeInfo at some point):

import matplotlib.pyplot as plt
import sc2ts

def examine_polytomy(focal):
    tree = ts.first()
    node_mutations = {}

    for v in tree.children(focal):
        edge = ts.edge(tree.edge(v))
        # assert edge.left == 0 and edge.right == ts.sequence_length
        node_mutations[v] = sc2ts.node_mutation_descriptors(ts, v)

    mutations_above_focal = len(sc2ts.node_mutation_descriptors(ts, focal))
    num_reversions = 0
    num_mutations = 0
    for mut_map in node_mutations.values():
        for mut_id in mut_map.values():
            num_mutations += 1
            num_reversions += ti.mutations_is_reversion[mut_id]

    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
    fig.suptitle(f"Polytomy details for node {focal}. Has {mutations_above_focal} mutations above, "
                 f"{num_mutations} below ({num_reversions} reversions)")
    ax[0].hist([len(muts) for muts in node_mutations.values()], range(7), rwidth=0.7, align="left")
    ax[0].set_xlabel("Number of mutations on child branches")
    ax[1].hist([tree.branch_length(v) for v in tree.children(focal)], range(250), label="all")
    ax[1].hist([tree.branch_length(v) for v in tree.children(focal) if len(node_mutations[v]) == 0],
               range(250), label="exact matches")
    ax[1].set_xlabel("Length of child branches")
    ax[1].legend()

Here's what we get for node 70. Notice that for 2500 of the children there are zero mutations: there are 2500 samples in the dataset that are exact matches of that node. That's not an "egregious" polytomy, that's just a reflection of the actual thing that happened. Of the remaining children, over 1000 have one mutation, all of which are unique.

There's possibly some more mutation sharing in the other children, but not a great deal I would guess, because the mutation overlap heuristics will have had a lot of goes at fixing this up.

So, in my opinion node 70 is correctly reflecting biological reality by having a large number of children and there's nothing to be done.

[image: polytomy details for node 70]

jeromekelleher commented 2 weeks ago

Node 203014 is a bit more interesting:

[image: polytomy details for node 203014 (AY.4)]

Pretty odd that the child branch length distribution should be centered around 150 days, right? This must be a time traveller?

Anyway, the reason that the polytomy is large is mainly because there's a lot of exact or very close matches.

hyanwong commented 2 weeks ago

> So, in my opinion node 70 is correctly reflecting biological reality by having a large number of children and there's nothing to be done.

Yes, thanks for looking into that. That does seem very conclusive in that case.

jeromekelleher commented 2 weeks ago

I'm not sure what to make of these three:

[images: polytomy details for nodes 586814, 583808 and 555496]

In all these cases anyway, it's the number of exact matches and unique one mutation differences that dominates, so I don't think the size of a polytomy tells us very much. Large polytomies just flag large outbreaks.

Probably what's more interesting is how we're doing on branches with lots of mutations leading to major clades. That's a likely place that a lack of resolution could have an impact.

jeromekelleher commented 2 weeks ago

Digging a bit more into the interesting one above, it does look suspiciously like a bunch of correlated time travellers have tripped us up.

First seen 2021-01-09

hmm_cost=27.0 ERR6805163 2021-01-09 AY.4 path=(0:29904, 16853) mutations(27)=35C>T,2317A>C,4181G>T...

Added on 2021-01-10

2024-09-09 23:18:01,459 [222851] DEBUG    sc2ts.inference: Adding group of 20 with path=[PathSegment(left=0, right=29904, parent=16853)] and reversions=()
2024-09-09 23:18:01,461 [222851] DEBUG    sc2ts.inference: Adding ERR7698572:2020-12-24 with 38 mutations
2024-09-09 23:18:01,462 [222851] DEBUG    sc2ts.inference: Adding ERR7716907:2021-01-01 with 30 mutations
2024-09-09 23:18:01,462 [222851] DEBUG    sc2ts.inference: Adding SRR17623486:2021-01-01 with 28 mutations
2024-09-09 23:18:01,462 [222851] DEBUG    sc2ts.inference: Adding SRR22986644:2021-01-02 with 32 mutations
2024-09-09 23:18:01,462 [222851] DEBUG    sc2ts.inference: Adding SRR19511853:2021-01-03 with 32 mutations
2024-09-09 23:18:01,462 [222851] DEBUG    sc2ts.inference: Adding SRR19511864:2021-01-03 with 32 mutations
2024-09-09 23:18:01,462 [222851] DEBUG    sc2ts.inference: Adding SRR20818149:2021-01-04 with 31 mutations
2024-09-09 23:18:01,462 [222851] DEBUG    sc2ts.inference: Adding ERR8234995:2021-01-08 with 30 mutations
2024-09-09 23:18:01,463 [222851] DEBUG    sc2ts.inference: Adding ERR6805122:2021-01-09 with 30 mutations
2024-09-09 23:18:01,463 [222851] DEBUG    sc2ts.inference: Adding ERR6805222:2021-01-09 with 29 mutations
2024-09-09 23:18:01,463 [222851] DEBUG    sc2ts.inference: Adding ERR6805162:2021-01-09 with 29 mutations
2024-09-09 23:18:01,463 [222851] DEBUG    sc2ts.inference: Adding ERR6805099:2021-01-09 with 31 mutations
2024-09-09 23:18:01,463 [222851] DEBUG    sc2ts.inference: Adding ERR6805163:2021-01-09 with 27 mutations
2024-09-09 23:18:01,463 [222851] DEBUG    sc2ts.inference: Adding ERR6805117:2021-01-09 with 27 mutations
2024-09-09 23:18:01,463 [222851] DEBUG    sc2ts.inference: Adding ERR6805223:2021-01-09 with 30 mutations
2024-09-09 23:18:01,463 [222851] DEBUG    sc2ts.inference: Adding ERR6805221:2021-01-09 with 29 mutations
2024-09-09 23:18:01,463 [222851] DEBUG    sc2ts.inference: Adding ERR6805134:2021-01-09 with 35 mutations
2024-09-09 23:18:01,464 [222851] DEBUG    sc2ts.inference: Adding ERR6805230:2021-01-09 with 26 mutations
2024-09-09 23:18:01,464 [222851] DEBUG    sc2ts.inference: Adding ERR6805136:2021-01-09 with 30 mutations
2024-09-09 23:18:01,464 [222851] DEBUG    sc2ts.inference: Adding SRR16571718:2021-01-09 with 32 mutations
2024-09-09 23:18:01,568 [222851] DEBUG    sc2ts.inference: Path (PathSegment(left=0, right=29904, parent=16853),): samples=20 depth=2 mutations=209 reversions=() attach_nodes=[203014]

So the strains are

strains = [
    "ERR7698572",
    "ERR7716907",
    "SRR17623486",
    "SRR22986644",
    "SRR19511853",
    "SRR19511864",
    "SRR20818149",
    "ERR8234995",
    "ERR6805122",
    "ERR6805222",
    "ERR6805162",
    "ERR6805099",
    "ERR6805163",
    "ERR6805117",
    "ERR6805223",
    "ERR6805221",
    "ERR6805134",
    "ERR6805230",
    "ERR6805136",
    "SRR16571718",
]

import pandas as pd

data = []
for strain in strains:
    data.append(ts.node(ti.strain_map[strain]).metadata)

df = pd.DataFrame(data)
df

What looks most suspicious about these is that we have a very mixed set of Pango lineages associated with the 20 strains that form the retrospective group:

>>> collections.Counter(df.Viridian_pangolin)
Counter({'AY.103': 2,
         'AY.126': 2,
         'AY.43': 2,
         'AY.42': 2,
         'AY.5': 1,
         'AY.4.2': 1,
         'AY.99.1': 1,
         'AY.119.1': 1,
         'AY.44': 1,
         'AY.4.2.2': 1,
         'AY.98.1': 1,
         'AY.124': 1,
         'AY.4': 1,
         'AY.122': 1,
         'AY.23.2': 1,
         'B.1.617.2': 1})

So, I'm inclined to think we have an unlucky combination of time travellers that happen to occur within the time window.

I think it's important that we try to filter these out a bit better, so any thoughts on how we can improve the retrospective heuristics would be much appreciated. Currently we add a batch of samples retrospectively if:

The last clause was added to catch the case where a bunch of samples are all misdated together, so you get a large bunch of time travellers on the same day.

Should we also add something about the different Pango lineages we observe as a filter?
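One simple form of that filter (a sketch only: the distinct-lineage count and the threshold value are my assumptions, not an existing sc2ts option) would be to reject a retrospective group whose members span too many Pango lineages:

```python
from collections import Counter

def lineage_diversity_ok(lineages, max_distinct=3):
    """A group of genuinely related samples should mostly share one
    lineage; many distinct labels suggests bundled time travellers."""
    return len(Counter(lineages)) <= max_distinct

# The group above had 16 distinct lineages among its 20 samples
group = ["AY.103"] * 2 + ["AY.126"] * 2 + ["AY.43"] * 2 + ["AY.42"] * 2 + [
    "AY.5", "AY.4.2", "AY.99.1", "AY.119.1", "AY.44", "AY.4.2.2",
    "AY.98.1", "AY.124", "AY.4", "AY.122", "AY.23.2", "B.1.617.2",
]
print(lineage_diversity_ok(group))          # False: 16 distinct lineages
print(lineage_diversity_ok(["AY.4"] * 20))  # True: a single lineage
```

As noted below, though, this leans on the Viridian/UShER labels rather than the data itself.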

szhan commented 2 weeks ago

What if we just see if the sample sequences cluster by mutations directly instead of relying on the Viridian Pangolin labels (which kind of means we are cheating a bit as we are using signals in the Viridian UShER tree)?

Time travellers bundled together in a retrospective group should likely have different mutational signatures, which is what you are saying above @jeromekelleher (I think?).

I tried to look at the mutations in the node metadata for the strains, but could only find eight of them by iterating through the nodes. (We are looking at viridian-full-mm_3-mhc_5-mgs_20-2021-03-21.ts.tsz right?). All the mutations are different.

I suppose if we were to examine a "good" retrospective group, then we should expect to see that there is a set of mutations that are shared. If this is the case, then I guess we could think about thresholds.
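A rough sketch of that check (a hypothetical helper, treating each sample's mutations as a plain set of variant strings) would be to measure how many mutations pairs of samples in the group share; a "good" group should score well above zero, bundled time travellers near zero:

```python
from itertools import combinations

def mean_pairwise_shared(mutation_sets):
    """Average number of mutations shared between pairs of samples."""
    pairs = list(combinations(mutation_sets, 2))
    if not pairs:
        return 0.0
    return sum(len(a & b) for a, b in pairs) / len(pairs)

# A coherent outbreak: every sample carries the same three defining
# mutations plus one private mutation each
coherent = [{"A1T", "C2G", "G3A", f"X{i}"} for i in range(4)]
# Bundled time travellers: no mutations shared at all
travellers = [{f"M{i}a", f"M{i}b"} for i in range(4)]
print(mean_pairwise_shared(coherent))    # 3.0
print(mean_pairwise_shared(travellers))  # 0.0
```

A threshold on this statistic (or something like it) could then gate whether the retrospective group is inserted.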

jeromekelleher commented 2 weeks ago

That is a good point - we needed 209 mutations to add this group of 20 with a tree of depth 2. Tree building did not go well in this case.

So we could try getting some threshold on tree 'quality' before insertion?
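Concretely, that could be something like a mutations-per-sample cutoff on the group's attachment tree (a sketch; the cutoff value here is a guess, not a tuned parameter). The group above needed 209 mutations for 20 samples, over 10 per sample, where a coherent outbreak should need far fewer:

```python
def group_tree_quality_ok(num_mutations, num_samples, max_per_sample=5.0):
    """Reject retrospective groups whose tree needs too many mutations
    per sample - a sign the samples are not really related."""
    return num_mutations / num_samples <= max_per_sample

print(group_tree_quality_ok(209, 20))  # False: 10.45 mutations per sample
print(group_tree_quality_ok(60, 20))   # True: 3 per sample
```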

jeromekelleher commented 1 week ago

We could also just try making the required group size larger. We were unlucky here in that we got about 20 time travellers on the same day, within a few days of a bunch of others.

I think I'll try setting the required group size to 100 and set the window to like 20 days, see if this smooths out the branch length distribution.

I'd much rather lose a bit of resolution at the start of a clade, than have it too early and muddled with much later outbreaks (which is what's happening here, I think)