jeromekelleher / sc2ts-paper


Choose example of multiple-origin Pango recombinant lineage #55

Closed hyanwong closed 1 year ago

hyanwong commented 1 year ago

We discussed using either XC, XD, XAB (because it is a major component of the XAG plot we are showing), or XS (a deltacron variant).

hyanwong commented 1 year ago

Here's XC (63 nodes in the long subgraph of which 6 are XC):

image

hyanwong commented 1 year ago

Here's XD (22 nodes):

image

hyanwong commented 1 year ago

Here's XAB (185 nodes!):

image

hyanwong commented 1 year ago

And finally XS (55 nodes):

image

hyanwong commented 1 year ago

Seems like XD is the easiest to describe?

jeromekelleher commented 1 year ago

What do we get when we exclude single sample recombination nodes? (See latest version of manuscript)

hyanwong commented 1 year ago

What do we get when we exclude single sample recombination nodes?

Well, you can imagine what would happen e.g. in the XD plot. But I think it would be a bit hard to do in principle because then we remove bits of the topology. E.g. in the XD plot it would remove the recombination node to the left of the right-hand XD cluster of 2 nodes, but then would we attach sample tsk584939 directly to the recombination node above the XD nodes? Or just pretend it wasn't there?

I guess we could produce a new ARG with all the samples directly below a recombination node removed (and the recombination node removed too). Then we could plot stuff using that. The justification for omitting the samples would perhaps be that they could be artifactual.

hyanwong commented 1 year ago

I guess we could produce a new ARG with all the samples directly below a recombination node removed (and the recombination node removed too)

Here's some code to do that. If we use this, we should do the same for all the node graphs, I guess (edit - corrected from https://github.com/jeromekelleher/sc2ts-paper/issues/57#issuecomment-1511625421)

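import numpy as np
import sc2ts


def remove_singleton_samples_below_re_nodes(ts):
    """Drop sample leaves that are the only child of a recombination node."""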
    is_sample_leaf = np.zeros(ts.num_nodes, dtype=bool)
    is_sample_leaf[ts.samples()] = True
    is_sample_leaf[ts.edges_parent] = False
    # parent IDs of sample leaves
    sample_leaf_parents = ts.edges_parent[np.isin(ts.edges_child, np.flatnonzero(is_sample_leaf))]
    # get repeated parent IDs, one for each edge leading to the parent
    sample_leaf_parents = ts.edges_parent[np.isin(ts.edges_parent, sample_leaf_parents)]
    sample_leaf_parents, counts = np.unique(sample_leaf_parents, return_counts=True)
    single_sample_leaf_parents = sample_leaf_parents[counts == 1]
    # Find the ones that are also recombination nodes
    re_nodes = np.flatnonzero(ts.nodes_flags & sc2ts.NODE_IS_RECOMBINANT)
    single_sample_leaf_re = np.intersect1d(re_nodes, single_sample_leaf_parents)
    bad_samples = ts.edges_child[np.isin(ts.edges_parent, single_sample_leaf_re)]
    # All of these should be samples, because they were defined via single edges above a sample
    assert len(np.setdiff1d(bad_samples, ts.samples())) == 0
    keep = np.setdiff1d(ts.samples(), bad_samples)
    # GISAID EPI ids in metadata refer to node numbers, so we need to keep the same
    # numbering using filter_nodes=False
    return ts.simplify(keep, keep_unary=True, filter_nodes=False)
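
To make the selection step easier to check by eye, here is a minimal pure-Python sketch of the same idea on a toy edge list. It is not the sc2ts or tskit API: the function name, the `(parent, child)` tuple format, and the node IDs are all made up for illustration, and it assumes each parent-child pair appears only once (a real tree sequence can have several edges per pair, one per genomic interval).

```python
from collections import Counter


def singleton_samples_below_re_nodes(edges, samples, re_nodes):
    """Return samples that are the only child of a recombination node.

    edges: list of (parent, child) pairs (hypothetical toy format)
    samples: set of sample node IDs
    re_nodes: set of recombination node IDs
    """
    parents = {p for p, _ in edges}
    # Sample leaves are samples that never appear as a parent
    sample_leaves = set(samples) - parents
    # Number of child edges below each parent
    num_child_edges = Counter(p for p, _ in edges)
    return sorted(
        c
        for p, c in edges
        if c in sample_leaves and p in re_nodes and num_child_edges[p] == 1
    )


# Toy graph: nodes 2 and 3 are recombination nodes with parents 0 and 1
edges = [
    (0, 2), (1, 2),  # node 2 has two parents
    (0, 3), (1, 3),  # node 3 has two parents
    (2, 4),          # node 2 has a single child, sample 4
    (3, 5), (3, 6),  # node 3 has two children, samples 5 and 6
    (0, 7),          # sample 7 hangs off an ordinary node
]
samples = {4, 5, 6, 7}
re_nodes = {2, 3}

bad = singleton_samples_below_re_nodes(edges, samples, re_nodes)
keep = sorted(samples - set(bad))
print(bad, keep)
```

Only sample 4 is flagged, because node 2 is a recombination node with exactly one child edge. Samples 5 and 6 share recombination node 3, so they are kept, as is sample 7.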

image

jeromekelleher commented 1 year ago

That looks like an interesting example - shows three parents in both cases, and we're not distracted by possible errors etc by the singletons?

hyanwong commented 1 year ago

That looks like an interesting example - shows three parents in both cases, and we're not distracted by possible errors etc by the singletons?

It's the simplest one I could find for Pango X lineages with "simple" multiple origins. Slightly annoying that both are 3-parent recombinations, but I think this is because the BA.1.15 or BA.1.17 insertions are in the middle of the genome.