jeromekelleher / sc2ts-paper

3 stars 5 forks source link

To-dos for MRCA figure #3

Closed hyanwong closed 1 year ago

hyanwong commented 1 year ago

Just to remind myself:

jeromekelleher commented 1 year ago

Label dotted lines by the % age of e.g. B.1 or BA.2 lineages that pass through them. The lineage ID to be taken from the most common ID among all their children.

Can use the imputed pango lineage of the node itself.

hyanwong commented 1 year ago

To find e.g. the %age of B.1 samples that have node 179 in their ancestry, I guess I would have to (a) find the sample ids of all B.1 samples, (b) traverse up through all the trees looking for node 179 (but I could stop once the parent was older than 179).

That's going to be very tedious to do for all the trees, I think, @jeromekelleher ? Is there a faster way, since most of the trees will not differ from each other very much? Is any of this retained in the TreeInfo class?

hyanwong commented 1 year ago

Ah, I guess I could use the tracked_samples feature for ts.trees() to help with this. Then I could relatively quickly find e.g. the maximum number of B.1 samples in any one tree, although I think I'd still find it hard to get the total number of B.1 descendant samples over all the trees?

hyanwong commented 1 year ago

Actually, this does it relatively quickly: simplify down to the required samples and then count them over the trees. The slow bit is actually identifying the samples of each lineage type:

focal_nodes = (179, 253063, 559612, 519829, 1)
focal_lineages = {u: ts_long.node(u).metadata['Imputed_lineage'] for u in focal_nodes}
sample_lists = {u: [] for u in focal_nodes}
for nd in tqdm.tqdm(ts_long.nodes()):
    if nd.is_sample():
        lineage = nd.metadata.get('Nextclade_pango', "")
        for k, v in focal_lineages.items():
            if lineage == v:
                sample_lists[k].append(nd.id)
for focal_node, samples in sample_lists.items():
    sts, node_map = ts_long.simplify(samples, map_nodes=True, keep_unary=True)
    ok = np.zeros(sts.num_samples, dtype=bool)
    assert sts.samples().max() == sts.num_samples - 1
    for tree in sts.trees():
        for u in tree.samples(node_map[focal_node]):
            ok[u] = True
    print(
        f"Recombination node {focal_node}: {np.sum(ok) / len(ok) * 100:.1f}% of",
        f"{len(samples)} {focal_lineages[focal_node]} samples"
    )

This gives (in descending order of number of times the node is an MRCA)

100%|██████████| 783231/783231 [00:11<00:00, 66221.21it/s]
Recombination node 179: 99.6% of 12224 B.1 samples
Recombination node 253063: 66.3% of 12829 B.1.617.2 samples
Recombination node 559612: 98.4% of 58772 BA.2 samples
Recombination node 519829: 0.1% of 8407 B.1.1 samples
Recombination node 1: 100.0% of 1605 B samples

So there's something odd about 519829, which is labelled as B.1.1 but which only has a very few B.1.1 descendants. This could be a badly imputed lineage allocation, I guess. Also not quite as odd, but node 253063 is only 2/3rds of the B.1.617.2 samples.

For reference, node 179 accounts for 794 MRCAs, node 253063 for 170, node 559612 for 129, node 519829 for 115, and node 1 for 98. Node 519829, to which 129 recombinant parent pairs trace, has only 4 children, so is not a big polytomy, and I'm puzzled as to why it's the MRCA of so many recombinants.

jeromekelleher commented 1 year ago

Why not use num_tracked_samples and average that out across the trees?

span_weighted_samples = np.zeros(ts.num_trees)
for tree in ts.trees(tracked_samples=ti.self.pango_lineage_samples["B.1"]):
      count = tree.get_num_tracked_samples(focal_node)
      span_weighted_samples[tree.index] = count * tree.span
return np.sum(span_weighted_samples) / ts.sequence_length

should be quick?

hyanwong commented 1 year ago

I could do that, but it wouldn't be reporting the total of all the samples under a node, but the average number over the trees, which could be quite different (I guess?). It's also harder to explain that in the text, as you start having to talk about per-tree measures, like we had to for the max_descendants thing.

The code above runs in 2 or 3 seconds or so on my laptop, so I'd prefer to quote the more easily explained "proportion of B.1 samples that have this node in their ancestry", rather that "average proportion of B.1 nodes that have this node in their ancestry in each tree, weighted by the span of the tree"

jeromekelleher commented 1 year ago

Interesting results - so 559612 is the root of a large outbreak as expected, but 253063 and 519829 are not. Looks like 519829 might have a badly imputed lineage status, it's worth further examination.

I think node 179 is a pivotal early node that got some of the key mutations, it wouldn't be surprising if that was the MRCA of lots of things.

jeromekelleher commented 1 year ago

The code above runs in 2 or 3 seconds or so on my laptop, so I'd prefer to quote the more easily explained "proportion of B.1 samples that have this node in their ancestry", rather that "average proportion of B.1 nodes that have this node in their ancestry in each tree, weighted by the span of the tree"

If you like - I'd like to confirm the numbers are basically the same though, as a reassurance that the code is correct.

hyanwong commented 1 year ago

I'd like to confirm the numbers are basically the same though, as a reassurance that the code is correct.

Yep, good call:

focal_nodes = (179, 253063, 559612, 519829, 1)
focal_lineages = {u: ts_long.node(u).metadata['Imputed_lineage'] for u in focal_nodes}
for focal_node, lin in focal_lineages.items():
    span_weighted_samples = np.zeros(ts.num_trees)
    tracked = ti.pango_lineage_samples[lin]
    for tree in ts.trees(tracked_samples=tracked):
          count = tree.get_num_tracked_samples(focal_node)
          span_weighted_samples[tree.index] = count * tree.span
    av_per_tree = np.sum(span_weighted_samples) / ts.sequence_length
    print(
        f"Recombination node {focal_node}: {av_per_tree / len(tracked) * 100:.1f}% of",
        f"{len(tracked)} {lin} samples"
    )

giving, as expected, almost the same (but trivially lower) percentages:

Recombination node 179: 99.6% of 12224 B.1 samples
Recombination node 253063: 65.8% of 12829 B.1.617.2 samples
Recombination node 559612: 98.3% of 58772 BA.2 samples
Recombination node 519829: 0.1% of 8407 B.1.1 samples
Recombination node 1: 100.0% of 1605 B samples
jeromekelleher commented 1 year ago

Interesting, 519829 is worth looking at in more detail. So most of it's children can't be B.1.1 then, I guess? It's quite odd that such an old lineage has such a high node number, it must have been added quite recently as a result of some tree building. What's the time distribution of its children?

I think 253063 is just something quite messy, and we're unlikely to get a straightforward story.

Summarising the children of these nodes would be good anyway. What is the (a) pango lineage distribution? (b) the time distribution?

hyanwong commented 1 year ago

Interesting, 519829 is worth looking at in more detail. So most of it's children can't be B.1.1 then, I guess?

It only has 4 immediate children, 2 of which are recombinants, and 2 of which are B.1.1. Here are the grandchildren lineages: all different!

Recombinant 
B.1.621 
B.1 
B.1.617.2 
B.1.1 
B.1.1.409 
Recombinant 
jeromekelleher commented 1 year ago

I dug in a bit, and I think that 519829 is coming out as a result of the BA.1 outbreak. We should be a bit suspect about the time here because there's a lot of parsimony adjustment going on, but since it's anchored by a sample at 240954 it seems OK. Does it make sense for that sample to be a B.1.1 in March 2021?

The node at the top of the list here is the ancestor of 99% of BA.1 samples.

BA 1

hyanwong commented 1 year ago

Hmm, with the corrected seek_index MRCA code, the top 5 have now changed, although the pattern is still largely the same. 519829 is now the fifth most common (rather than 4th most common) MRCA:

most common parent MRCA has id 179 (imputed: B.1) @ time=861.0; num_children=512, av. arity=512.0
second most common parent MRCA has id 253063 (imputed: B.1.617.2) @ time=454.125; num_children=355, av. arity=354.10490235420014
third most common parent MRCA has id 5868 (imputed: B.1) @ time=860.5; num_children=22, av. arity=21.455658105939005
fourth most common parent MRCA has id 559612 (imputed: BA.2) @ time=198.3; num_children=2872, av. arity=2871.8871722846443
fifth most common parent MRCA has id 519829 (imputed: B.1.1) @ time=431.2; num_children=4, av. arity=3.553036383092563

I think the 5868 is just a recombinant node slightly below node 179, so a medium-sized polytomy nested within the huge one at 179. We can probably skip plotting it, TBH.

jeromekelleher commented 1 year ago

Note that 568904 has 24 mutations, and there's only one reversion in the whole path to root here, so this feels like somewhere that tree building is working quite well. It seems odd to call it a B.1.1, so I wonder if something funny is happening with lineage imputation here all right.

hyanwong commented 1 year ago

I dug in a bit, and I think that 519829 is coming out as a result of the BA.1 outbreak. We should be a bit suspect about the time here because there's a lot of parsimony adjustment going on, but since it's anchored by a sample at 240954 it seems OK.

So either we misimputed the lineage, or this is an earlier node that is piggybacking off a slightly downstream BA.1 explosion, but for some reason lots of recombinants trace back to two separate children of this node?

I presume BA.1 has some defining mutations. Are any of these present in the ancestry of 519829?

Does it make sense for that sample to be a B.1.1 in March 2021?

I don't know. Sorry. @szhan might?

jeromekelleher commented 1 year ago

So either we misimputed the lineage, or this is an earlier node that is piggybacking off a slightly downstream BA.1 explosion, but for some reason lots of recombinants trace back to two separate children of this node?

I think it's probably the effective root of BA.1, and the lineage imputation is a bit funny. We'd need Ana or @szhan to comment on the specific lineage defining mutations here. I wonder if the imputation is being thrown off by 240954 which might be misassigned.

hyanwong commented 1 year ago

There's the question of why, with only 4 children, so many recombination nodes (98 of them) trace their MRCA back to this node. It must be that 2 of the children of this node are sufficiently different (or lead to sufficiently large numbers of descendants) that recombination paths trace back to 519829.

I guess the number of descendants of each of the children of 519829 would be useful:

child node 582335 (B.1.1): 192910 descendants
child node 544949 (itself a recombinant): 127230 descendants
child node 563190: 2 descendants
child node 574327: 1 descendant

So perhaps it's a common MRCA because it has 2 children each of which led to a (separate) outbreak? Or a separate part of the same outbreak?

hyanwong commented 1 year ago

I guess the number of descendants of each of the children of 519829 would be useful:

It would be useful to know how much overlap there is between the descendants of child 582335 and the descendants of child 544949. If there is very little overlap, then we have two separate highly successful strains.

hyanwong commented 1 year ago

If there is very little overlap, then we have two separate highly successful strains.

But actually, all the descendants of 544949 are shared with 582335. So it's not two separate clades, but one nested inside another:

c1 = set()
c2 = set()

for tree in tqdm.tqdm(ts_long.trees()):
    c1.update(tree.samples(582335))
    c2.update(tree.samples(544949))
print(len(c1), len(c2), len(c1.intersection(c2)), len(c2-c1))
193290 127267 127267 0