jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License

Ways to correct "myopic" tree building #77

Open hyanwong opened 1 year ago

hyanwong commented 1 year ago

We currently miss opportunities to create nodes that group samples by shared mutations.

For example, in the current tree build, we ignore a shared mutation at position 9389 linking "Singapore/883/2020" with "Bahrain/920845574/2021". Nextstrain uses this mutation to group these B.1.1 samples together. Here's the viz code for testing purposes.

import tszip

# Inspect the B.1.1 polytomy:
base_sc2_time = "2021-06-30"  # equivalent of day 0 in the sc2_ts file
sc2_ts = tszip.decompress(f"../results/upgma-full-md-30-mm-3-{base_sc2_time}.ts.tsz")

# Strains to keep when simplifying the tree sequence
keep = {
    "Bahrain/920858679/2021",
    "Singapore/883/2020",
    "Ireland/D-NVRL-71IRL07343/2020",
    "Bahrain/920845574/2021",
}

# Simplify down to the sample nodes whose strain name is in `keep`
B11 = sc2_ts.simplify(
    [n.id for n in sc2_ts.nodes() if n.is_sample() and n.metadata.get("strain", "") in keep]
)

# Draw the tree at position 9389, labelling only the mutations at that site
B11.at(9389).draw_svg(
    size=(1000, 300),
    node_labels={n.id: n.metadata.get("strain", "") for n in B11.nodes()},
    time_scale="rank",
    mutation_labels={
        m.id: f"{s.position:g}: {s.ancestral_state}→{m.derived_state}"
        for s in B11.sites()
        if s.position == 9389
        for m in s.mutations
    },
)

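One way to look for missed grouping opportunities like this is to list the mutations carried by some, but not all, of the children of a polytomy. The following is only a rough sketch in plain Python, not how sc2ts works: `shared_mutation_groups` is a hypothetical helper, the input is a hand-written mapping from strain name to mutated positions, and every position other than 9389 is made up for illustration.

```python
from collections import defaultdict


def shared_mutation_groups(mutations_by_sample):
    """Return {position: sorted carriers} for mutations seen in at least two,
    but not all, of the samples. Hypothetical helper, not part of sc2ts."""
    carriers = defaultdict(set)
    for sample, positions in mutations_by_sample.items():
        for pos in positions:
            carriers[pos].add(sample)
    n = len(mutations_by_sample)
    return {
        pos: sorted(samples)
        for pos, samples in carriers.items()
        if 1 < len(samples) < n
    }


# Toy input: only position 9389 comes from the issue; 100, 200 and 300 are invented.
polytomy_children = {
    "Singapore/883/2020": {9389, 100},
    "Bahrain/920845574/2021": {9389, 200},
    "Bahrain/920858679/2021": {300},
    "Ireland/D-NVRL-71IRL07343/2020": set(),
}

groups = shared_mutation_groups(polytomy_children)
print(groups)
```

Each entry in `groups` is a candidate for a new internal node below the polytomy. In a real tree sequence the per-sample mutations would have to be collected from the mutations directly above each child of the polytomy.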

hyanwong commented 1 year ago

Also worth investigating is the placement of Kuwait/JAH3090859/2021, which is an outgroup to the other Alpha variants in the Nextstrain tree but is nested quite deeply within the Alpha cluster in all the sc2ts trees.
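To make the "outgroup vs nested" comparison concrete, one possible check is whether the sample descends from the most recent common ancestor (MRCA) of the other samples. This is a minimal sketch on toy trees stored as child-to-parent dictionaries. The helper names (`ancestors`, `mrca`, `is_outgroup`) and the node labels are all hypothetical, and neither toy tree is taken from Nextstrain or sc2ts output.

```python
def ancestors(node, parent):
    """List the ancestors of `node`, from its parent up to the root."""
    path = []
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def mrca(nodes, parent):
    """Most recent common ancestor of `nodes` in a child -> parent map."""
    nodes = list(nodes)
    # Candidate ancestors, ordered from most recent to oldest
    common = [nodes[0]] + ancestors(nodes[0], parent)
    for n in nodes[1:]:
        lineage = {n, *ancestors(n, parent)}
        common = [a for a in common if a in lineage]
    return common[0]


def is_outgroup(sample, ingroup, parent):
    """True if `sample` does not descend from the MRCA of `ingroup`."""
    return mrca(ingroup, parent) not in ancestors(sample, parent)


# Toy tree where K branches off before the ingroup (a1, a2)
outgroup_tree = {"K": "root", "x": "root", "a1": "x", "a2": "x"}
# Toy tree where K sits inside the clade spanned by a1 and a2
nested_tree = {"a1": "root", "y": "root", "K": "y", "a2": "y"}

print(is_outgroup("K", ["a1", "a2"], outgroup_tree))
print(is_outgroup("K", ["a1", "a2"], nested_tree))
```

On real data the parent map would need to be built from a single tree of the tree sequence, and the ingroup would be the other Alpha samples.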