jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License

Classifying recombination nodes #95

Open hyanwong opened 1 year ago

hyanwong commented 1 year ago

We know that a number of our recombination nodes are erroneous, and a number are probably correct. In particular, the ones with an Xyy Pango designation are likely correct.

We should collect a number of metrics on nodes, and possibly run something like a PCA, marking the known "Xyy" points on the plot, to see if the known correct recombinants cluster.
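As a rough illustration of the PCA idea above, here is a minimal standard-library sketch that z-scores a table of per-node metrics and projects each node onto the leading principal component, found by power iteration. The metric values and the `is_pango_x` flags are made up for illustration, and a real analysis would more likely use a numerical library and look at at least two components.

```python
import math
import statistics


def standardise(rows):
    """Z-score each column so metrics on different scales are comparable."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # pstdev is 0 for a constant column; fall back to 1 to avoid dividing by zero.
    sds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def covariance(rows):
    """Covariance matrix of rows whose columns are already centred."""
    n, k = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(k)] for i in range(k)]


def leading_component(cov, iterations=500):
    """Leading eigenvector of a symmetric matrix by power iteration."""
    k = len(cov)
    # Uneven start vector, to make it unlikely that we begin orthogonal
    # to the leading eigenvector.
    v = [1.0 / (i + 1) for i in range(k)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


# Hypothetical rows, one per recombination node: (d_node, d_time, d_mut).
metrics = [(2, 10.0, 1), (3, 14.0, 2), (9, 80.0, 12), (11, 95.0, 15), (4, 20.0, 3)]
# Hypothetical flags marking nodes that map to a known Xyy lineage.
is_pango_x = [True, True, False, False, True]

z = standardise(metrics)
pc1 = leading_component(covariance(z))
scores = [sum(a * b for a, b in zip(row, pc1)) for row in z]
for score, known in zip(scores, is_pango_x):
    print(f"{score:+.2f}", "Xyy" if known else "")
```

If the known Xyy nodes end up with similar scores, that would suggest the metrics carry some signal for separating good recombinants from bad ones.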

This issue is to track what metrics we could use. Here are some ideas:

Others? I think @a-ignatieva volunteered to help with this too.

hyanwong commented 1 year ago

Here I classify the recombination nodes by the distance between the recombinant lineages, measured as the number of nodes (d_node), the branch length (d_time), and the number of mutations on the edges constituting each lineage (d_mut). I also include the number of revertant mutations on the edges to each side of the breakpoint (d_rev), and the number of mutations associated with the node itself (mutations, as gleaned from the node metadata).
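To make the distance metrics concrete, here is a toy sketch of one possible reading of d_node, d_time and d_mut: the path between the two parents of a recombinant, going up to their most recent common ancestor and back down. It uses a single hand-written tree with plain dictionaries rather than the real tree sequence, where the two parents apply to different genome intervals and the local trees can differ. All node IDs, times and mutation counts are hypothetical, and counting d_node in edges is an assumption.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lineage_distances(left, right, parent, time, edge_mutations):
    """Distances between two nodes along the path through their MRCA."""
    left_path = path_to_root(left, parent)
    right_path = path_to_root(right, parent)
    right_set = set(right_path)
    mrca = next(n for n in left_path if n in right_set)
    # Nodes strictly below the MRCA on either side; each owns the edge above it.
    below = left_path[: left_path.index(mrca)] + right_path[: right_path.index(mrca)]
    return {
        "d_node": len(below),  # number of edges on the path (assumed definition)
        "d_time": sum(time[parent[n]] - time[n] for n in below),
        "d_mut": sum(edge_mutations.get(n, 0) for n in below),
    }


# Hypothetical tree: child -> parent, with node 0 as the root.
parent = {1: 0, 2: 0, 3: 1, 4: 2}
# Hypothetical node times (older nodes have larger values).
time = {0: 10, 1: 6, 2: 7, 3: 2, 4: 3}
# Hypothetical number of mutations on the edge above each node.
edge_mutations = {1: 2, 2: 3, 3: 1, 4: 0}

# Nodes 3 and 4 stand in for the left and right parents of a recombinant.
result = lineage_distances(3, 4, parent, time, edge_mutations)
print(result)
```

Under this reading, a recombinant whose parents are close relatives gets small values on all three metrics, while one that joins distant parts of the tree gets large ones.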

This is for the small tree sequence, as it is more likely to have decent recombinants. Known Pango recombinants (XB, XA, etc.) are in orange. I stuck the mutation row on a log scale; I can do that for the others too if necessary.

image

hyanwong commented 1 year ago

Here's a pdf with the X lineages labeled in small bold (zoom in to see them): pairplot.pdf

hyanwong commented 1 year ago

I'm not sure about the allocation of nodes to recombinant Pango lineage origins, however. For instance, this is supposedly XB (found by using treeinfo.pango_recombinant_lineages_report()):

Screenshot 2023-02-10 at 17 29 32

But node 206466 (the one I label XB as a result of the above code) has a huge number of mutations in the HMM:

Report for 206466
[{'Imputed_lineage': 'Recombinant',
  'date_added': '2021-03-05',
  'match_info': [{'breakpoints': [0, 22021, 26736, 29904],
    'direction': 'backward',
    'mutations': ['1255A>G',
     '3688C>T',
     '3884C>T',
     '6633C>T',
     '7142A>G',
     '9614A>G',
     '9693C>T',
     '9754A>C',
     '15026C>T',
     '15451G>A',
     '16466C>T',
     '21615T>G',
     '21691C>T',
     '21846C>T',
     '22036A>C',
     '23638C>T',
     '25810T>C',
     '27769C>T',
     '28045C>T',
     '28048G>A',
     '28330A>G',
     '28910A>T'],
    'num_mismatches': 3.0,
    'parents': [10524, 200603, 40628],
    'strain': 'USA/TX-TCH-TCMC03017/2021'},
   {'breakpoints': [0, 23604, 27389, 29904],
    'direction': 'forward',
    'mutations': ['1255A>G',
     '3884C>T',
     '6633C>T',
     '7142A>G',
     '9614A>G',
     '9693C>T',
     '9754A>C',
     '15026C>T',
     '15451G>A',
     '16466C>T',
     '21057C>T',
     '21615T>G',
     '21691C>T',
     '21846C>T',
     '22036A>C',
     '23638C>T',
     '25810T>C',
     '27769C>T',
     '28045C>T',
     '28048G>A',
     '28330A>G',
     '28910A>T'],
    'num_mismatches': 3.0,
    'parents': [12108, 200603, 40628],
    'strain': 'USA/TX-TCH-TCMC03017/2021'}]}]
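For comparing the forward and backward HMM matches in a report like the one above, a small helper along these lines might be useful. It assumes the record has exactly the keys shown in the report and contains both a forward and a backward match. The record in the example is abbreviated: it keeps the breakpoints and parents from the report but only a handful of the mutations.

```python
def summarise_matches(record):
    """Summarise each HMM match and list mutations unique to one direction."""
    by_direction = {m["direction"]: m for m in record["match_info"]}
    summary = {}
    for direction, match in by_direction.items():
        summary[direction] = {
            "num_mutations": len(match["mutations"]),
            # Interior breakpoints only: drop the genome start and end coordinates.
            "num_breakpoints": len(match["breakpoints"]) - 2,
            "parents": match["parents"],
        }
    forward = set(by_direction["forward"]["mutations"])
    backward = set(by_direction["backward"]["mutations"])
    summary["only_forward"] = sorted(forward - backward)
    summary["only_backward"] = sorted(backward - forward)
    return summary


# Abbreviated copy of the record for node 206466 (most mutations omitted).
record = {
    "match_info": [
        {
            "breakpoints": [0, 22021, 26736, 29904],
            "direction": "backward",
            "mutations": ["1255A>G", "3688C>T", "3884C>T"],
            "parents": [10524, 200603, 40628],
        },
        {
            "breakpoints": [0, 23604, 27389, 29904],
            "direction": "forward",
            "mutations": ["1255A>G", "3884C>T", "21057C>T"],
            "parents": [12108, 200603, 40628],
        },
    ]
}

summary = summarise_matches(record)
print(summary)
```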
Screenshot 2023-02-10 at 17 31 58 Screenshot 2023-02-10 at 17 32 09

hyanwong commented 1 year ago

Here's the number of reversions in the children (x) against the (log) number of mutations in the recombinant. I have identified the number of revertants in the children by looking at the mutations associated with the edges down to each of the children of the recombinant.

Screenshot 2023-02-10 at 18 09 38
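One way the reversion count described above could be computed is sketched below. It assumes mutations are written as in the reports in this thread (position, inherited state, `>`, derived state), and it treats a mutation on a child edge as a reversion when it sits at the same position as a mutation on the recombinant and swaps the states back. That definition is an assumption, and the child-edge mutations in the example are hypothetical.

```python
import re

# Matches strings such as "21846C>T": position, inherited state, derived state.
MUTATION = re.compile(r"^(\d+)([A-Z-])>([A-Z-])$")


def parse_mutation(text):
    """Split a mutation string into (position, inherited, derived)."""
    match = MUTATION.match(text)
    if match is None:
        raise ValueError(f"unrecognised mutation string: {text!r}")
    position, inherited, derived = match.groups()
    return int(position), inherited, derived


def count_reversions(node_mutations, child_mutations):
    """Count child-edge mutations that undo a mutation on the recombinant node."""
    on_node = {}
    for text in node_mutations:
        position, inherited, derived = parse_mutation(text)
        on_node[position] = (inherited, derived)
    reversions = 0
    for text in child_mutations:
        position, inherited, derived = parse_mutation(text)
        # A reversion goes from the node's derived state back to its inherited one.
        if on_node.get(position) == (derived, inherited):
            reversions += 1
    return reversions


# Two of the mutations reported for node 206466 in this thread.
node_mutations = ["21846C>T", "22036A>C"]
# Hypothetical mutations on the edges down to the children of that node.
child_mutations = ["21846T>C", "100A>G"]

n_reversions = count_reversions(node_mutations, child_mutations)
print(n_reversions)
```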

It looks like Pango XAK (mapped to node 555388 in the small TS) is a bit of an outlier here, with many revertants in the children. Again, I think this is probably a poorly identified recombinant that treeinfo.pango_recombinant_lineages_report() has misidentified as (one of) the original XAK recombinants.

Report for 555388
[{'Imputed_lineage': 'Recombinant',
  'date_added': '2021-12-14',
  'match_info': [{'breakpoints': [0, 22674, 26060, 29904],
    'direction': 'backward',
    'mutations': ['832T>C',
     '2790C>T',
     '9344C>T',
     '9424A>G',
     '9866C>T',
     '10198C>T',
     '11124T>C',
     '11235T>C',
     '17410C>T',
     '19955C>T',
     '20055A>G',
     '21618C>T',
     '21846T>C',
     '22200T>G',
     '22688A>G',
     '22775G>A',
     '22786A>C',
     '22792C>T',
     '22898A>G',
     '23048A>G',
     '23202A>C',
     '24130A>C',
     '25624C>T',
     '27382G>C',
     '27383A>T',
     '27384T>C'],
    'num_mismatches': 3.0,
    'parents': [544948, 551092, 544948],
    'strain': 'Denmark/DCGC-281594/2021'},
   {'breakpoints': [0, 22674, 26060, 29904],
    'direction': 'forward',
    'mutations': ['832T>C',
     '2790C>T',
     '9344C>T',
     '9424A>G',
     '9866C>T',
     '10198C>T',
     '11124T>C',
     '11235T>C',
     '17410C>T',
     '19955C>T',
     '20055A>G',
     '21618C>T',
     '21846T>C',
     '22200T>G',
     '22688A>G',
     '22775G>A',
     '22786A>C',
     '22792C>T',
     '22898A>G',
     '23048A>G',
     '23202A>C',
     '24130A>C',
     '25624C>T',
     '27382G>C',
     '27383A>T',
     '27384T>C'],
    'num_mismatches': 3.0,
    'parents': [544948, 551092, 544948],
    'strain': 'Denmark/DCGC-281594/2021'}]}]
Screenshot 2023-02-10 at 18 14 51

jeromekelleher commented 1 year ago

The pango_recombinant_lineages_report is definitely doing a bad job here, as it was written implicitly assuming that there would be a single origin for the Pango X lineages. However, it's much more complicated than that. I think we need to cluster the X-lineage samples by their closest recombinant and then reason about those. But that's not terribly simple either, because it's still quite messy.
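A minimal sketch of the clustering suggested above, assuming "closest" means the recombinant ancestor reachable in the fewest edges. The graph is a toy dictionary from each node to its list of parents, not the sc2ts tree sequence, and all node IDs and lineage labels are hypothetical.

```python
from collections import defaultdict, deque


def closest_recombinant(sample, parents, recombinants):
    """Breadth-first search upwards for the nearest recombinant ancestor."""
    seen = {sample}
    queue = deque([sample])
    while queue:
        node = queue.popleft()
        if node in recombinants and node != sample:
            return node
        for p in parents.get(node, ()):
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return None  # no recombinant ancestor found


def cluster_by_recombinant(sample_lineage, parents, recombinants):
    """Group X-lineage samples by (lineage, closest recombinant ancestor)."""
    clusters = defaultdict(list)
    for sample, lineage in sample_lineage.items():
        if lineage.startswith("X"):
            key = (lineage, closest_recombinant(sample, parents, recombinants))
            clusters[key].append(sample)
    return dict(clusters)


# Hypothetical graph: node -> list of parents (recombinants have several).
parents = {10: [5], 11: [5], 12: [6], 13: [3], 5: [1, 2], 6: [2, 3]}
recombinants = {5, 6}
# Hypothetical Pango assignments for the sample nodes.
sample_lineage = {10: "XA", 11: "XA", 12: "XA", 13: "B.1"}

clusters = cluster_by_recombinant(sample_lineage, parents, recombinants)
print(clusters)
```

In this toy case the XA samples split across two recombinant nodes, which is the multiple-origin situation that a single-origin report would get wrong.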

hyanwong commented 1 year ago

So here’s my best rough triage for believable recombinants: I have classified them into good/bad with good having

In the plot below, green points are good + Pango-labelled, orange are good + non-Pango-labelled, and red are bad + Pango-labelled. Red indicates the ones that my basic classifier misses; the ones that are single origin are labelled with an X- designation. I'm probably being overly strict here, and ruling out some fairly reliable recombinants.
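The colour scheme described above amounts to a small lookup on two flags. A sketch, where the `is_good` flag stands for whatever the triage criteria decide, and where the colour for bad + non-Pango-labelled nodes ("grey") is an assumption, since the thread does not say how those are drawn:

```python
def triage_colour(is_good, is_pango):
    """Map the two triage flags onto the plot colours described in the thread."""
    if is_good and is_pango:
        return "green"  # classifier agrees with the Pango label
    if is_good:
        return "orange"  # candidate recombinant without a Pango label
    if is_pango:
        return "red"  # Pango-labelled node that the classifier misses
    return "grey"  # assumption: colour not specified in the thread


# Hypothetical flags for four nodes: (is_good, is_pango).
flags = {"a": (True, True), "b": (True, False), "c": (False, True), "d": (False, False)}
colours = {name: triage_colour(*pair) for name, pair in flags.items()}
print(colours)
```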

pairplot.pdf

By this metric, these are the recombinant nodes identified as putative "good" recombinants in the small tree sequence.

685963, 690263, 213104, 730277, 623069, 746426, 716784, 660632,
662363, 661450, 657300, 659577, 652247, 758580, 743460, 704521,
749143, 703379, 711560, 642711, 705501, 738307, 640831, 674332,
677977, 679355, 674229, 733196, 687046, 630235, 666230, 633497,
582054, 609881, 729195, 663092, 635896, 752475, 625355, 696838,
638141, 638733, 639472, 663539, 677584

Here are the IDs that haven't previously been identified as a Pango X- recombinant, but might be according to my metric:

685963, 690263, 213104, 730277, 623069, 746426, 716784,
660632, 662363, 661450, 657300, 659577, 652247, 758580, 743460

hyanwong commented 1 year ago

659577 is the most convincing (to me) of the non-Pango recombinants that I've seen according to my criterion:

Screenshot 2023-02-11 at 00 21 58 Screenshot 2023-02-11 at 00 21 47

a-ignatieva commented 1 year ago

That one looks slightly messy in that it's a parent of another recombinant node (which has an XZ descendant)?

image

jeromekelleher commented 1 year ago

Big ole recombination hairball on the left path to root for this one:

657300

jeromekelleher commented 1 year ago

This is quite hard, I think. Perhaps it would be better to focus on the big trees from 2021? The trees are better, and there's less time for genuine recombination hairballs to occur. True, we have fewer true positives to check against, but that's OK?