ktmeaton / ncov-recombinant

Reproducible workflow for SARS-CoV-2 recombinant sequence detection.
MIT License

Ratio of parental alleles to intermissions #195

Closed ktmeaton closed 1 year ago

ktmeaton commented 1 year ago

I wonder if the ratio of intermissions to diagnostic alleles could be useful to rule out false positives. The filter could be that there must be fewer intermissions than alleles from the "minor" parent.

In this example, there are 3 alleles that could be coming from a "minor" parent BA.2.3.20 (12310G, 16616C, 17678T), and most strains have 3 intermissions (6979T, 27012C, 27513C).

image
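As a rough sketch of the proposed rule (not the code in ncov-recombinant; the function name `passes_intermission_allele_ratio` is hypothetical), the check could be a strict comparison of the two counts. With the alleles listed above, 3 intermissions against 3 minor-parent alleles would not pass:

```python
def passes_intermission_allele_ratio(minor_alleles, intermissions):
    """Hypothetical helper: pass only when there are strictly fewer
    intermissions than alleles supporting the "minor" parent."""
    return len(intermissions) < len(minor_alleles)


# Alleles taken from the example in this issue.
minor_alleles = ["12310G", "16616C", "17678T"]   # candidate BA.2.3.20 alleles
intermissions = ["6979T", "27012C", "27513C"]    # conflicting alleles

result = passes_intermission_allele_ratio(minor_alleles, intermissions)
print(result)  # False: 3 intermissions is not fewer than 3 alleles
```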

ktmeaton commented 1 year ago

On my first run-through of validation, no positive or negative controls in controls or controls-gisaid fail this filter.

ktmeaton commented 1 year ago

In some more expanded testing, this is helping to remove some delta/delta false positives.

ktmeaton commented 1 year ago

I came across a large number of sequences that came back as highly confident BA.5.2/BA.5.3 recombinants. However, there is substantial allele conflict (intermissions) at the 3' end of the genome (16935 onwards). I realized that I didn't implement logic to use alleles outside the identified regions.

I think these should be considered intermissions, in the sense that they conflict with the evidence for recombination. They are not quite as direct a conflict as a mismatched allele in a parental region. But still, they are "noisy".

image
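One way to picture this, as a sketch only: count two kinds of conflict separately, alleles inside a parental region that match a different parent, and diagnostic alleles that fall outside every identified region. The `Region` and `Allele` types, the function names, and the toy coordinates below are all assumptions for illustration, not the workflow's data model or real genome positions.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int
    end: int
    parent: str


class Allele(NamedTuple):
    pos: int
    parent: str  # population this allele is diagnostic for


def find_region(pos, regions):
    """Return the region containing pos, or None if it lies outside all regions."""
    for region in regions:
        if region.start <= pos <= region.end:
            return region
    return None


def count_intermissions(alleles, regions):
    """Return (in_region_conflicts, outside_region_alleles)."""
    in_region = 0
    outside = 0
    for allele in alleles:
        region = find_region(allele.pos, regions)
        if region is None:
            outside += 1            # "noisy" allele outside any parental region
        elif allele.parent != region.parent:
            in_region += 1          # direct conflict inside a parental region
    return in_region, outside


# Toy data with made-up parents "A"/"B" and made-up coordinates.
regions = [Region(1, 500, "A"), Region(501, 900, "B")]
alleles = [
    Allele(100, "A"), Allele(300, "B"), Allele(600, "B"),
    Allele(950, "A"), Allele(980, "B"),
]
counts = count_intermissions(alleles, regions)
print(counts)  # (1, 2)
```

Keeping the two counts separate leaves room to weight the outside-region noise differently from direct conflicts, or simply to sum them before applying the ratio filter.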

ktmeaton commented 1 year ago

So far, all designated recombinants pass this new logic EXCEPT XAV (Issue #104). Previously, there was the ref allele 21789C that lengthened the BA.2 section. Now, that is no longer BA.2 diagnostic (maybe BA.2.75 has thrown that off?).

image

However, if we set the populations to BA.2 and BA.5.2, the BA.2 signal is strengthened, but the noise also increases slightly.

image

I'm weighing two options:

  1. Tweak modes to have BA.5.2 be a candidate parent.
  2. Set XAV as an auto-pass based on the 3' noise and numerous reversions.

ktmeaton commented 1 year ago

There is an edge case where this will cause false negatives: when additional spurious parents are reported by sc2rf. XBL is an example. My proposed solution is to disable the intermission_allele_ratio filter when there were more parents originally than the number of filtered parents.

image
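A minimal sketch of that proposed bypass, assuming the parent lists before and after filtering are both available. The function names and the placeholder parents `P1`..`P3` are hypothetical, and the counts are illustrative rather than taken from XBL:

```python
def ratio_filter_enabled(original_parents, filtered_parents):
    """Disable the filter when sc2rf originally reported more parents
    than remain after filtering (some intermissions may come from them)."""
    return len(set(original_parents)) <= len(set(filtered_parents))


def keep_recombinant(num_minor_alleles, num_intermissions,
                     original_parents, filtered_parents):
    """Apply the intermission/allele check only when it is enabled."""
    if not ratio_filter_enabled(original_parents, filtered_parents):
        return True  # filter skipped: extra parents were dropped
    return num_intermissions < num_minor_alleles


# Three parents reported, two kept: the ratio filter is skipped.
print(keep_recombinant(3, 3, ["P1", "P2", "P3"], ["P1", "P2"]))  # True
# Same counts, no parents dropped: the ratio filter applies and rejects.
print(keep_recombinant(3, 3, ["P1", "P2"], ["P1", "P2"]))        # False
```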