How does rebar handle IUPAC ambiguous characters?

ammaraziz commented 9 months ago

Hi Katherine,

Thank you rebar, it's working very nicely with a in-silico dataset (part of a quality assurance program in Australia).

I am getting some mixed results when I come include or exclude (eg majority base) ambiguous characters. How does rebar handle ambiguous bases?

Thanks!

ktmeaton commented 9 months ago

In the current version (v0.2.0), I've hard-coded the genetic alphabet to be: A, C, G, and T, with - reserved for deletions and N representing missing/unknown sites. Any other characters, (ex. IUPAC R = A or G) are treated as unknown. Unknown sites are not treated as a reference base, the coordinate should be ignored/skipped over when performing sequence comparisons.

Do you have any more details you could share about the mixed results? I can try and do some troubleshooting, or plan for some better handling of ambiguous characters.

ammaraziz commented 8 months ago

Hi @ktmeaton

Thanks for the information and apologies for the slow response. I had a closer look at the sample in question, it was purposefully designed to be difficult. It was modelling a co-infection of the two ancestral lineages of a known recombinant.

I think your rebar treating mixed bases as the reference is a good idea. It unfortunately throws a spanner in the works for us as we use the epi2me-labs/wf-artic pipeline which replaces mixed bases as N (as far as I know).

Do you have any advice on how to separate mixed infections from a true recombinant? Assuming the pipeline calls ambiguous bases, a true recombinant would have little to no ambiguous bases but a co-infection would have ambiguous bases. Is my thinking correct here?

Thanks again,

Ammar

ktmeaton commented 8 months ago

To differentiate between co-infection and recombination in sars-cov-2, I'm heavily relying on the size and composition of the recombination segments. I've also only tested samples that have <=10% ambiguity across the whole genome so far. Are you noticing a pattern in your samples that have high levels of ambiguity? I'm curious if that causes more false negatives or false positives.

Here's an overview of my current filters to help differentiate recombination from co-infection:

I'm expecting to see discrete segments of a sufficient length to identify recombination. For sars-cov-2, I've landed on a minimum length of 500 nucleotides, at least 1 substitution (different from the reference), and at least 3 consecutive sites that are informative/differentiating:

These parameters are controlled by the following global parameters in rebar run:

  -l, --min-length <MIN_LENGTH>
          Minimum length of a parental region
          [default: 500]

  -s, --min-subs <MIN_SUBS>
          Minimum number of substitutions in a parental region
          [default: 1]

  -c, --min-consecutive <MIN_CONSECUTIVE>
          Minimum number of consecutive bases in a parental region
          [default: 3]

However, we know that several known recombinants violate these "rules" (ex. XP), and so they're handled as known "edge cases".

XP_BA 1 1_BA 2_27385-29509

The known edge cases are found in edge_cases.json inside the dataset directory from rebar dataset download. These settings will override the global parameters only for samples that are assigned to the corresponding population (ex. samples assigned to XP):

  {
    "population": "XP",
    "parents": [
      "BA.1.1",
      "BA.2"
    ],
    "knockout": null,
    "mask": [
      100,
      200
    ],
    "max_iter": 3,
    "max_parents": 2,
    "min_parents": 2,
    "min_consecutive": 1, <--
    "min_length": 1, <--
    "min_subs": 1, <--
    "naive": false
  },

ktmeaton commented 8 months ago

I'm curious if freyja might also help here? In a co-infection of BA.1 and BA.2, I'm hoping that:

freyja demix would report multiple lineages (BA.1 and BA.2) and their relative abundances.
rebar run would report whichever lineage had the most mutations captured in sequencing ( BA.1). And then the substitutions coming from the other populations would be listed as private in the substitutions column in the linelist.tsv.

Unless there is an amplicon bias towards a minor population, which could produce a long enough segment of informative mutations to look like recombination... 🤔

phac-nml / rebar

How does rebar handle IUPAC ambiguous characters? #20