Open ammaraziz opened 9 months ago
In the current version (v0.2.0), I've hard-coded the genetic alphabet to be: A, C, G, and T, with - reserved for deletions and N representing missing/unknown sites. Any other characters (ex. IUPAC R = A or G) are treated as unknown. Unknown sites are not treated as a reference base; the coordinate should be ignored/skipped over when performing sequence comparisons.
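To make the rule above concrete, here is a minimal Python sketch. It is not rebar's actual implementation; the names normalize and differences are made up for illustration, and it assumes the two sequences are already aligned to the same length.

```python
# Illustrative sketch only, not rebar's actual code.
ALPHABET = {"A", "C", "G", "T", "-"}  # four bases plus "-" for deletions
UNKNOWN = "N"                         # missing/unknown site


def normalize(seq: str) -> str:
    """Map every character outside the alphabet (e.g. IUPAC R) to N."""
    return "".join(c if c in ALPHABET else UNKNOWN for c in seq.upper())


def differences(reference: str, query: str) -> list[tuple[int, str, str]]:
    """Return (1-based coordinate, reference char, query char) for each
    site where both characters are known and they differ."""
    ref, qry = normalize(reference), normalize(query)
    if len(ref) != len(qry):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, r, q)
        for pos, (r, q) in enumerate(zip(ref, qry), start=1)
        if r != q and UNKNOWN not in (r, q)  # unknown sites are skipped
    ]
```

For example, differences("ACGTAC", "ACRTGC") skips coordinate 3 (the R becomes N) and reports only the A to G change at coordinate 5.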
Do you have any more details you could share about the mixed results? I can try and do some troubleshooting, or plan for some better handling of ambiguous characters.
Hi @ktmeaton
Thanks for the information and apologies for the slow response. I had a closer look at the sample in question; it was purposefully designed to be difficult. It was modelling a co-infection of the two ancestral lineages of a known recombinant.
I think rebar treating mixed bases as the reference is a good idea. It unfortunately throws a spanner in the works for us, as we use the epi2me-labs/wf-artic pipeline, which replaces mixed bases with N (as far as I know).
Do you have any advice on how to separate mixed infections from a true recombinant? Assuming the pipeline calls ambiguous bases, a true recombinant would have few or no ambiguous bases, but a co-infection would have ambiguous bases. Is my thinking correct here?
Thanks again,
Ammar
To differentiate between co-infection and recombination in sars-cov-2, I'm heavily relying on the size and composition of the recombination segments. I've also only tested samples that have <=10% ambiguity across the whole genome so far. Are you noticing a pattern in your samples that have high levels of ambiguity? I'm curious if that causes more false negatives or false positives.
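If it helps to check where a sample falls relative to that <=10% range, here is a rough Python sketch. It assumes "ambiguity" means any site that is not A, C, G, T, or a deletion (so N and IUPAC codes both count); that definition and the function names are assumptions for illustration, not something rebar reports.

```python
# Illustrative sketch only; the definition of "ambiguity" is an assumption.
KNOWN = set("ACGT-")   # bases plus "-" for deletions
MAX_AMBIGUITY = 0.10   # the <=10% range mentioned above


def ambiguity_fraction(seq: str) -> float:
    """Fraction of sites that are not A, C, G, T, or a deletion."""
    if not seq:
        raise ValueError("empty sequence")
    ambiguous = sum(1 for c in seq.upper() if c not in KNOWN)
    return ambiguous / len(seq)


def within_tested_range(seq: str) -> bool:
    """True if the genome-wide ambiguity is at most MAX_AMBIGUITY."""
    return ambiguity_fraction(seq) <= MAX_AMBIGUITY
```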
Here's an overview of my current filters to help differentiate recombination from co-infection:
I'm expecting to see discrete segments of a sufficient length to identify recombination. For sars-cov-2, I've landed on a minimum length of 500 nucleotides, at least 1 substitution (different from the reference), and at least 3 consecutive sites that are informative/differentiating.
These filters are controlled by the following global parameters in rebar run:
-l, --min-length <MIN_LENGTH>
    Minimum length of a parental region
    [default: 500]
-s, --min-subs <MIN_SUBS>
    Minimum number of substitutions in a parental region
    [default: 1]
-c, --min-consecutive <MIN_CONSECUTIVE>
    Minimum number of consecutive bases in a parental region
    [default: 3]
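As a rough illustration of how the three thresholds combine, here is a Python sketch. It is not rebar's actual code: the Region and Filters classes and the passes function are hypothetical, and only the default values (500, 1, 3) come from the options above.

```python
# Illustrative sketch only; Region, Filters, and passes are hypothetical names.
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    subs: int         # substitutions (sites that differ from the reference)
    consecutive: int  # longest run of consecutive informative sites


@dataclass(frozen=True)
class Filters:
    min_length: int = 500
    min_subs: int = 1
    min_consecutive: int = 3


def passes(region: Region, filters: Filters = Filters()) -> bool:
    """A candidate parental region must satisfy all three minimums."""
    length = region.end - region.start + 1
    return (
        length >= filters.min_length
        and region.subs >= filters.min_subs
        and region.consecutive >= filters.min_consecutive
    )
```

With the defaults, Region(1000, 1600, 2, 4) passes, while a 201-nucleotide region such as Region(1000, 1200, 2, 4) is rejected on length alone.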
However, we know that several known recombinants violate these "rules" (ex. XP), and so they're handled as known "edge cases".
The known edge cases are found in edge_cases.json inside the dataset directory from rebar dataset download. These settings will override the global parameters only for samples that are assigned to the corresponding population (ex. samples assigned to XP):
{
  "population": "XP",
  "parents": [
    "BA.1.1",
    "BA.2"
  ],
  "knockout": null,
  "mask": [
    100,
    200
  ],
  "max_iter": 3,
  "max_parents": 2,
  "min_parents": 2,
  "min_consecutive": 1,   <--
  "min_length": 1,        <--
  "min_subs": 1,          <--
  "naive": false
},
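The override behaviour can be sketched in Python roughly as follows. This is not rebar's actual code: params_for is a hypothetical helper, and the sketch assumes the edge cases are a list of entries shaped like the XP one above (only the three threshold keys are shown here).

```python
# Illustrative sketch only; params_for is a hypothetical helper.
GLOBAL_PARAMS = {"min_length": 500, "min_subs": 1, "min_consecutive": 3}

# Assumed shape: a list of per-population entries like the XP example above.
EDGE_CASES = [
    {"population": "XP", "min_consecutive": 1, "min_length": 1, "min_subs": 1},
]


def params_for(population, global_params=GLOBAL_PARAMS, edge_cases=EDGE_CASES):
    """Start from the global parameters, then apply any edge-case
    overrides for the population the sample was assigned to."""
    params = dict(global_params)  # copy so the globals are never mutated
    for case in edge_cases:
        if case["population"] == population:
            params.update(
                {k: v for k, v in case.items()
                 if k in global_params and v is not None}
            )
    return params
```

A sample assigned to XP gets all three minimums lowered to 1, while a sample assigned to any other population keeps the global defaults.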
I'm curious if freyja might also help here? In a co-infection of BA.1 and BA.2, I'm hoping that:
- freyja demix would report multiple lineages (BA.1 and BA.2) and their relative abundances.
- rebar run would report whichever lineage had the most mutations captured in sequencing (BA.1). And then the substitutions coming from the other populations would be listed as private in the substitutions column in the linelist.tsv.
Unless there is an amplicon bias towards a minor population, which could produce a long enough segment of informative mutations to look like recombination... 🤔
Hi Katherine,
Thank you for rebar, it's working very nicely with an in-silico dataset (part of a quality assurance program in Australia).
I am getting some mixed results when I include or exclude (e.g. majority base) ambiguous characters. How does rebar handle ambiguous bases?
Thanks!