Open martinghunt opened 2 years ago
The logic for identifying primers before cylon is run is different than the later primer id logic that we use to mask primer positions in the consensus.
Unfortunately, if a fragment ends between two alternative primers and the adapter sequence happens to be similar enough to the reference not to be soft-clipped we have no way to be sure if it's legitimately a fragment that ends in that region, or a fragment with adapter noise that ends there.
Luckily we have a couple advantages over the data here:
In summary, there are going to be annoying sequencing artefacts that we need to catch so we have to use additional polishing and QC filters.
Simulated test cases are very helpful for identifying these kinds of artefacts and testing at the component level. Since there isn't much that we can do in this situation, the best we can do is tweak our QC model to reduce the impact.
To adjust the QC thresholds we should make sure that we look at real world examples of this. Do we have a way to identify this problem in real data?
This issue is solely about how reads are allocated to amplicons/primers, and then deciding which primers to reject. It's not about QC.
The reads unambiguously are from SARS-CoV-2_76_RIGHT
based on read mapping. The reads are mapped with the end coordinate being 23057. samtools depth:
MN908947 23054 336
MN908947 23055 325
MN908947 23056 304
MN908947 23057 288
MN908947 23058 2
MN908947 23059 1
MN908947 23060 1
MN908947 23061 1
MN908947 23062 0
MN908947 23063 0
The depth of the remainder of the amplicon, and indeed past it, is zero. Therefore SARS-CoV-2_76_RIGHT_alt1
is not present.
The only noise here is the depth of 2, 1, 1, 1 at positions 23058-23061.
The expected behaviour is that primer SARS-CoV-2_76_RIGHT_alt1
is excluded.
Before running cylon we detect which primers are present for each amplicon. We do this because we want to include cases where both alternate primers are present in the sequencing data. We construct a histogram of observed primers for each amplicon and use this to decide which ones are actually there.
Unfortunately, it appears that it is possible for adapter sequences to trick this stage.
Luckily, we have downstream QC and tighter primer id to compensate for these cases.
Firstly, the code should work on these simulated reads and exclude primer SARS-CoV-2_76_RIGHT_alt1
. The read mapping ends exactly at the end of primer SARS-CoV-2_76_RIGHT
(except those few erroneous depths of 2,1,1,1, which should not be enough to trick into thinking there's another primer downstream of there).
Secondly, there's a minimal example of simulated nanopore reads attached that reproduces this error. Perfect reads from amplicon 75 and from amplicon 76. All reads span exactly from the start of a left primer to end of a right primer. Reads do not have any adapter and they are literally identical to the reference genome. They include each primer to its end, and have zero extra bases past the primer ends. The reads are at 200X, with 100X from each strand.
Amplicon 75 reads are from primer SARS-CoV-2_75_LEFT
to primer SARS-CoV-2_75_RIGHT
.
Amplicon 76 reads are from primer SARS-CoV-2_76_LEFT
to primer SARS-CoV-2_76_RIGHT
.
Expected behaviour: the primer SARS-CoV-2_76_RIGHT_alt1
(coords 23121-23141) is excluded.
Observed behaviour: the amplicon coords given to cylon are:
"SARS-CoV-2_76": {
"start": 22648,
"end": 23140,
"left_primer_end": 22773,
"right_primer_start": 23028
},
which means that the primer SARS-CoV-2_76_RIGHT_alt1
has not been excluded.
Thank you for the example. Please note that there is a bit more logic behind identifying the target primers: https://github.com/jeff-k/viridian_workflow/blob/6ea9cdbf7660508b41cf0def51b17c3ae4626054/viridian_workflow/readstore.py#L320
In short, we reason about the "primer region" during this first amplicon id pass, that being the interval spanned by min(p1_start, p2_start)
and max(p1_end, p2_end)
when there are alternate primers.
When one of the multiple primers is not observed above the primer region threshold (default 100) it is excluded from this region. When neither of the primers are above this threshold (as in this example), we assume the largest extent of the amplicon. This is to safely accommodate data which may be fragmented.
Please note that these preliminary primer bounds are not related to the consensus masking or qc processes. So the primers are not really "kept".
Is this behaviour causing problems for cylon?
@jeff-k - referring to this comment of yours (my bold)
"When one of the multiple primers is not observed above the primer region threshold (default 100) it is excluded from this region. When neither of the primers are above this threshold (as in this example), we assume the largest extent of the amplicon. "
i'm confused. One of the primers has coverage of 200x and the alt has zero. So, the alt should be excluded. What do you mean neither of the primers is above the threshold of 100?
No, i'm doubly confused.
Whoops, I interpreted the attached file as a fastq based on the extension and only counted half the records.
Either way, yes, for point 2 this code appears to be working as described.
I don't know if this is agreeing that there is a bug or disagreeing that there is bug and that the code does not need changing: "Either way, yes, for point 2 this code appears to be working as described."
The right primer is not being excluded. As described below.
Amplicon SARS-CoV-2_76 has two right primers:
Cylon is given this:
"SARS-CoV-2_76": {
"start": 22648,
"end": 23140,
"left_primer_end": 22773,
"right_primer_start": 23028
},
The end
coordinate given to cylon is the end coordinate of the right primer SARS-CoV-2_76_RIGHT_alt1
(difference off 1 because of 0- or 1-based coordinates).
Expected behaviour is: "end": 23056
, ie the end of the primer SARS-CoV-2_76_RIGHT
.
Months ago it was agreed that this would be part of the method. When there is:
>
1 primer at an amplicon end,then all other alt primers there with no support would be excluded from that point onwards.
OK that answers my question here
https://github.com/iqbal-lab-org/viridian_workflow/issues/99#issuecomment-1275214193
I looked at that json and checked the right primer start coord which seemed to correctly be at 23028. I missed your main point was that the end was at 23140, which is the bug. That end cpord needs to be at the right hand end of 76-right, not 76-right-alt1
So, I withdraw my question 2, and now I see the bug again, sorry for confusion
Ah, I see the confusion. The start
and end
fields refer to the amplicon coordinates, and the left_primer_start
and right_primer_end
fields refer to the observed coverage by primers as specified.
Note that it would be somewhat redundant if all four fields represented the same data (+/- primer lengths).
Since the data required by cylon is available in the fields that are specifically labelled as the primer coords I believe that this issue can be closed.
It can't be closed because:
end
coordinate is incorrect. Cylon requires the end coordinate to be correct.SARS-CoV-2_76_RIGHT_alt1
is counted as present, despite zero evidence of its presence, needs investigating. In both the simulated data and in the perfect reads.I have double checked the primer data source and I can confirm that the right-most bound of the amplicon (that is, considering all possible amplicons) is correctly propagated to the end field.
In addition to this the coordinates of the observed primers are correctly identified in the primer start and end fields. In this case that means excluding the coords of alt1.
This has always been the specified behaviour of these fields and it would not be appropriate to relabel the JSON output that we have been generating already.
ok i'd like to have a conversation about this; i'm not feeling great today, so i suggest the conversation happens on friday. i'd like this to stay open til we have had that conversation.
We have examples where there is a choice of two primers for the end of a given amplicon. Only one primer of the two primers has support from the reads.
Observed behaviour: Both primers are kept, instead of excluding the primer that isn't supported by reads. Cylon is given amplicon coords as if both primers are present (and presumably everything downstream also thinks both primers there) Expected behaviour: The primer not supported by reads is excluded.
Details are below for one sample.
Path on codon:
/nfs/research/zi/mhunt/Viridian/Debug_sims_lily_20220930/Samples/1441/vwf.20220926.ccf5ef4676.ont.frs0.6/
. Current working directory was/nfs/research/zi/mhunt/Viridian/Debug_sims_lily_20220930/Samples/1441/
when running it.Version of viridian workflow: ccf5ef46765ad30cb11e9a3df716efdb2d11a4f9
The primer scheme is ARTIC 4.1. The run is using simulated Nanopore reads.
Amplicon
SARS-CoV-2_76
is present at around 380X depth. The next ampliconSARS-CoV-2_77
is not present - it has zero reads.Amplicon
SARS-CoV-2_76
has two right primers:SARS-CoV-2_76_RIGHT
at coords 23029-23057.SARS-CoV-2_76_RIGHT_alt1
at coords 23121-23141.Read mapping shows that:
SARS-CoV-2_76
all come from the primerSARS-CoV-2_76_RIGHT
.SARS-CoV-2_76_RIGHT_alt1
.Excerpt from
samtools depth -aa
on the BAM made by viridian (after coordinate sorting it):and continues at zero depth until around the start of amplicon 78:
cylon is given this in the file
ampicons.json
forSARS-CoV-2_76
:This means that primers
SARS-CoV-2_76_RIGHT
andSARS-CoV-2_76_RIGHT_alt1
have been counted as present in the reads. Expected behaviour is that onlySARS-CoV-2_76_RIGHT
is present.Screenshot attached showing the mapped reads in this region. Amplicon SARS-CoV-2_76 is highlighted (it also has two left primers. The highlighted part is using the "inner" of the two primers at each end of the amplicon).