Open martinghunt opened 2 years ago
I've added an additional bounds check in self_qc to the read position vs. the start of the left primer and the end of the right primer that should fix this as long as the adapter sequence doesn't extend too far beyond the primer match distance threshold (otherwise we have the unfortunate such as in issue #99)
Can you point me to some fastqs where this appears? I'd like to test the solution
Example is the same as in https://github.com/iqbal-lab-org/viridian_workflow/issues/99. Please also see https://github.com/iqbal-lab-org/viridian_workflow/issues/100 for more detail on the amplicon in question.
@iqbal-lab I understand that there are a number of real world samples where this has produced artefacts, would it be possible for me to see them?
Sure, can you confirm first that this works on the simulated data?
Yes, I am happy about the bounds checking fix for this one example.
Since this is a synthetic test case that deviates from our QC model of real-world data in a relevant way (many nanopore reads of exactly matching lengths and adapter noise) I would like to confirm that the fix properly satisfies the real-world conditions.
Anyway, here's a sample ERR8959196 from the oxford validation set which has loads of adapters.
Thanks for the sample.
The instances of this artefact seem a bit marginal to me:
Primer, Base position, count of occurrences
...
(Primer(name='SARS-CoV-2_9_RIGHT', seq='CACAGGCGAACTCATTTACTTCTG', left=False, forward=False, ref_start=2861, ref_end=2884), 2886) 2
(Primer(name='SARS-CoV-2_9_RIGHT', seq='CACAGGCGAACTCATTTACTTCTG', left=False, forward=False, ref_start=2861, ref_end=2884), 2887) 1
(Primer(name='SARS-CoV-2_9_RIGHT', seq='CACAGGCGAACTCATTTACTTCTG', left=False, forward=False, ref_start=2861, ref_end=2884), 2888) 1
(Primer(name='SARS-CoV-2_56_RIGHT', seq='TGACTCTTACCAGTACCAGGTGG', left=False, forward=False, ref_start=17082, ref_end=17104), 17105) 13
(Primer(name='SARS-CoV-2_56_RIGHT', seq='TGACTCTTACCAGTACCAGGTGG', left=False, forward=False, ref_start=17082, ref_end=17104), 17106) 1
(Primer(name='SARS-CoV-2_86_RIGHT', seq='TCAATTGAGTTGAGTACAGCTGGT', left=False, forward=False, ref_start=26026, ref_end=26049), 26050) 55
(Primer(name='SARS-CoV-2_86_RIGHT', seq='TCAATTGAGTTGAGTACAGCTGGT', left=False, forward=False, ref_start=26026, ref_end=26049), 26051) 2
...
The last column is the count of bases attributed to this artefact. I do not see a difference in any calls made in the final VCF. Do you have a list of the positions where you have observed this artefact?
- Well, we're about to dive in and dissect 12000 S African samples via pileups and the TSV to check everything works, so this is definitely going to be tested on real world data as soon as it is committed.
No, I am sorry but I do not feel comfortable with committing untested code. In my experience this can lead to confusion where it can be tricky to separate issues raised by experimental results from true-positive bugs.
- i dont think this issue is fundamentally about many reads exactly matching lengths, is about not labelling adapters as true sequence, so i think the fix you've implemented does need to go in.
We have a quality control model that is designed to help with adapter artefacts but it does not expect to see a lot of nanopore reads that have the same exact lengths. One way to avoid speculation about this stuff is to look at some examples.
Is it the case that we have observed instances of this issue in real world data? If not I would be happy to move the more open-ended data exploration to a different venue.
OK this issue is on hold until Friday
We need to handle adapter sequences.
Observed behaviour:
Desired behaviour: the adapter sequence is masked. I think self-qc could do this. An alternative would be to remove adapters from the reads earlier on in the pipeline so they are never seen again.
Example is the same as in #99. Please also see #100 for more detail on the amplicon in question.
The adapter sequence is included in the consensus at the end of amplicon
SARS-CoV-2_76
. All the reads there end either with the primer, or with the primer plus a few bp of adapter. None of them should contribute toClean.Tot.cons
at the positions of the primer or past the end of the primer.First columns of
all_stats.tsv
at the end of the primer -- start of adapter -- start of dropped amplicon: