Because the alignment does not take quality data into account, it introduces
errors into the final sequence that would be ignored during manual
inspection. All settings were left at the installation defaults, except the
minimum quality, which was set to 20 in order to demonstrate Example 2.
Example 1: Insertion errors (insertions.png)
Trace 1 has a high-quality region that reads CC. Trace 2 is just beginning and
contributes a low-quality N to the sequence. The result is a final base call
of CNC, which is clearly wrong.
Example 2: Bayesian poisoning due to misalignments (starting.png)
Trace 1 has a low-quality start, which is misaligned. It contains a C with a
quality of 23. The misalignment pairs this C with a G with a quality of 28, and
the position is marked as N because of the disagreement, which throws off the
Bayesian base caller. The preceding bases (the A and the C), which have lower
qualities, are called correctly.
To suppress errors of the first kind, code could be added that looks for
insertions of N within a high-quality region (above the minimum threshold) and
automatically removes them.
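A minimal sketch of this first suggestion, in Python rather than in the
project's own code. Every name here (drop_flanked_n_insertions, bases, quals,
min_quality) is hypothetical, and the sketch assumes the consensus is available
as a string of base calls with one quality value per base. It only inspects the
consensus, so it treats any N whose two neighbours are both called bases at or
above the threshold as a spurious insertion. A real fix would also want to
check the alignment to confirm that the N came from a gap in the high-quality
trace.

```python
def drop_flanked_n_insertions(bases, quals, min_quality=20):
    """Remove each 'N' whose two neighbours are called bases at or above
    min_quality. Returns the filtered (bases, quals) pair."""
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")

    kept_bases = []
    kept_quals = []
    last = len(bases) - 1
    for i, (base, qual) in enumerate(zip(bases, quals)):
        flanked = (
            base == "N"
            and 0 < i < last                  # an N at either end is left alone
            and bases[i - 1] != "N"
            and bases[i + 1] != "N"
            and quals[i - 1] >= min_quality
            and quals[i + 1] >= min_quality
        )
        if flanked:
            continue                          # drop the spurious insertion
        kept_bases.append(base)
        kept_quals.append(qual)
    return "".join(kept_bases), kept_quals
```

With the data from Example 1, drop_flanked_n_insertions("CNC", [40, 5, 40])
gives ("CC", [40, 40]). If either neighbour is below the threshold, the N is
kept, since the surrounding calls are then not trustworthy enough to justify
removing it.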
To suppress errors of the second kind, it may be possible to implement a
"trace-trimming" feature that reuses the code that trims the final sequence
in order to remove misaligned starts and ends of traces.
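One possible shape for such a trace-trimming step, sketched in Python. This is
not the project's existing trimming code, which is not shown in this report.
The function name, the window parameter, and the rule itself are assumptions:
the trace is cut back to the span between the first and the last run of
`window` consecutive bases that are all at or above the minimum quality. A
single acceptable base inside a poor start, like the C with quality 23 in
Example 2, is trimmed away by this rule.

```python
def trim_trace(bases, quals, min_quality=20, window=5):
    """Trim a trace to the span bounded by the first and last run of
    `window` consecutive bases whose qualities are all >= min_quality.
    Returns ("", []) when no such run exists."""
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    if window < 1:
        raise ValueError("window must be at least 1")

    n = len(quals)
    # Start indices of every window made up entirely of acceptable bases.
    good_starts = [
        i for i in range(n - window + 1)
        if min(quals[i:i + window]) >= min_quality
    ]
    if not good_starts:
        return "", []

    start = good_starts[0]
    end = good_starts[-1] + window            # exclusive end index
    return bases[start:end], list(quals[start:end])
```

For example, trim_trace("ACGTACGTAC", [5, 8, 23, 6, 30, 35, 40, 38, 36, 7],
window=3) returns ("ACGTA", [30, 35, 40, 38, 36]): the isolated quality-23
base near the start and the poor final base are both removed before the trace
would reach the aligner.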
Since traces also suffer from clusters (~10 bases) of low-quality data points
from 20-160 bases, these clusters should also be handled appropriately during
alignment.
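How such clusters should be treated is left open here, but locating them is a
natural first step. The Python sketch below is an illustration only: the
function name and the min_length parameter are hypothetical, and a "cluster" is
assumed to mean a run of consecutive bases that are all below the minimum
quality.

```python
def find_low_quality_clusters(quals, min_quality=20, min_length=5):
    """Return (start, end) index pairs, end exclusive, for every run of at
    least min_length consecutive qualities below min_quality."""
    clusters = []
    run_start = None
    for i, qual in enumerate(quals):
        if qual < min_quality:
            if run_start is None:
                run_start = i                 # a new low-quality run begins
        else:
            if run_start is not None and i - run_start >= min_length:
                clusters.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the trace.
    if run_start is not None and len(quals) - run_start >= min_length:
        clusters.append((run_start, len(quals)))
    return clusters
```

The returned ranges could then be masked or down-weighted before alignment.
Which of those is appropriate is a design decision for the aligner, not
something this sketch settles.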
The ideal solution would be an alignment algorithm that takes quality into
account, but lacking that, the measures described above would be good
stopgaps.
Original issue reported on code.google.com by linyiers...@gmail.com on 14 Aug 2014 at 8:33