y9c opened this issue 1 year ago (status: Open)
Maybe it’s time to implement an alternative quality trimming algorithm in Cutadapt.
The current algorithm is just a reimplementation of the algorithm in BWA and shouldn’t be changed, as that would be a backwards-incompatible change. A very small modification that ignores the very last quality value or so might be OK, but this would need to be tested.
A new quality trimming option would be much less problematic. I don’t know if there’s anything more recent, but a quick search found this paper, which describes several quality trimming algorithms, some of which should be easy to implement. Do you think one of them would work for your case?
Yes. A new algorithm would be a better solution for this. The new Illumina sequencing platforms (NextSeq 2000, NovaSeq) use a compressed encoding of the quality scores, and the running-sum method in Cutadapt might not be fully accurate for data from these platforms.
Thank you for sharing the paper. It seems that window-based methods are better than running-sum-based methods? Is the algorithm used in SolexaQA the best? But the curve in Fig. 2 looks weird, and it might not be easy to find the best Q cutoff.
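For comparison, a window-based trimmer could look roughly like the sketch below. This is a generic sliding-window illustration, not the SolexaQA algorithm or any specific method from the paper. The function name `window_trim_index` and the default window size of 4 are assumptions made up for this example.

```python
def window_trim_index(qualities, threshold, window=4):
    """Return the cut index: the start of the first window whose mean
    quality is below threshold (keep qualities[:index]).

    Hypothetical helper for illustration; not part of Cutadapt.
    """
    n = len(qualities)
    if n < window:
        # Read shorter than the window: use one window over the whole read
        window = max(n, 1)
    window_sum = sum(qualities[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window one position to the right
            window_sum += qualities[start + window - 1] - qualities[start - 1]
        # Compare sums instead of means to avoid float division
        if window_sum < threshold * window:
            return start
    return n


read = "CCCCCCCCCCCCCCCCCCCCCC-CCC-;C------C---C--C-;C----C-C---;-C-----C-C--;C-C-C----C"
quals = [ord(c) - 33 for c in read]
end = window_trim_index(quals, 30)
print("window-trimmed:", read[:end])
```

Because the cut is placed at the start of the first failing window, this variant can remove a few good bases just before the first low-quality one. Whether that is acceptable would depend on the chosen window size and cutoff.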
I haven’t looked at the window-based methods, but the ERNE-FILTER algorithm from the linked paper seems to work on your example. Here’s a simplified version that doesn’t trim the 5' end:
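def qual_trim_index(qualities, threshold):
    """Return the index at which to cut the 3' end (keep qualities[:index])."""
    score = 0  # running sum of (q - threshold)
    best_end = 0
    best_score = 0
    for i, q in enumerate(qualities):
        score += q - threshold
        # Remember the prefix with the highest running sum seen so far
        if score > best_score:
            best_end = i + 1
            best_score = score
    return best_end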
before = "CCCCCCCCCCCCCCCCCCCCCC-CCC-;C------C---C--C-;C----C-C---;-C-----C-C--;C-C-C----C"
qualities = [ord(c) - 33 for c in before]
end = qual_trim_index(qualities, 30)
after = before[:end]
print("before:", before)
print("after: ", after)
Result:
before: CCCCCCCCCCCCCCCCCCCCCC-CCC-;C------C---C--C-;C----C-C---;-C-----C-C--;C-C-C----C
after: CCCCCCCCCCCCCCCCCCCCCC
If I change the threshold to 27, it retains four additional qualities, which I think is desired.
(Quality values: 'C' = 34, ';' = 26, '-' = 12)
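If 5' trimming were also wanted, the same running sum could be extended to pick the contiguous segment with the highest total of `q - threshold` (a Kadane-style maximum-sum segment). The sketch below is my own extension of the simplified function, not necessarily identical to what ERNE-FILTER does. The name `qual_trim_span` is hypothetical.

```python
def qual_trim_span(qualities, threshold):
    """Return (start, end) of the segment maximizing sum(q - threshold).

    Hypothetical two-ended variant for illustration; keep qualities[start:end].
    """
    score = 0
    start = 0  # start of the current candidate segment
    best_start = best_end = 0
    best_score = 0
    for i, q in enumerate(qualities):
        score += q - threshold
        if score < 0:
            # The segment so far is a net loss: restart after this position
            score = 0
            start = i + 1
        elif score > best_score:
            best_score = score
            best_start, best_end = start, i + 1
    return best_start, best_end


read = "CCCCCCCCCCCCCCCCCCCCCC-CCC-;C------C---C--C-;C----C-C---;-C-----C-C--;C-C-C----C"
quals = [ord(c) - 33 for c in read]
start, end = qual_trim_span(quals, 30)
print("kept:", read[start:end])

# Same read with two low-quality bases prepended at the 5' end
start5, end5 = qual_trim_span([12, 12] + quals, 30)
print("span with bad 5' bases:", (start5, end5))
```

On the example read the start stays at 0, so the result should match the 3'-only version. With low-quality bases prepended, the start moves past them.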
Here’s a small command-line script in case you want to test it. It appears to be more aggressive than the current quality trimming method.
This is a real example from a NextSeq sequencing run. The quality at the 3' end of this read is extremely low, and the read should be trimmed.
However, when running cutadapt with the -q 30 argument, the read remains unchanged. But when the last base is cut before quality trimming, the result is correct.
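To illustrate why a single high-quality base at the very end can block trimming, here is a minimal sketch of a BWA-style 3' running-sum trimmer as I understand it. It is written from the algorithm's description, not copied from the Cutadapt source, and the name `bwa_style_trim_index` is hypothetical.

```python
def bwa_style_trim_index(qualities, cutoff):
    """Return the 3' cut index using a BWA-style running sum.

    Sketch for illustration only; keep qualities[:index].
    """
    score = 0
    max_score = 0
    stop = len(qualities)
    # Scan from the 3' end towards the 5' end
    for i in reversed(range(len(qualities))):
        score += cutoff - qualities[i]
        if score < 0:
            # The sum went negative: stop scanning
            break
        if score > max_score:
            max_score = score
            stop = i
    return stop


read = "CCCCCCCCCCCCCCCCCCCCCC-CCC-;C------C---C--C-;C----C-C---;-C-----C-C--;C-C-C----C"
quals = [ord(c) - 33 for c in read]

# Full read: the final 'C' (Q34) exceeds the cutoff, so the scan stops at once
print("full read: ", read[:bwa_style_trim_index(quals, 30)])
# Same read with the last base removed before quality trimming
print("last base cut:", read[:bwa_style_trim_index(quals[:-1], 30)])
```

In this sketch the sum becomes negative on the very first step for the full read (30 - 34 = -4), so nothing is trimmed. Once the last base is removed, the scan continues through the low-quality region and the cut lands at the first '-'.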
Note: The NextSeq 2000 machine uses RTA3 tools, and each quality value is one of 'C', ';' or '-'.