[BUG report] Quality trimming is not working when the last base is high quality.

marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads

https://cutadapt.readthedocs.io

MIT License

523 stars 130 forks source link

[BUG report] Quality trimming is not working when the last base is high quality. #676

Open y9c opened 1 year ago

y9c commented 1 year ago

This is a real example from a Nextseq sequencing run. The 3' end of this read is extremely low, and the read should be trimmed.

@test
CTTGCTCCCCTTCCTCACCGCATTCCCTAAAGCATCGCCGCCCCCGTTCGGTAAGTTTTGGGGTTTAGTAACCCCCTAAA
+
CCCCCCCCCCCCCCCCCCCCCC-CCC-;C------C---C--C-;C----C-C---;-C-----C-C--;C-C-C----C

However, when running cutadapt with -q 30 argument, the read remains unchanged.

$ cutadapt -q 30 test.fq 2>/dev/null

@test
CTTGCTCCCCTTCCTCACCGCATTCCCTAAAGCATCGCCGCCCCCGTTCGGTAAGTTTTGGGGTTTAGTAACCCCCTAAA
+
CCCCCCCCCCCCCCCCCCCCCC-CCC-;C------C---C--C-;C----C-C---;-C-----C-C--;C-C-C----C

But when the last one base is cut before quality trimming, the result is correct.

$ cutadapt -u -1 -q 30 test.fq 2>/dev/null

@test
CTTGCTCCCCTTCCTCACCGCA
+
CCCCCCCCCCCCCCCCCCCCCC

Note: The nextseq 2000 machine use RTA3 tools, and the quality bases are one of C, ; or -.

marcelm commented 1 year ago

Maybe it’s time to implement an alternative quality trimming algorithm in Cutadapt.

The current algorithm is just a reimplementation of the algorithm in BWA and shouldn’t be changed as that would be a backwards incompatible change. Possibly a very small modification that ignores the very last quality value or so would be ok, though, but this would need to be tested.

A new quality trimming option would be much less problematic. I don’t know if there’s anything more recent, but a quick search found this paper that describes some quality trimming algorithms, some of them should be easy to implement. Do you think one of them would work for your case?

y9c commented 1 year ago

Yes. A new algorithm would be a better solution for this. The new Illumina sequencing platforms (Nextseq 2000, NovaSeq) use compressed encoding of the quality score, and the running sum method in cutadapt might not be fully accurate for the new sequencing data.

y9c commented 1 year ago

Thank you for sharing the paper. It seems that window-based methods are better than running sum based methods? Is the algorithm used in SolexaQA the best? But the curve in Fig2 looks weird, and it might not be easy to find the best Q cutoff.

marcelm commented 1 year ago

I haven’t looked at the window-based methods, but the ERNE-FILTER algorithm from the linked paper seems to work on your example. Here’s a simplified version that doesn’t trim the 5' end:

def qual_trim_index(qualities, threshold):
    score = 0
    best_end = 0
    best_score = 0
    for i, q in enumerate(qualities):
        score += q - threshold
        if score > best_score:
            best_end = i + 1
            best_score = score

    return best_end

before = "CCCCCCCCCCCCCCCCCCCCCC-CCC-;C------C---C--C-;C----C-C---;-C-----C-C--;C-C-C----C"
qualities = [ord(c) - 33 for c in before]
end = qual_trim_index(qualities, 30)
after = before[:end]
print("before:", before)
print("after: ", after)

Result:

before: CCCCCCCCCCCCCCCCCCCCCC-CCC-;C------C---C--C-;C----C-C---;-C-----C-C--;C-C-C----C
after:  CCCCCCCCCCCCCCCCCCCCCC

If I change the threshold to 27, it retains four additional qualities, which I think is desired.

(Quality values: C: 34, ;: 26, -: 12)

Here’s a small command-line script in case you want to test it. It appears to be more aggressive than the current quality trimming method.