pachterlab / splitcode

Flexible and efficient parsing, interpreting and editing of sequencing reads
https://pachterlab.github.io/splitcode/
BSD 2-Clause "Simplified" License

Feature Request: Keeping the FASTQ quality scores #18

Open JohnMMa opened 4 months ago

JohnMMa commented 4 months ago

I understand that splitcode has many uses, and that some of them, such as --assign or sub, substitute one sequence for another. In a manner of speaking, that makes the corresponding quality scores irrelevant, since the substituted sequences do not come out of the sequencer.

However, it seems to me that splitcode changes the quality score of every output base to K, including in places where there's no substitution. For example, take the following config file:

@extract <barcode{{a14}}>,<barcode{{bead1}}>,<barcode{{bead2}}>,<barcode{{bead3}}>

tags    ids     groups  locations       distances       previous        next    minFindsG       maxFindsG       exclude
a14_plain.txt$  a14     a14     0:0:8   0       -       {{bead1}}       1       1       0
struct/newBeads/bc1.txt$        bead1   bead1   0:8:18  0       {{a14}} {{bead2}}       1       1       0
struct/newBeads/bc2.txt$        bead2   bead2   0:18:28 0       {{bead1}}       {{bead3}}       1       1       0
struct/newBeads/bc3.txt$        bead3   bead3   0:28:38 0       {{bead2}}       -       1       1       0

There's no substitution involved, just extraction. Yet splitcode recodes all Phred scores in the output to K, i.e., from

@A00563:449:HW2C7DMXY:1:1101:17345:1000:ATACATGA
AACAGACAGACGCGATATGAGAGTTCCTAATGTGAGCAATACATGA
+
??????????????????????????????????????????????

to

@A00563:449:HW2C7DMXY:1:1101:17345:1000:ATACATGA
AACAGACAGACGCGATATGAGAGTTCCTAATGTGAGCA
+
KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK

This makes it hard for us to calculate certain QC metrics, such as Q30% in the barcode. Is it possible to keep the original file's Phred scores when no substitution is involved?
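To make the QC concern concrete, here is a minimal sketch of a Q30% calculation over a quality string. It assumes Phred+33 encoding, under which `?` decodes to Q30 and `K` to Q42. The helper names `phred_scores` and `q30_fraction` are made up for illustration and are not part of splitcode.

```python
from typing import List, Optional


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def q30_fraction(quality: str, start: int = 0, end: Optional[int] = None,
                 threshold: int = 30) -> float:
    """Fraction of bases in quality[start:end] with a score >= threshold."""
    scores = phred_scores(quality[start:end])
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)


if __name__ == "__main__":
    # '?' is Q30 and 'K' is Q42 under Phred+33.
    print(phred_scores("?K"))
    # Once every base is recoded to 'K', any read reports the same value,
    # so the metric no longer says anything about the sequencing run.
    print(q30_fraction("K" * 38))
    # A mixed-quality string for comparison ('5' is Q20).
    print(q30_fraction("?5?5"))
```

With the original scores, the `start` and `end` arguments could be set to the barcode coordinates from the config (for example 0 and 38) to restrict the metric to the barcode region.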

Yenaled commented 4 months ago

Ooh yeah, extraction automatically resets the quality scores. I'll try to have it keep the quality scores in the next release if possible.
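
Until a release keeps the scores, one possible workaround is a post-processing step outside splitcode. This is only a sketch: `restore_quality` is a hypothetical helper, not a splitcode feature. It assumes the extracted sequence is an unmodified, contiguous piece of the original read (true for the example above, not for outputs that involve substitution) and that original and extracted records have already been paired up, e.g. by read name.

```python
from typing import Optional


def restore_quality(original_seq: str, original_qual: str,
                    extracted_seq: str) -> Optional[str]:
    """Return the original quality characters for extracted_seq.

    Returns None if the sequence and quality lengths disagree, or if
    extracted_seq is not a contiguous substring of original_seq.
    If extracted_seq occurs more than once, the first occurrence is used.
    """
    if len(original_seq) != len(original_qual):
        return None
    start = original_seq.find(extracted_seq)
    if start < 0:
        return None
    return original_qual[start:start + len(extracted_seq)]


if __name__ == "__main__":
    # Read and extraction taken from the example in this issue.
    seq = "AACAGACAGACGCGATATGAGAGTTCCTAATGTGAGCAATACATGA"
    qual = "?" * len(seq)
    extracted = "AACAGACAGACGCGATATGAGAGTTCCTAATGTGAGCA"
    print(restore_quality(seq, qual, extracted))
```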