torognes / vsearch

Versatile open-source tool for microbiome analysis

Why FASTQ quality value above qmax is treated as Fatal error? #522

Open billzt opened 1 year ago

billzt commented 1 year ago

When I used --fastx_filter to trim FASTQ files (which seemed like a very easy task),

$ vsearch --fastx_filter in.fastq --fastq_stripleft 20 --fastq_stripright 20  --fastaout out.fa

I encountered a fatal error:

Fatal error: FASTQ quality value (75) above qmax (41)
By default, quality values range from 0 to 41.
To allow higher quality values, please use the option --fastq_qmax 75

So I added --fastq_qmax 75. However, another fatal error occurred:

Fatal error: FASTQ quality value (76) above qmax (75)
By default, quality values range from 0 to 41.
To allow higher quality values, please use the option --fastq_qmax 76

An example is https://www.ncbi.nlm.nih.gov/sra/SRR19317902, whose reads look like this:

@SRR19317902.1 M05585:4:000000000-BPG86:1:1101:8340:1726_CONS_SUB_SUB_CMP length=313
TTTATCTTCATCTATTGGAGCTCCAGGTCCTGCAGTTGACTTGGCTATTTTTTCACTTCACATAGCTGGTGCATCTTCAATTTTAGGTGCTATTAACTTTATCACTACTATATATAATATGCGTGCCCCTGGAATGACACTTCACAAAATGCCTTTGTTTGTTTGGTCCATTTTAGTAACGGCTTTTTTATTACTACTATCCTTACCAGTATTAGCTGGTGCAATCACTATGCTTTTAACCGATAGAAACTTTGGTACAGCTTTTTTTGCAGCAGAAGCTGGAGGTGATCCAGTACTTTTCCAACATCTATTT
+SRR19317902.1 M05585:4:000000000-BPG86:1:1101:8340:1726_CONS_SUB_SUB_CMP length=313
HGIHIHHIIIIIIIHIIHIIIFHIHI:GHFFGHIIHE.EEHEHHBHHHIHHIII>BHHIEEEGIHIHIIIIIIIHHI;HIIIIIHIIIHHIH;GHCGHIlef\abmllj_hQ`ddgX\h]ghigddkkalkab]lfgkcg`lmllmlhi`bYi]imlljblilg^Yll^cklkki^lkmliifhhjmmmmmlmli^^_hmd]hlmlllmlllgIIIHHFIIGHHDCHHGIIIEGCIEHEIIIIIIIIIIIIHIHEIIIIIIIIIHIHHIIIIIHIIIHIIIIIIIIIHIHEIIHHHB>HHHIIHIIHHHIIIH
@SRR19317902.2 M05585:4:000000000-BPG86:1:1101:19814:1776_CONS_SUB_SUB_CMP length=313
ATTATCATCTAAAACCGGACAACCTGATCCAGCAATGCATTTAGCTATTCTGGCGCTACACGTAGCTGGAGCTTCTTCAATTTTAGGACCGATAAATTTTATAACTACAATTCTAAACATGAGAACTCCTGGTATGACACTACATAAGATGCCACTATTTTCATGGTCAATTTTAGTAACTGCGTTTTTACTTCTATTATCTTTACCAGTTTTAGCGGGAGCAATTACAATGCTGCTCACAGACAGAAATTTTGGAACAGCTTTTTTTGAAGCTCAAACAGGAGGAGACCCAGTTTTGTTTCAACATTTATTT
+SRR19317902.2 M05585:4:000000000-BPG86:1:1101:19814:1776_CONS_SUB_SUB_CMP length=313
:G;HEH.=8E=8E>8.9BHEB8..E...EB.8.GEH..E.8BH.8.;E>G..;.-:?-9G><.C6HH>.;D>.D.C;H...E>H...E.;6-B-6.7.9b]Naha^RGfRPb^j]GhgllDKCQUIe\kmcZlll]EKj_`QDedli[E[KHXN,UQ^ajklQbVPcdeGfQXQMUQc:[@4S`EH^k\OPcg[OUjgEcFagjFXEVRQ?MS-@9HHAH;IIIHHGHAHE.HH?E.EE.E.;.>>H?.H.AH>A>IHIIHEH>HCH;E<8..IHEG8EIIIGGE8E..:IEHG8EIIIIIGHIHE>HHEGEH
@SRR19317902.3 M05585:4:000000000-BPG86:1:1101:10609:1789_CONS_SUB_SUB_CMP length=313
TCTAAGTCATATTACGAGCCACTCTGGCGGTGAAGATGATTTAGCTATTTTCAGTTTACATTTATCTGGAGCGAGCAGTATTTTAGGGGCGATTAACTTTATTACCACTATCTTTAACATGCGTGGGCCTGGTCTGGGCTTCCATCGCTTACCTTTATTTGTGTGGTCTGTTCTGATTACGGCCTTCTTACTTCTTCTCTCTTTACCGGTTTTAGCGGGAGCGATTACGATGCTGTTAACGGATCGTAACTTCAATACCTCGTTCTTTGATCCGGCGGGCGGAGGTGACCCTATTTTGTGCCAGCACCTTTTT
+SRR19317902.3 M05585:4:000000000-BPG86:1:1101:10609:1789_CONS_SUB_SUB_CMP length=313
-BGHECE>;;GH;.GB.B--..8>.E>.----.EH.E....E8>E88;.EG;E>EE.G>HG;8EAACA..;.:BH9-D..7>>7EH:FE<<-B>8.<G.f[kkillh[lRR`NN]g[PgfljmbjijZbQcMQZ;>K`PLLaP\I_jcNOS]<]G7OORRZ[]VROMQbkgia]eik]\IaZhVjcdaQ`jhiY^ciaklibdkRehkXYXL^IEEH9IHIF>E>HIGHH:HIHHFGIH.-HEH<HHGHCHHIH;D6??EIIEBH.IHBH<E9<EGGH<E>GH8H>HHCGCEGB8>B>E.HE8:>EB9EGIHH

Above all, what I wonder is why a FASTQ quality value above qmax is treated as a fatal error. I had thought that any quality value larger than 41 would simply be capped at 41.
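For what it's worth, such a cap can be applied outside vsearch before filtering; a minimal sketch, assuming phred+33 encoding and the standard four-line FASTQ layout (the cap character 'J' encodes quality 41):

```shell
# Replace every quality character above 'J' (phred+33 value 41) with 'J'
# on each quality line (every 4th line of a standard FASTQ file).
# Demonstrated on a tiny inline record; for a real file, replace the
# printf with "cat in.fastq" (in.fastq is a placeholder name).
printf '@s1\nACGT\n+\nIJKl\n' | \
    awk 'NR % 4 == 0 { gsub(/[K-~]/, "J") } { print }'
```

Here the quality line IJKl would come out as IJJJ, since 'K' and 'l' both encode values above 41.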

billzt commented 1 year ago

In addition, I tried --fastq_chars and it gave the following results:

Reading FASTQ file 100%  
Read 54687 sequences.
Qmin 42, QMax 109, Range 68
Guess: -fastq_qmin 9 -fastq_qmax 76 -fastq_ascii 33
Guess: Illumina 1.8+ format (phred+33)
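(The guessed limits are just the observed character codes shifted by the phred+33 offset; a quick check of the numbers in the report above:)

```shell
# Qmin/QMax in the report are the raw ASCII codes of the quality characters;
# subtracting the phred+33 offset of 33 gives the guessed quality limits.
echo $(( 42 - 33 ))    # 9  -> guessed -fastq_qmin
echo $(( 109 - 33 ))   # 76 -> guessed -fastq_qmax
```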

Therefore, is it best practice to run --fastq_chars as a first step to determine -fastq_qmin, -fastq_qmax and -fastq_ascii, and then continue with the downstream analysis?

frederic-mahe commented 1 year ago

Above all, what I wonder is why a FASTQ quality value above qmax is treated as a fatal error.

Reads with quality values encoded on more than 41 levels are relatively recent. Observing values outside the 0-41 range used to be the sign that something went wrong. Nowadays, maybe 0-93 should be the new default range for quality values?

Therefore, is it best practice to run --fastq_chars as a first step to determine -fastq_qmin, -fastq_qmax and -fastq_ascii, and then continue with the downstream analysis?

Yes, finding the qmax value beforehand could avoid a fatal error downstream. One possibility is to capture the value in a variable:

QMAX=$(printf "@s1\nA\n+\nI\n" | \
       vsearch --fastq_chars - 2>&1 | \
       grep -o -P "fastq_qmax *\K[[:digit:]]+")

Here QMAX would contain the value 40.
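The captured value can then be fed back into the filtering command from the original post. A sketch (the extraction below runs on the report line quoted earlier rather than on a live vsearch call, so the numbers are only illustrative):

```shell
# Same extraction as above, demonstrated on the "Guess:" line that
# --fastq_chars printed for SRR19317902.
QMAX=$(printf 'Guess: -fastq_qmin 9 -fastq_qmax 76 -fastq_ascii 33\n' | \
       grep -o -P "fastq_qmax *\K[[:digit:]]+")
echo "$QMAX"   # 76

# In practice, QMAX would come from vsearch itself and be reused like this:
# vsearch --fastx_filter in.fastq --fastq_qmax "$QMAX" \
#         --fastq_stripleft 20 --fastq_stripright 20 --fastaout out.fa
```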

frederic-mahe commented 1 year ago

maybe 0-93 should be the new default range for quality values?

@torognes what do you think of that?

torognes commented 1 year ago

I agree that the world has moved forward and that some FASTQ files now contain q values above 41.

I also agree that the intention of this option was to help users detect old FASTQ files (phred 64), which is hardly a problem anymore.

Setting default qmax to 93 would essentially remove its effect, as it is the highest possible quality value obtainable with any printable character.
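(A quick sanity check of the 93 figure, independent of vsearch: phred+33 stores quality q as the character with code 33 + q, and '~', the last printable ASCII character, has code 126:)

```shell
printf '~' | od -An -tu1   # character code of '~': 126
echo $(( 126 - 33 ))       # 93, the highest quality encodable in phred+33
```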

I am a bit sceptical about changing default values and behaviour like this without a major version change, since it could potentially change results. But this change would only allow some files to be analysed that would previously have failed.

Compatibility with usearch is another issue. I haven't checked what it does.

Another possibility is to issue a warning instead of a fatal error: perhaps a warning at the first encounter, and another after reading the whole file, reporting the maximum q value observed.

billzt commented 12 months ago

@torognes I agree with issuing a warning instead of a fatal error. A fatal error is too drastic.

goodcouscous commented 2 weeks ago

Hello, I also encountered this problem when I wanted to convert my q20 FASTQ files to FASTA files. With this input

$ vsearch --fastq_filter BAR_trimmed_max950_q20.fastq --fastq_qmax 99 -fastaout BAR_max920_q20_qmax99.fasta

I got this output:

Sum of arguments to --fastq_ascii and --fastq_qmax must be no more than 126
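(That limit is the printable-ASCII ceiling mentioned earlier: with --fastq_ascii at its default of 33, quality q is encoded as the character with code 33 + q, which must not exceed 126, the code of '~'. Checking the numbers from this report:)

```shell
echo $(( 33 + 99 ))   # 132: above 126, so --fastq_qmax 99 is rejected
echo $(( 33 + 90 ))   # 123: within range, so --fastq_qmax 90 is accepted
```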

I then changed --fastq_qmax to 90.

What actually happens to the reads when I filter them again for a quality score of q10 after previously filtering them for a quality score of q20?