statgen / fastQValidator

Validate FastQ Files

fastQValidator reports error on seemingly good FastQ files? #3

Closed ghost closed 8 years ago

ghost commented 9 years ago

Observed Behavior:

When running fastQValidator on what I believe is a 'normal' FastQ file generated by "fulcrum_v_043/merger.py" and modified by a "sed" one-liner, I get the following message:

ERROR on Line 85: Repeated Sequence Identifier: D3NT6Q1:316:C4UHWACXX:5:1101:1628:2198 at Lines 81 and 85
ERROR on Line 93: Repeated Sequence Identifier: D3NT6Q1:316:C4UHWACXX:5:1101:1732:2209 at Lines 89 and 93
ERROR on Line 101: Repeated Sequence Identifier: D3NT6Q1:316:C4UHWACXX:5:1101:1708:2231 at Lines 97 and 101
ERROR on Line 109: Repeated Sequence Identifier: D3NT6Q1:316:C4UHWACXX:5:1101:1939:2092 at Lines 105 and 109
ERROR on Line 117: Repeated Sequence Identifier: D3NT6Q1:316:C4UHWACXX:5:1101:1827:2130 at Lines 113 and 117
ERROR on Line 125: Repeated Sequence Identifier: D3NT6Q1:316:C4UHWACXX:5:1101:1858:2140 at Lines 121 and 125
ERROR on Line 133: Repeated Sequence Identifier: D3NT6Q1:316:C4UHWACXX:5:1101:1950:2161 at Lines 129 and 133
ERROR on Line 141: Repeated Sequence Identifier: D3NT6Q1:316:C4UHWACXX:5:1101:1769:2172 at Lines 137 and 141
ERROR on Line 149: Repeated Sequence Identifier: D3NT6Q1:316:C4UHWACXX:5:1101:1865:2180 at Lines 145 and 149
ERROR on Line 157: Repeated Sequence Identifier: D3NT6Q1:316:C4UHWACXX:5:1101:1905:2186 at Lines 153 and 157
Finished processing CNS_SVLRNA_ATCACG_L005_R2R1_001.fastq.09.fq with 438154808 lines containing 109538702 sequences.
There were a total of 54769351 errors.
Returning: 1 : FASTQ_INVALID

NOTE that only reads whose previous line ends in a "<" are the trouble ones:

 1  @D3NT6Q1:316:C4UHWACXX:5:1101:1498:2147 1:N:0:ATCACG/1
 2  CCGGATCTCAGATAAAGTAAA
 3  +
 4  BBBFFFFFFFFFFFFIFFFII
 5  @D3NT6Q1:316:C4UHWACXX:5:1101:1498:2147 2:N:0:ATCACG/2
 6  TTACTTTATCTGAGATCCGGGATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATTAAA
 7  +
 8  BBBFFFFFFFFFFIIIIIIIIIFIIFFIIFIIIFFIIIIIFFIBIII<B<7B0<BBFFFBFF'<<<<B0'777<<77<
 9  @D3NT6Q1:316:C4UHWACXX:5:1101:1688:2066 1:N:0:ATCACG/1
10  GTCCGCTGCTTTAGACCCGAAACCAGGTGATCTAGCCATGCGCAGGATGAAGGTGCGGTAACACGCACTGGAGGTCCGAACCAGTGCCCGTTGAAAAGG
11  +
12  0<FFFFFFFFFFIIIIIIIIIIIIIIIFFIIIIIIIIIIIFIIIIIIIIIIIFFFFFFBFFBFFFFFFFFBBFFFFBBBFFFFFBBBFBB7BB<BBBBB
13  @D3NT6Q1:316:C4UHWACXX:5:1101:1688:2066 2:N:0:ATCACG/2
14  GTCCAGGTTTGATTGGCCTTTCACCCCTAACCACACGTCATCCAAGACCTTTTCAACGGGCACTGGTTCGGACCTCCAGTGCGTGTTACCGCACCTTC
15  +
16  BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIFIIIIIIIIIFFFBFFFFIIFFIIFFFFFFFFFFFFFFFFBBBFFFBB7B<BB<BBB<7BB
17  @D3NT6Q1:316:C4UHWACXX:5:1101:1607:2071 1:N:0:ATCACG/1
18  GGGCGGCGGGTAACCCCGCGGGGGTGGAGCGGGAGGAACA
19  +
20  0<FFFFFFFFFFFBFFFFFFFFFF0<BBFFFFFBFFBBBF
21  @D3NT6Q1:316:C4UHWACXX:5:1101:1607:2071 2:N:0:ATCACG/2
22  GTTCCTCCCGCTCCACCCCCGCGGGGTTACCCGCCGCCCCGATCGGCGGACAGTAGAA
23  +
24  BBBFFFFFFFFFFIFFIIIIIFFIFF'0007B7<B<7<BF07'70''7<'00B0<<'<
25  @D3NT6Q1:316:C4UHWACXX:5:1101:1739:2086 1:N:0:ATCACG/1
26  GCGGGGCCACGTGCTGAGTGCTCGTCACTCTTCGGCCCCTGGGAAGGTCTGAGACA
27  +
28  0<BFFFFBFFFFIIIIIIFFFIIIIFFFFIIIIIIIIFFFFFFFBFFBBFFFFFBF
29  @D3NT6Q1:316:C4UHWACXX:5:1101:1739:2086 2:N:0:ATCACG/2
30  GTCTCAGACCTTCCCAGGGGCCGAAGAGTGACGAGCACTCAGCACGTGGCCCCGCTGATCGTCGGACTGTAGAACTCTGAACG
31  +
32  BBBFFFFFFFFFFIIIIIIIIIIIIIIIFFFIIIIIIFIIIIIIIIBFFFFF<<BFFFFFFBBFFFFBFBBBBBBBBB<00<B
33  @D3NT6Q1:316:C4UHWACXX:5:1101:1524:2116 1:N:0:ATCACG/1
34  ACGGCCCTGGCGGAGCGCTGAGAAGACGGTCGAACTTGACTATCTA
35  +
36  BBBFFFFFFFFFFII<FFBFIB7FFF<BFFBFFFF<BFFFFFBFFF
37  @D3NT6Q1:316:C4UHWACXX:5:1101:1524:2116 2:N:0:ATCACG/2
38  AGATAGTCAAGTTCGACCGTCTTCTCAGCGCTCCGCCAGGGCCGTGATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCGGTATCC
39  +
40  BBBFFFFFFFFFFIFIFFIFIIIIFFIFFFBFFIFFIIIFFFFIBFBBFFFBBF7<7B7BBBB<B<0<'007'000<0<'''<B077'''''77'7<<
41  @D3NT6Q1:316:C4UHWACXX:5:1101:1573:2135 1:N:0:ATCACG/1
42  GGCTGGTCCGAAGGTAGTGAGTTATCTCAA
43  +
44  BBBFFFFBFFFBFF0<B<BBF<FBFFFFIF
45  @D3NT6Q1:316:C4UHWACXX:5:1101:1573:2135 2:N:0:ATCACG/2
46  TGAGATAACTCACTACCTTCGGACCAGCCGATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGG
47  +
48  BBBFFFFFFFFFFIIFFFIIFIFFFFFFFFFFBF7BFF<FBFBBFFF77BBBB<<<0<7<<7'07007'<<
49  @D3NT6Q1:316:C4UHWACXX:5:1101:1533:2142 1:N:0:ATCACG/1
50  CTGGTCCGAAGGTAGTGAGTTATCTCAATA
51  +
52  BBBFFBFF<<FF<BFBFFFBFBFFBFIIII
53  @D3NT6Q1:316:C4UHWACXX:5:1101:1533:2142 2:N:0:ATCACG/2
54  ATTGAGATAACTCACTACCTTCGGACCAGGATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCG
55  +
56  BBBFFFFFFFFFFBFIIFFFFFIIIIBFFIIFBFFFFFFFFF<BFFF07BFF7<7B'77<B<77B<FB<BF07<
57  @D3NT6Q1:316:C4UHWACXX:5:1101:1738:2147 1:N:0:ATCACG/1
58  CGGGCCTCATAACCCAATTCAGACTACTCTCCCCCGCCCTCA
59  +
60  BBBFFFFFFFFFFIIFIIIIIIIIIIIIIIIFFFFIIIIIII
61  @D3NT6Q1:316:C4UHWACXX:5:1101:1738:2147 2:N:0:ATCACG/2
62  GGCGGGGGAGAGTAGTCTGAATTGGGTTATGAGGCCCGGATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGGCGCCG
63  +
64  FFFFFFFFBBBB<BBBFFFFFFFFFFBFBFFFFFFFFBFFFFFB07BBBFFFFFFBBBBBFBBBB7B07<B<<BBBBBB<'7777<
65  @D3NT6Q1:316:C4UHWACXX:5:1101:1519:2162 1:N:0:ATCACG/1
66  TGGGAGCGGGCGGGCGGC
67  +
68  0<<B<FBFFF<BFI<BB<
69  @D3NT6Q1:316:C4UHWACXX:5:1101:1519:2162 2:N:0:ATCACG/2
70  GCCGCCCGCACGCTCCCAGATCGTCGGACTGGAGAACTCTGA
71  +
72  7'7<'00<''<BB0<<'<<7BBB<'07<'07'70<00'007<
73  @D3NT6Q1:316:C4UHWACXX:5:1101:1541:2185 1:N:0:ATCACG/1
74  GGCTGGTCCGAAGGTAGTGAGTTA
75  +
76  BBBFFFFFFFFFFI<BFBFBFFFF
77  @D3NT6Q1:316:C4UHWACXX:5:1101:1541:2185 2:N:0:ATCACG/2
78  TAACTCACTACCTTCGGACCAGCCGACCGGCGGACTGCAGAAATCTGAAC
79  +
80  0''0'00'0<<'0'B<7<B''0''0<'<<'<<<'0'0''7770''0770<
81  @D3NT6Q1:316:C4UHWACXX:5:1101:1628:2198 1:N:0:ATCACG/1
82  GGCTGGTCCGAAGGTAGTGAGTTATCTCAATA
83  +
84  BBBB<F0FFB<<BB0BF<FF<7<FFIIIBFBF
85  @D3NT6Q1:316:C4UHWACXX:5:1101:1628:2198 2:N:0:ATCACG/2
86  ATTGAGATAACTCACTACCTTCGGACCAGCCGATCGTCGGACTGTAGAACTCTGAACGTGCAGATC
87  +
88  BBBFFFFFFFFFFIIIIIIIIIFFFIFBFFFBFFFFBFBBBF7B<FFFB<BF'007B70''0<0<<
89  @D3NT6Q1:316:C4UHWACXX:5:1101:1732:2209 1:N:0:ATCACG/1
90  GGTGGAGGCTCGTAGCGGTACTGACGTGCAAATCGTTCGTCAAATTA
91  +
92  BB<FFBFFFFFFF<BBFFBFFFFBBFFFFFFFBBFBFFFBFBFFFIF
93  @D3NT6Q1:316:C4UHWACXX:5:1101:1732:2209 2:N:0:ATCACG/2
94  AATTTGACGAACGATTTGCACGTCAGTACCGCTACGAGCCTCCACCGATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGT
95  +
96  BBBFFFFFFFFFFFFIIIFFFIFFFIFIFFFFFFFIIIFFFIIFFIIFFF<BBBFBFFB<<BBB<<0<BBB<B''00<<<'<<BBBB<<
97  @D3NT6Q1:316:C4UHWACXX:5:1101:1708:2231 1:N:0:ATCACG/1
98  GGCTGGTCCGAAGGTAGTGAGTTATCTCAATA
99  +
100  BBBFFFFFFFFFFIBFFFFIIFFFIIIIIIII
101  @D3NT6Q1:316:C4UHWACXX:5:1101:1708:2231 2:N:0:ATCACG/2
102  ATTGAGATAACTCACTACCTTCGGACCAGCCGATCGTCGGACTGTAGAACTCTGAACGTGTAGATATCGG
103  +
104  BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIFIIFIFFFFFIFIIBFFFBFB7BB<BB<7<70'77<<
105  @D3NT6Q1:316:C4UHWACXX:5:1101:1939:2092 1:N:0:ATCACG/1
106  TGAGTGTCGGAAAGGCGTCAGAGGAGTTGACTCTGATTGCTGGCGACAGTTGCGACTGAGGTTGTCGAAATGGTTT
107  +
108  0<FFFFFFFFFFIIIIIFFFIIIIFIFFFIIIIIIIIIIIIIIIIIIIIFFFFFFFFF<BF<BBFFBBBFFBBBBB
109  @D3NT6Q1:316:C4UHWACXX:5:1101:1939:2092 2:N:0:ATCACG/2
110  CCATTTCGACAACCTCAGTCGCAACTGTCGCCAGCAATCAGAGTCAACTCCTCTGACGCCTTTCCGACACTCACGATCGTCGGACTGTAGAACTCTG
111  +
112  FFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFBFFFFBFIIIIFFFFFFFFFFFFFFFFFFFBBFFFFB7BBBFFBBFBB<<<<<
113  @D3NT6Q1:316:C4UHWACXX:5:1101:1827:2130 1:N:0:ATCACG/1
114  TGCGCAGCCTGGGACGA
115  +
116  BBBFFFFFFFFFFIIII
117  @D3NT6Q1:316:C4UHWACXX:5:1101:1827:2130 2:N:0:ATCACG/2
118  CGTCCCAGGCTGCGCAGATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAA
119  +
120  BBBFFFFFFFFFFIIIIIIIIFFIIIIIIFFIIIFFFFFFBFFBF<FFFBFB<0<'<777<BB<00'7'7<<<'7<
121  @D3NT6Q1:316:C4UHWACXX:5:1101:1858:2140 1:N:0:ATCACG/1
122  TTCCCCGCGGGGCCCCGTCGTCCCCCGCGTCGTCGCCACCTCTCTTCCCCCCTCCTTCTTCCCGTCGGGGGCGGGGCAGGTCGGAAAAGCACA
123  +
124  BBBFFFBBFFFFBFIIF0<<7<<BFB'0<0<B0<<7B'7B7B'<7B<<BBBB'7B00''0'0B000'07<'0'07'''''00'7'7'00077B
125  @D3NT6Q1:316:C4UHWACXX:5:1101:1858:2140 2:N:0:ATCACG/2
126  GCCCCGACCCCGACGGGAAGAAGGAGGGGGGAAGAGAGGTGGCGACGACGCGGGGGACGACGGGGCCCCGCGGGGAAGATCGTCGGACTGTAGAA
127  +
128  BBBFFFFFFFFFFIIIIFFFFFIIFIIIIFF7BFFFFFFBBFBBFFF77B<BBFFF7BFBBBBBF077BBB<BFB00<''0<0BBB'07<00B<B
129  @D3NT6Q1:316:C4UHWACXX:5:1101:1950:2161 1:N:0:ATCACG/1
130  GGTAATCTTGTGAAACTCTGTCGA
131  +
132  BB<BFFFFFFFFFIIIIIIIIBFF
133  @D3NT6Q1:316:C4UHWACXX:5:1101:1950:2161 2:N:0:ATCACG/2
134  CAGAGTTTCACAAGATTACCGATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTAGGTGGTCGGCGTATCATTAAAAA
135  +
136  FFFFFFFFFFIIIIIIIII7FFFFFIIFIFIFFFIF<F<<<FFF7<<<B<BF<<7B'7'07B<77'070070'00''<<B
137  @D3NT6Q1:316:C4UHWACXX:5:1101:1769:2172 1:N:0:ATCACG/1
138  CCCGCCGGGGTCGGA
139  +
140  BBBFFFFFFFFFFIB
141  @D3NT6Q1:316:C4UHWACXX:5:1101:1769:2172 2:N:0:ATCACG/2
142  CCCCGGCGGGGGTCGTCGG
143  +
144  BBBB7<<BBBB'077'70<
145  @D3NT6Q1:316:C4UHWACXX:5:1101:1865:2180 1:N:0:ATCACG/1
146  CCGGGCCTCATAACCCAATTCAGACTACTCTCCCCCGCCCTCCA
147  +
148  BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIFIFIBFFFIIIIF
149  @D3NT6Q1:316:C4UHWACXX:5:1101:1865:2180 2:N:0:ATCACG/2
150  GGAGGGCGGGGGAGAGTAGTCTGAATTGGGTTATGAGGCCCGGGATCGTCGGACTGTAGAACTCTGAACGTGCAGATCTCGGTGG
151  +
152  BBBFFFFFFFFF<<7<0<BBBFFBBFFFFF7BBBBBFBBFFBFBBBFFBB7<<BBB<BBFBB<<BBB0<70'''00'00<<77<<
153  @D3NT6Q1:316:C4UHWACXX:5:1101:1905:2186 1:N:0:ATCACG/1
154  GTGCCATCCGAGGAAAGATTGGATTGCCTCATAGCATCAAATTAAGCAGAAGACGTTCCCGAAGCAAAAGTCCATTTAGGAAAGACAAGAGCCCTGTGAG
155  +
156  BBBFBFFFFFFFFFFIFIIIIFFFFFBFF0BBFFFFIIFFBFFFB<FFFFIIIIF77B'BFB<BFBBBBBF0B<B<B<<<BB7<B<B<7BB<77B7<BBB
157  @D3NT6Q1:316:C4UHWACXX:5:1101:1905:2186 2:N:0:ATCACG/2
158  GGTCCGGAGCTCACAGGGCTCTTGTCTTTCCTAAATGGACTTTTGCTTCGGGAACGTCTTCTGCTTAATTTGATGCTATGAGGCAATCCAATCTTTCC
159  +
160  BBBFFFFFFFFFFIIFIIIIIIBFFFFFIBBFIIFFBFFIBFFFFFIFIIIFBBFFBFFFBFFFFFFFFFFBBF<BBBFBBBF7BBBBBB<BBBFBFB
161  @D3NT6Q1:316:C4UHWACXX:5:1101:1864:2202 1:N:0:ATCACG/1
162  ACGGCCCTGGCGGAGCGCTGAGAAGACGGTCGAACA
163  +
164  BBBFFBFFFFFBF<FFFIFF<FFFFIFIIFFFFFFB
165  @D3NT6Q1:316:C4UHWACXX:5:1101:1864:2202 2:N:0:ATCACG/2
166  GTTCGACCGTCTTCTCAGCGCTCCGCCAGGGCCGCGGTCGTCGGACTGCAGAAATTCCGAGCCCCTTCGTCCCGTGGCCGCCGTACCGATCATAAC
167  +
168  BBBBFF<BFBFFBB<FB<<BB'<<0BFFFIFFF<'0''0'0000'''0''0'0'0''0'''''0''0''07<0'7'07<'7'7'''0'0'''000<
169  @D3NT6Q1:316:C4UHWACXX:5:1101:1914:2207 1:N:0:ATCACG/1
170  CTAGTGGTTAGGATTCGGCA
171  +
172  BBBFBFFFFFFFFIIIIFFF
173  @D3NT6Q1:316:C4UHWACXX:5:1101:1914:2207 2:N:0:ATCACG/2
174  GCCGAATCCTAACCACTAGGATCGTCGGACTGTAGAACTCTGAACGTGGGGGTCCTGGTGGTGG
175  +
176  BBBFFFFFFFFFFIFBBFFFFIIIFIII7FBFFFFBFF'BBB07BB<B''0770''7<<0<00<
177  @D3NT6Q1:316:C4UHWACXX:5:1101:2092:2058 1:N:0:ATCACG/1
178  GCTGGTCCGAAGGTAGTGAGTTATCTCAATA
179  +
180  0<BFFFFFFFFFBBFF0FBF<BFBFFIFF<F
181  @D3NT6Q1:316:C4UHWACXX:5:1101:2092:2058 2:N:0:ATCACG/2
182  ATTGAGATAACTCACTACCTTCGGACCAGCCGATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATTGAA
183  +
184  BBBFFFFFFFFFFIIIIIIIIIIIIIIFIIIIIFBFBFFIBFFI<B<0B<FB<<'B<B070<B0<B<<7BBB<07<777'''''00'0<
185  @D3NT6Q1:316:C4UHWACXX:5:1101:2166:2061 1:N:0:ATCACG/1
186  CCCTGGTGGTCTAGTGGTTAGGATTCGGCGCA
187  +
188  0<FFFFBFFFFFIIFFIFIFFIFFIIIIIIII
189  @D3NT6Q1:316:C4UHWACXX:5:1101:2166:2061 2:N:0:ATCACG/2
190  GCGCCGAATCCTAACCACTAGACCACCAGGGAGATCGTCGGACTGTAGAACTCAGAACGTGTAGATCTAGGTGG
191  +
192  BBBFFFFFFFFFFIIFIIIIIIIFIIIIIIIFFFBFIFIIIFBBFFIFFFFFFFFFFFB<B77<<0<<'7<7<B
193  @D3NT6Q1:316:C4UHWACXX:5:1101:2206:2064 1:N:0:ATCACG/1
194  TACTTCTTTCATTTTATCACCATTATAAAGTAGGGATTCTACCATTTGCAAGAGGAAATACCTTACCTCCCTCAAGCTCACTTTTTATTTGCTTAAG
195  +
196  0<FFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIFIIIIIIIFIFFFFFFIIIFFFFFFFFFFFFFBFFFFFFBBBB
197  @D3NT6Q1:316:C4UHWACXX:5:1101:2206:2064 2:N:0:ATCACG/2
198  CACAAAGGCAATGGGATACATGGCAATGCATATAAACAATATGTTTATGATAAACTTATTAACAGTCAGACGGATAAAGAGTTCTTATCTTCTCTTAAGCA
199  +
200  BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIIIIIIFFFFIIIIIIIIIIIIIIIFIIIIFFFFFFFBFBFFFFFFFFFFFFBBBFF

I would expect this file to pass QC

Thanks

mktrost commented 9 years ago

Every other record is failing validation. The current version of fastQValidator only reads the sequence identifier up to the first space and assumes anything after that is a description and not part of the sequence identifier. By default, it also checks that each sequence identifier in the FASTQ file is unique. In CASAVA 1.8 files, the read indicator comes after the space, so for a FASTQ file containing both reads of each pair, fastQValidator sees multiple records with the same sequence identifier and, by default, fails.
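To illustrate the behavior described above, here is a minimal Python sketch. It is not fastQValidator's code, and the names `sequence_id` and `find_repeated_ids` are made up for this example. It assumes strict four-line records, cuts each header at the first space, and reports identifiers that were already seen. Run on lines 81 to 88 of the file in the report, it flags the same pair of lines as the first error message.

```python
def sequence_id(header: str) -> str:
    """Return the text after '@' up to the first space (hypothetical helper)."""
    return header[1:].split(" ", 1)[0]


def find_repeated_ids(lines, first_line_no=1):
    """Return (identifier, first_line, repeat_line) for each repeated identifier.

    Assumes every record is exactly four lines, so header lines sit at
    offsets 0, 4, 8, ... from the start of `lines`.
    """
    first_seen = {}
    repeats = []
    for offset, line in enumerate(lines):
        if offset % 4 != 0:
            continue  # skip sequence, '+', and quality lines
        line_no = first_line_no + offset
        seq_id = sequence_id(line.rstrip("\n"))
        if seq_id in first_seen:
            repeats.append((seq_id, first_seen[seq_id], line_no))
        else:
            first_seen[seq_id] = line_no
    return repeats


# Lines 81-88 of the FASTQ file quoted in the report.
sample = [
    "@D3NT6Q1:316:C4UHWACXX:5:1101:1628:2198 1:N:0:ATCACG/1",
    "GGCTGGTCCGAAGGTAGTGAGTTATCTCAATA",
    "+",
    "BBBB<F0FFB<<BB0BF<FF<7<FFIIIBFBF",
    "@D3NT6Q1:316:C4UHWACXX:5:1101:1628:2198 2:N:0:ATCACG/2",
    "ATTGAGATAACTCACTACCTTCGGACCAGCCGATCGTCGGACTGTAGAACTCTGAACGTGCAGATC",
    "+",
    "BBBFFFFFFFFFFIIIIIIIIIFFFIFBFFFBFFFFBFBBBF7B<FFFB<BF'007B70''0<0<<",
]

for seq_id, first, repeat in find_repeated_ids(sample, first_line_no=81):
    print(f"Repeated Sequence Identifier: {seq_id} at Lines {first} and {repeat}")
```

Both headers reduce to the same identifier once the `1:N:0:ATCACG/1` and `2:N:0:ATCACG/2` parts are dropped, which is why the second mate of every pair is reported.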

Quick solution - disable the unique sequence id validation. That will ignore those errors. To do that, add the --disableSeqIDCheck flag.

Thank you for pointing out this issue. I think it would be nice to add an option to validate a FASTQ file containing both read pairs. Unfortunately, I probably won't be able to make this enhancement for a couple of weeks, or possibly until early next year.

Mary Kate Wing

gawbul commented 9 years ago

Just out of interest, has this issue been patched? It seems the code hasn't been updated in some time, so I assume not?

mktrost commented 9 years ago

Sorry, you are correct, it was not fixed. I just committed updates to libStatGen and FastqValidator to add a --interleaved option that should make this work. Please give it a try and let me know if it works. If it does, we can close this issue.
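For readers wondering what an interleaved-aware check might involve, the following Python sketch shows one plausible reading. It is not taken from the libStatGen or FastqValidator source, and the function name `check_interleaved` and its error messages are hypothetical. It assumes that records come in consecutive pairs, that the two mates of a pair share the identifier before the first space, and that no identifier may appear in more than one pair.

```python
def check_interleaved(headers):
    """Check FASTQ header lines (in file order) under an assumed interleaved rule.

    Returns a list of human-readable error strings; an empty list means the
    headers satisfy the rule sketched here.
    """
    # Identifier = text after '@' up to the first space.
    ids = [h[1:].split(" ", 1)[0] for h in headers]
    errors = []
    if len(ids) % 2 != 0:
        errors.append("odd number of records, last read has no mate")

    seen = set()
    for i in range(0, len(ids) - 1, 2):
        first, second = ids[i], ids[i + 1]
        if first != second:
            errors.append(
                f"records {i + 1} and {i + 2} are not mates: {first} vs {second}"
            )
        # An identifier may be shared within a pair, but not across pairs.
        pair_ids = (first,) if first == second else (first, second)
        for seq_id in pair_ids:
            if seq_id in seen:
                errors.append(f"identifier repeated across pairs: {seq_id}")
            seen.add(seq_id)
    return errors


# Headers from lines 1, 5, 9, and 13 of the file in the report.
headers = [
    "@D3NT6Q1:316:C4UHWACXX:5:1101:1498:2147 1:N:0:ATCACG/1",
    "@D3NT6Q1:316:C4UHWACXX:5:1101:1498:2147 2:N:0:ATCACG/2",
    "@D3NT6Q1:316:C4UHWACXX:5:1101:1688:2066 1:N:0:ATCACG/1",
    "@D3NT6Q1:316:C4UHWACXX:5:1101:1688:2066 2:N:0:ATCACG/2",
]
print(check_interleaved(headers))
```

Under this rule the reported file's headers produce no errors, while a pair that appears twice or a read without its mate would still be flagged.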

Sorry for the delay. Thank you for checking back in with me.

gawbul commented 9 years ago

Excellent! Thanks for the update! Will give it a try :)