lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License

Incomplete output with seqtk sample #79

Closed annaquaglieri16 closed 8 years ago

annaquaglieri16 commented 8 years ago

Hi, I am not sure if this issue has already been discussed but I could not find it. I have a problem understanding the output of seqtk sample. Here is an example:

Input (8 lines = 2 reads)

unix302 593 % head -8 SRX959064_combined_1.fastq
@SRR1918731.1 HWI-D00279:55:C4GHGACXX:1:2310:18899:10180 length=100
TGGGCGCCCCCTGCTGGCGACTAGGGCAACTGCAGGGCTCTCTTGCTTAGAGTGGTGGCCAGCGCCCCCTGCTGGCGCCGGGGCACTGCAGGGCCCTCTT
+SRR1918731.1 HWI-D00279:55:C4GHGACXX:1:2310:18899:10180 length=100
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIFIFFIIIFFFFFFFFFFFFFFBFBFBBFFFFFFFFFFFFFFFFBFFFFBFBBBBBFFFBBBFFB
@SRR1918731.2 HWI-D00279:55:C4GHGACXX:1:2208:16499:59719 length=100
TGGAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATATTCTGGCCCCTGTTGTCTGCATGTA
+SRR1918731.2 HWI-D00279:55:C4GHGACXX:1:2208:16499:59719 length=100
BBBFFFFFFFFFFIIIIIIFFIIIFFIIIIIIFIIIIIIIIIFFIIIIIIIIIIBFFIIIIIIIFFFFFFFFFFFBFFFFFFFFFFFBFFFFFFFFFFFF

After subsampling with seqtk sample -s1659 10000 SRX959064_combined_1.fastq > SRX959064_combined_1000_1.fastq

I get this output

unix302 598 % head -8 SRX959064_combined_1000_1.fastq
@SRR1918731.47994563 HWI-D00279:55:C4GHGACXX:1:1110:14027:89645 length=100
GTGCTTGAGAAGATGTTTGTCCTGCATGGTGGAGAGTGGAGAAGGGCCAGGATTCTTAGGTTGATCTATCTGTGGGTTATGACTTCCCACAATAGCCACC
+
BBBFFFBFFFFFFIIFIIIFFFFFFFIIIFFFBFFIBFIIFIFFIFFIIIIIIIIIIIIIIIIIIIIFFFFFFFFBBBBFFFBFFFFFFFFFFFFFFFFF
@SRR1918731.13178074 HWI-D00279:55:C4GHGACXX:1:1208:5953:4754 length=100
CTTTGATGTGAAAGGGGCAGCACAGTCATTTAAACTTGATCCAACCTCTTTGCATCTTACAAAGTTAAACAGCTAAAAGAAGTAAAATAAGAAGGCAATG
+
BBBFFFFFFFFFFFFIIIIIFIFIFFFFIIIIIIBFFFIIIIBFFIIIIFIFIFFIIIIIIIIIFIFFIIIIFFFFFFFBBFBB<<BBBBFBFBBFFFFB

where the third line of every entry (which should be the same as the first one) has been reduced to a bare +. When aligning with STAR I get problems, and I have a feeling that this is the cause.

I hope you can help me, Thanks for your time!

Anna

lh3 commented 8 years ago

Have a look at the FASTQ wiki page. The text after the + on the third line of each record is optional. In fact, few FASTQ writers (apart from fastq-dump) write it. I am pretty sure the STAR problem is not caused by this.
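
To illustrate the point about the optional third line, here is a minimal Python sketch, not part of seqtk. It assumes strict four-line records (no line-wrapped sequences), and the function name `read_fastq` and the toy records are hypothetical. It parses a record written with the repeated header and the same record written with a bare +, and shows that both yield the same read.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream.

    Assumption: sequences and qualities are not wrapped across lines.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        # Only the leading '+' is required; anything after it is ignored.
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' separator, got: {plus!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


# Toy records (made up for illustration, not taken from the files above).
WITH_REPEAT = "@read1 length=4\nACGT\n+read1 length=4\nIIII\n"
BARE_PLUS = "@read1 length=4\nACGT\n+\nIIII\n"

records_a = list(read_fastq(io.StringIO(WITH_REPEAT)))
records_b = list(read_fastq(io.StringIO(BARE_PLUS)))
print(records_a == records_b)
```

Running the same function over the subsampled file is a quick way to confirm that every record is well formed before looking elsewhere for the cause of an aligner error.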