lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License
1.37k stars 308 forks source link

seqtk sample adding ">" to some read IDs #110

Closed mclaugsf closed 6 years ago

mclaugsf commented 6 years ago

When I run seqtk sample some of the read IDs are getting the "@" character swapped out for a ">".

ubuntu@ip-10-0-0-200:/efs/Inputs/human-datasets/ASV_0009$ gunzip -c ASV_0009.Tumor.RNA.CeGat.R1.fastq.gz | grep ">HISEQ" | wc -l
0

There are no reads which match ">HISEQ" in the input. But they come out like this w/seqtk:

ubuntu@ip-10-0-0-200:/efs/Inputs/human-datasets/ASV_0009$ seqtk sample ASV_0009.Tumor.RNA.CeGat.R1.fastq.gz 0.01 | grep ">HISEQ" | head
>HISEQ:103:H3KNHBCXX:1:1101:15244:7826 1:N:0:GTGCTT
>HISEQ:103:H3KNHBCXX:1:1101:1383:12560 1:N:0:GTGCTT
>HISEQ:103:H3KNHBCXX:1:1101:16556:20181 1:N:0:GTGCTT
>HISEQ:103:H3KNHBCXX:1:1101:20083:32828 1:N:0:GTGCTT
>HISEQ:103:H3KNHBCXX:1:1102:6971:11043 1:N:0:GTGCTT
>HISEQ:103:H3KNHBCXX:1:1102:14390:13964 1:N:0:GTGCTT
>HISEQ:103:H3KNHBCXX:1:1102:16661:48303 1:N:0:GTGCTT
>HISEQ:103:H3KNHBCXX:1:1102:20756:86659 1:N:0:GTGCTT
>HISEQ:103:H3KNHBCXX:1:1103:5772:25970 1:N:0:GTGCTT
>HISEQ:103:H3KNHBCXX:1:1103:20723:75978 1:N:0:GTGCTT

And if I take the first one out of the input file it doesn't have the leading ">" character:

ubuntu@ip-10-0-0-200:/efs/Inputs/human-datasets/ASV_0009$ gunzip -c ASV_0009.Tumor.RNA.CeGat.R1.fastq.gz | grep HISEQ:103:H3KNHBCXX:1:1101:15244:7826
@HISEQ:103:H3KNHBCXX:1:1101:15244:7826 1:N:0:GTGCTT

Happening in version 1.2-r94 and I also tried 1.2-r102-dirty

mclaugsf commented 6 years ago

I see that all the read IDs it does this to don't have sequence or quality scores i.e.:

ubuntu@ip-10-0-0-200:/efs/Inputs/human-datasets/ASV_0009$ gunzip -c ASV_0009.Tumor.RNA.CeGat.R1.fastq.gz | grep HISEQ:103:H3KNHBCXX:1:1101:15244:7826 -A4
@HISEQ:103:H3KNHBCXX:1:1101:15244:7826 1:N:0:GTGCTT

+

@HISEQ:103:H3KNHBCXX:1:1101:15148:7833 1:N:0:GTGCTT

seqtk sample turns this into:

ubuntu@ip-10-0-0-200:/efs/Inputs/human-datasets/ASV_0009$ grep "HISEQ:103:H3KNHBCXX:1:1101:15244:7826" blah1.fastq.gz  -A4
>HISEQ:103:H3KNHBCXX:1:1101:15244:7826 1:N:0:GTGCTT

@HISEQ:103:H3KNHBCXX:1:1101:2877:8014 1:N:0:GTGCTT
TTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+

Is the FASTQ spec being broken by having a read ID for a read that doesn't exist in the original input?

shenwei356 commented 6 years ago

Same problem occured a few days ago: https://github.com/lh3/seqtk/issues/109 , but it' hard to change in near future. You may use seqkit (>=v0.8.0) for now, which just fix this.

mclaugsf commented 6 years ago

Thanks! I thought it was a bug.