lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License

Subseq extracting reads with query name list failed #169

Open yangyxt opened 3 years ago

yangyxt commented 3 years ago

I am using seqtk 1.3. I tried to extract reads from a FASTQ file with subseq, and it failed.

The in.fq file is simulated, so the query names carry an ascending ID number:

[Screenshot: query names in in.fq]

I used awk to confirm that the query names listed in name.lst exist in the fq file. I then tried extracting the first few reads, using a file containing the names sim_sample_1_chr7-chr7-1, sim_sample_1_chr7-chr7-3, and sim_sample_1_chr7-chr7-5 (one name per line), and it worked. But when I chose a query name located much further into the fq file, the extraction with seqtk subseq failed!
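For anyone who wants to repeat the existence check without awk, here is a minimal Python sketch. It assumes a strict 4-line-per-record FASTQ layout and takes the read name to be the header text after `@` up to the first whitespace. The function names `fastq_names` and `locate_names` are hypothetical helpers written for this illustration and are not part of seqtk.

```python
def fastq_names(lines):
    """Yield the read name of each record, assuming strict 4-line FASTQ records."""
    for i, line in enumerate(lines):
        # Assumption: every 4th line (0, 4, 8, ...) is a header line starting with '@'.
        if i % 4 == 0 and line.strip():
            fields = line.strip()[1:].split()
            yield fields[0] if fields else ""


def locate_names(lines, wanted):
    """Map each wanted name to the 1-based record number of its first hit, or None."""
    found = {name: None for name in wanted}
    for record_no, name in enumerate(fastq_names(lines), start=1):
        if name in found and found[name] is None:
            found[name] = record_no
    return found
```

To run it on the files from this report, one could open in.fq and pass the file object as `lines`, with the stripped lines of name.lst as `wanted`. The record numbers show how deep into the file each name sits, which helps compare against the point where extraction stops.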

In my tests, fetching any read that comes before the query name sim_sample_1_chr7-chr7-343063 works well. Any query name that comes after it fails to be extracted.

Here is an example. First, a screenshot of a test name.lst:

[Screenshot: test name.lst]

(I assure you that every query name in this list exists in the in.fq file, confirmed by awk.)

Then a screenshot of the sequences extracted by running seqtk subseq in.fq name.lst | less -S -

[Screenshot: output of seqtk subseq]

I am confused about why this happened. Does seqtk only read part of the fq file into memory for inspection? Please take a look at this issue at your convenience. Much appreciated.

yangyxt commented 3 years ago

I just used seqtk seq to view the same FASTQ file and found that its output ends at the query name sim_sample_1_chr7-chr7-343063. Why is there a limit on how much of the data seqtk reads? I don't see this limit described in the manual, nor any argument I can use to remove it.
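One way to narrow this down is to inspect the records around the point where the output stops. The sketch below is only a diagnostic aid and does not claim to explain seqtk's behavior. It assumes the file is meant to be strict 4-line FASTQ and reports the first record that breaks that layout. The function name `first_malformed_record` is hypothetical.

```python
def first_malformed_record(lines):
    """Return (record_number, reason) for the first bad 4-line record, or None."""
    record = []
    record_no = 0
    for line in lines:
        record.append(line.rstrip("\r\n"))
        if len(record) < 4:
            continue
        record_no += 1
        header, seq, plus, qual = record
        record = []
        if not header.startswith("@"):
            return record_no, "header does not start with '@'"
        if not plus.startswith("+"):
            return record_no, "third line does not start with '+'"
        if len(seq) != len(qual):
            return record_no, "sequence and quality lengths differ"
    # Leftover non-empty lines mean the file stops partway through a record.
    if any(record):
        return record_no + 1, "file ends in the middle of a record"
    return None
```

Passing the open in.fq file object as `lines` would return `None` if every record follows the 4-line layout, or the number of the first record that does not. That number can then be compared with the position of sim_sample_1_chr7-chr7-343063.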