shenwei356 / seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
https://bioinf.shenwei.me/seqkit
MIT License

Runtime of seqkit amplicon in v2.7.0 is much longer than v2.3.0 #439

Closed: fsnibs10 closed this issue 7 months ago

fsnibs10 commented 7 months ago

Hi developers,

Recently, I used seqkit amplicon to extract target sequences from a compressed FASTQ file by supplying primer sequences. I downloaded both the latest version (v2.7.0) and an earlier version (v2.3.0).

The download commands are shown below.

wget https://github.com/shenwei356/seqkit/releases/download/v2.3.0/seqkit_linux_amd64.tar.gz
wget https://github.com/shenwei356/seqkit/releases/download/v2.7.0/seqkit_linux_amd64.tar.gz

I found that the runtime of the amplicon command in the latest version (seqkit v2.7.0) is much longer than in v2.3.0. The sequencing data file is about 550 MB and contains 5,671,607 reads. With the same command on the same server, seqkit amplicon v2.3.0 runs very fast, taking 20 seconds, while v2.7.0 takes about 7 minutes. I don't know why. My command is shown below.

seqkit amplicon --threads 8 -F AAGAGTGGAG -R GTTCATCC -o sample.read1.fq read1.fq.gz
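For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two releases could be timed side by side. It assumes each tarball was unpacked into its own directory; the directory names `seqkit-v2.3.0` and `seqkit-v2.7.0` and the helper `time_cmd` are hypothetical and not part of seqkit. Note that both tarballs share the same file name, so they need to be saved under different names (for example with `wget -O`) before unpacking.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of seqkit): run a command and print the
# elapsed wall-clock time in whole seconds.
time_cmd() {
    local start end
    start=$(date +%s)
    "$@" >/dev/null        # discard stdout; seqkit writes its result via -o
    end=$(date +%s)
    echo $(( end - start ))
}

# Assumed layout: ./seqkit-v2.3.0/seqkit and ./seqkit-v2.7.0/seqkit
for v in v2.3.0 v2.7.0; do
    bin="./seqkit-$v/seqkit"
    if [ -x "$bin" ]; then
        secs=$(time_cmd "$bin" amplicon --threads 8 \
            -F AAGAGTGGAG -R GTTCATCC -o "sample.$v.read1.fq" read1.fq.gz)
        echo "$v: ${secs}s"
    else
        echo "skipping $v: $bin not found" >&2
    fi
done
```

Writing each run to its own output file also makes it possible to check afterwards that both versions produced the same result.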

shenwei356 commented 7 months ago

Thanks for reporting this. This bug was introduced in v2.7.0. It's fixed now.

seqkit_linux_amd64.tar.gz

Besides, after checking the code, I think I can make it faster.

shenwei356 commented 7 months ago

Use this, it's slightly faster.

fsnibs10 commented 7 months ago

Thanks! I have tested this improved version with the same dataset. It takes about 20 seconds, very fast.

shenwei356 commented 5 months ago

@fsnibs10 Sorry, the previous changes introduced a bug; see #457. It occurred when more than 2 pairs of primers were given.