shenwei356 / seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
https://bioinf.shenwei.me/seqkit
MIT License
1.3k stars 158 forks source link

[Feature suggestion] Addition of a flag in seqkit locate #357

Closed nityendra21 closed 1 year ago

nityendra21 commented 1 year ago

Prerequisites

I'm using seqkit locate to match patterns of a fasta file against another: cat database.fa | seqkit locate -i -I -f query_file.fa -o result.txt

For example: database.fa consists of 1000 sequences whereas query_file.fa consists of 100 unitigs. If the pattern of unitig 1 matches all 1000 sequences, it'll identify those and write them to output and then move on to unitig 2. I'd like to request a flag that skips those 999 pattern matches if it is matched once (or more, can be user defined), thus increasing the speed of the seqkit locate process by a great deal.

Thanks in advance! :-)

shenwei356 commented 1 year ago

I'd like to request a flag that skips those 999 pattern matches if it is matched once (or more, can be user defined)

Which one is matched? In the order of given query sequences or randomly, the results would be different.

Did you try the -F/--use-fmi flag (also give a bigger -j)? It should increase the speed.

 -F, --use-fmi                   use FM-index for much faster search of lots of sequence patterns
nityendra21 commented 1 year ago

Yes I did use the maximum possible value for -j and I'll use the -F flag and get back to you! The database I'm using to pattern match is quite big (~300GB) with sequences having the same core genome, and thus the unitig fragments matching the core genome get repeatedly pattern matched, which is fine but it increases the time by a lot.

In my case, using a single node consisting of 80 CPUs and 600GB of RAM to generate the results using seqkit locate took 1.5 weeks and was still not completed. Hence the suggestion that if one pattern is matched (unitig 1) once or twice to the database to which I'm comparing to the process move on to the next sequence (unitig 2).

shenwei356 commented 1 year ago

I see, you'd like to filter unitigs that match one or two sequences in a really big sequence collection. The position and strand of the unitigs in the target sequence are not important. Am I right?

I'd recommend another way using tools like KMCP, which index a large collection of sequences and provide fast queries.

  1. Make the index for the "databases", a big fasta file.

    # compute k-mers of all sequences
    kmcp compute -k 31 --by-seq database.fasta.gz --out-dir  db-k31/
    
    # index the k-mers
    kmcp index --in-dir  db-k31/ --out-dir db.kmcp/
  2. Search with unitigs, requiring a query coverage of 100%, i.e., perfect match.

    kmcp search -d db.kmcp/ --min-query-cov 1 query.fasta -o query.fasta.tsv.gz

  3. Filter the result, requiring at least N matches for each query, and export the query IDs.

    zcat query.fasta.tsv.gz | awk '$5 >=1' | cut -f 1
nityendra21 commented 1 year ago

Precisely. Thank you very much for the suggestion!