Closed nityendra21 closed 1 year ago
I'd like to request a flag that skips those 999 pattern matches if it is matched once (or more, can be user defined)
Which one is matched? In the order of given query sequences or randomly, the results would be different.
Did you try the -F/--use-fmi
flag (also give a bigger -j
)? It should increase the speed.
-F, --use-fmi use FM-index for much faster search of lots of sequence patterns
Yes I did use the maximum possible value for -j
and I'll use the -F
flag and get back to you!
The database I'm using to pattern match is quite big (~300GB) with sequences having the same core genome, and thus the unitig fragments matching the core genome get repeatedly pattern matched, which is fine but it increases the time by a lot.
In my case, using a single node consisting of 80 CPUs and 600GB of RAM to generate the results using seqkit locate
took 1.5 weeks and was still not completed. Hence the suggestion that if one pattern is matched (unitig 1) once or twice to the database to which I'm comparing to the process move on to the next sequence (unitig 2).
I see, you'd like to filter unitigs that match one or two sequences in a really big sequence collection. The position and strand of the unitigs in the target sequence are not important. Am I right?
I'd recommend another way using tools like KMCP, which index a large collection of sequences and provide fast queries.
Make the index for the "databases", a big fasta file.
# compute k-mers of all sequences
kmcp compute -k 31 --by-seq database.fasta.gz --out-dir db-k31/
# index the k-mers
kmcp index --in-dir db-k31/ --out-dir db.kmcp/
Search with unitigs, requiring a query coverage of 100%, i.e., perfect match.
kmcp search -d db.kmcp/ --min-query-cov 1 query.fasta -o query.fasta.tsv.gz
Filter the result, requiring at least N matches for each query, and export the query IDs.
zcat query.fasta.tsv.gz | awk '$5 >=1' | cut -f 1
Precisely. Thank you very much for the suggestion!
seqkit version
I'm using
seqkit locate
to match patterns of a fasta file against another:cat database.fa | seqkit locate -i -I -f query_file.fa -o result.txt
For example:
database.fa
consists of 1000 sequences whereasquery_file.fa
consists of 100 unitigs. If the pattern of unitig 1 matches all 1000 sequences, it'll identify those and write them to output and then move on to unitig 2. I'd like to request a flag that skips those 999 pattern matches if it is matched once (or more, can be user defined), thus increasing the speed of theseqkit locate
process by a great deal.Thanks in advance! :-)