Open smehringer opened 1 year ago
Hi Svenja,
Is the search of kmtricks resp. HowDeSBT equivalent? Meaning that if I use kmtricks, the search timings and results are the same as if I would use the original HowDeSBT index/query.
Yes. We use a modified version of HowDeSBT in which we have changed the hash function to fit the one used in kmtricks for constructing the bloom filters. This does not impact the query time.
Another question: How do I determine the Bloomfilter Size?
This question also applies to HowDeSBT alone. The answer depends on the acceptable false positive rate and on your available disk space. You can look, for instance, at this site (with a single hash function) https://hur.st/bloomfilter/?k=1 to see the relation between the number of elements in the filter, the size of the filter, and the false positive rate.
Good to know: we have released a simplified version of kmtricks that no longer uses HowDeSBT. It is called kmindex: https://github.com/tlemane/kmindex. The Bloom filter construction is twice as fast, and the query time is reduced by several orders of magnitude. However, without HowDeSBT, the final index is bigger. On complex metagenomic data (Tara Oceans) the index is 10% bigger. The output is the same as the one provided by kmtricks.
In both cases (kmtricks+howDeSBT or kmindex), it is possible to use the findere trick to decrease false positives.
I hope this helps.
Best, Pierre
Hi Svenja,
Some additional info:
To quickly estimate the number of elements in each filter (= number of distinct k-mers), you can use ntCard on each sample (https://github.com/bcgsc/ntCard). Then you can compute the right size according to the maximum number of distinct k-mers.
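The sizing rule described here can be sketched as follows. This is only a minimal illustration, assuming a Bloom filter with a single hash function, for which the false positive rate is approximately 1 - exp(-n/m) with n distinct k-mers and m bits. The function names and the per-sample counts are hypothetical; in practice the counts would come from a tool such as ntCard.

```python
import math


def bloom_bits_single_hash(n_distinct_kmers: int, max_fpr: float) -> int:
    """Bits needed so a one-hash Bloom filter stays at or below max_fpr."""
    if n_distinct_kmers <= 0:
        raise ValueError("n_distinct_kmers must be positive")
    if not 0.0 < max_fpr < 1.0:
        raise ValueError("max_fpr must be strictly between 0 and 1")
    # With one hash function, FPR ~= 1 - exp(-n / m); solve for m.
    return math.ceil(-n_distinct_kmers / math.log1p(-max_fpr))


def expected_fpr_single_hash(n_distinct_kmers: int, n_bits: int) -> float:
    """Approximate false positive rate of a one-hash Bloom filter."""
    return 1.0 - math.exp(-n_distinct_kmers / n_bits)


# Hypothetical distinct k-mer counts per sample (placeholders, not real data).
sample_counts = {"sample_a": 1_200_000, "sample_b": 3_500_000, "sample_c": 2_100_000}

# Size the filter for the sample with the most distinct k-mers.
largest = max(sample_counts.values())
bits = bloom_bits_single_hash(largest, max_fpr=0.05)
print(bits, expected_fpr_single_hash(largest, bits))
```

Sizing for the largest sample means every smaller sample ends up with a lower false positive rate than the target, at the cost of more disk space.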
You can read more about findere here: https://github.com/lrobidou/findere
With a description of your dataset, I will be better able to suggest the right pipeline. You can send me any useful information at teo[dot]lemane[at]proton[dot]me.
Téo
Hi Pierre, hi Téo,
thanks a lot for your quick replies! This answers all of my questions.
We were on the right track then but wanted to make sure to have a fair comparison (without errors on our side using the tools). We are working on a similar data structure that supports AMQs and want to compare ourselves to you.
I will try running kmtricks and kmindex now and report back with any feedback that comes up.
Best, Svenja
EDIT: We plan to test the tools on RefSeq (all complete genomes) and part of the 40k RNA Seq Files from the most recent Mantis paper.
I already stumbled over the first issue:
In the example one should build the index after kmtricks pipeline with
kmtricks index ...
But in version v1.2.1 the subcommand index does not exist (only [pipeline|dump|aggregate|infos]). The subcommand query also does not seem to exist.
The example is probably outdated. Should I work on a former version of kmtricks?
I hadn't set some options when building. I have updated the binary.
Now it's:
kmtricks [pipeline|repart|superk|count|merge|format|filter|dump|aggregate|index|query|infos]
Exactly. We should make this building option (-w) clearer in the doc.
EDIT: We plan to test the tools on RefSeq (all complete genomes) and part of the 40k RNA Seq Files from the most recent Mantis paper.
The use of kmtricks to generate indexes was originally intended for collections (hundreds or thousands) of large sequencing samples (like Tara metagenomes). Unfortunately, there is a known issue when the number of samples to index is very large, as in your case of genome indexing. I think you will encounter a problem related to the number of simultaneously opened files. I have a plan to fix this but haven't found the time to do it yet.
Thank you for the heads-up! Luckily, I still have
$ ulimit -Hn
1048576
from when I contacted our IT when trying to test some tools. So we should be able to work around that.
Hi there,
so the kmtricks pipeline seems to be having trouble. The data is ~100 GB (uncompressed), 25,000 files, RefSeq genomes.
Any ideas? The file limit should be fine. We have 1 TB of RAM at our disposal and the max resident size was only ~55 GB (see info), so that's not the problem.
Side question:
What's the difference between hash:bft:bin and hash:bf:bin? The latter seems to be required by kmindex, but the example I'm following recommended using the former.
Hi
I think Téo will confirm, but my guess is that the issue comes from the file limit. You have 139 partitions * 25000 files. That is 3475000, which is higher than your (already high) ulimit.
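The arithmetic behind this guess can be checked with a short sketch. It assumes, as the guess above does, that one file per (partition, sample) pair may be open at the same time; that is not a confirmed description of kmtricks internals. The helper names are hypothetical.

```python
def required_open_files(partitions: int, samples: int) -> int:
    """Upper bound on simultaneously open files: one per (partition, sample)."""
    return partitions * samples


def hard_nofile_limit() -> int:
    """Hard limit on open files for the current process (Unix only)."""
    import resource  # not available on Windows

    return resource.getrlimit(resource.RLIMIT_NOFILE)[1]


needed = required_open_files(partitions=139, samples=25_000)
reported_limit = 1_048_576  # value of `ulimit -Hn` quoted earlier in the thread

print(needed)                    # 3475000
print(needed <= reported_limit)  # False
```

On the machine that runs the pipeline, `hard_nofile_limit()` can replace the hard-coded value.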
About the difference between hash:bft:bin and hash:bf:bin (again, wait for a formal confirmation by Téo).
Hi,
The error seems to be related to the number of opened files. However, the first step (superk) should work with your configuration.
Recently I got some feedback from users who tried kmtricks to index genomes (tens of thousands of samples), leading to the identification of some issues in such a case:
I definitely have to fix that. Unfortunately, I don't know when I can do it.
In the meantime, I see two workarounds:
Sorry for the inconvenience.
Teo
Hi there,
thanks for the response. Scaling down the number of threads from 32 to 16 worked for now.
What does
query "[name]40049279" contains no searchable smers
mean? (Query length is 250, kmer size 32)
Is the query not searchable at all?
That's strange behavior. Have you checked this particular query, 40049279? Does it contain non-ACGT characters?
Sorry, my fault I think. I gave kmtricks a FASTQ file instead of FASTA (I noticed that only every 4th query did not have problems). Running it with FASTA again, but it's taking quite some time.
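For reference, converting FASTQ queries to FASTA can be sketched in a few lines. This is a minimal illustration that assumes strict four-line FASTQ records (header, sequence, `+` separator, qualities) with unwrapped sequences. The function name is hypothetical and not part of kmtricks.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines from strict 4-line FASTQ records."""
    for index, raw in enumerate(lines):
        line = raw.rstrip("\n")
        position = index % 4
        if position == 0:
            # FASTQ headers start with '@'; FASTA headers start with '>'.
            if not line.startswith("@"):
                raise ValueError(f"line {index + 1}: expected a FASTQ header")
            yield ">" + line[1:]
        elif position == 1:
            yield line  # the sequence itself
        # Positions 2 ('+' separator) and 3 (qualities) are dropped.


example = ["@read1\n", "ACGT\n", "+\n", "IIII\n", "@read2\n", "GGCC\n", "+\n", "!!!!\n"]
print("\n".join(fastq_to_fasta(example)))
```

For a real file, the same generator can be fed an open file handle and its output written line by line.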
After about 2h I stopped the command because I noticed I hadn't provided the threads option (oops), but it seems that 80 threads is the default. In htop, however, it seemed that kmtricks was only using a single thread the whole time.
Rerunning now with:
kmtricks query --run-dir ${KM_DIR}/kmtricks_index --query ${QUERY_FILE_FASTA} --threshold 0.7 --no-detail --threads 32 > ${KM_DIR}/kmtricks.result
[2023-01-27 12:05:30.478] [info] Run with Kmer<64> - __uint128_t implementation
Input is a 2.8 GB FASTA file with 10M queries of length 250.
Can you make an assumption on the expected runtime?
EDIT: It has been an hour now with the above command; htop shows only a single thread being used and no output has been written yet. RAM usage is constant at ~50G. Index size (full kmtricks_index directory) on disk is ~300G.
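To put the workload in numbers, here is a rough sketch of the k-mer counts involved. It assumes every k-mer of every query is looked up once, and that `--threshold 0.7` means at least 70% of a query's k-mers must be found. Both are assumptions about the tool, not confirmed behavior, and the sketch does not predict runtime.

```python
import math


def kmers_per_query(query_length: int, k: int) -> int:
    """Number of overlapping k-mers in a sequence of the given length."""
    return max(query_length - k + 1, 0)


per_query = kmers_per_query(query_length=250, k=32)
total_lookups = per_query * 10_000_000   # 10M queries, as stated above
min_hits = math.ceil(0.7 * per_query)    # k-mers needed to pass the assumed threshold

print(per_query)      # 219
print(total_lookups)  # 2190000000
print(min_hits)       # 154
```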
Hi there,
I would like to use kmtricks to use HowDeSBT, as this example suggests that there is a convenient wrapper using the newest index build. Is the search of kmtricks resp. HowDeSBT equivalent? Meaning that if I use kmtricks, the search timings and results are the same as if I would use the original HowDeSBT index/query.
Another question: How do I determine the Bloomfilter Size? In the example, kmtricks pipeline needs this as a command line argument. But I don't know how to choose an appropriate size for my data set.
Thanks in advance, Svenja