tlemane / kmtricks

modular k-mer count matrix and Bloom filter construction for large read collections
GNU Affero General Public License v3.0
76 stars 7 forks source link

Questions to kmtricks vs HowDeSBT #23

Open smehringer opened 1 year ago

smehringer commented 1 year ago

Hi there,

I would like to use kmtricks, to use HowDeSBT as this example suggests that there is a convenient wrapper using the newest index build. Is the search of kmtricks resp. HowDeSBT equivalent? Meaning that if I use kmtricks, the search timings and results are the same as if I would use the original HowDeSBT index/query.

Another question: How do I determine the Bloomfilter Size? in the example kmtricks pipeline needs this as a command line argument. But I don't how to choose an appropriate size for my data set.

Thanks in advance, Svenja

pierrepeterlongo commented 1 year ago

Hi Svenja,

Is the search of kmtricks resp. HowDeSBT equivalent? Meaning that if I use kmtricks, the search timings and results are the same as if I would use the original HowDeSBT index/query.

Yes. We use a modified version of HowDeSBT in which we have changed the hash function to fit the one used in kmtricks for constructing the bloom filters. This does not impact the query time.

Another question: How do I determine the Bloomfilter Size?

This question also applies to HowDeSBT alone. This depends on the acceptable false positive rate and on your available disk. You may look for instance to this site (with one unique hash function) https://hur.st/bloomfilter/?k=1 to know the relation between the number of elements in the filter, the size of the filter, false positive rate.

Good to know We have released a simplified version of kmtricks that does not use HowDeSBT anymore. This is called kmindex https://github.com/tlemane/kmindex. The bloom filter construction is twice faster, and the query time is reduced by several orders of magnitude. However, without HowDeSBT, the final index is bigger. On complex metagenomic data (Tara ocean) the index is 10% bigger. The output is the same as the one provided by kmtricks.

In both cases (kmtricks+howDeSBT or kmindex), it is possible to use the findere trick to decrease false positives.

I hope this helps.

Best, Pierre

tlemane commented 1 year ago

Hi Svenja,

Some additional info:

Is the search of kmtricks resp. HowDeSBT equivalent? Meaning that if I use kmtricks, the search timings and results are the same as if I would use the original HowDeSBT index/query.

Yes. We use a modified version of HowDeSBT in which we have changed the hash function to fit the one used in kmtricks for constructing the bloom filters. This does not impact the query time.

Another question: How do I determine the Bloomfilter Size?

This question also applies to HowDeSBT alone. This depends on the acceptable false positive rate and on your available disk. You may look for instance to this site (with one unique hash function) https://hur.st/bloomfilter/?k=1 to know the relation between the number of elements in the filter, the size of the filter, false positive rate.

To quickly estimate the number of elements in each filter (= number of distinct k-mers), you can use ntCard on each sample (https://github.com/bcgsc/ntCard). Then you can compute the right size according to the maximum number of distinct k-mers.

Good to know We have released a simplified version of kmtricks that does not use HowDeSBT anymore. This is called kmindex https://github.com/tlemane/kmindex. The bloom filter construction is twice faster, and the query time is reduced by several orders of magnitude. However, without HowDeSBT, the final index is bigger. On complex metagenomic data (Tara ocean) the index is 10% bigger. The output is the same as the one provided by kmtricks.

In both cases (kmtricks+howDeSBT or kmindex), it is possible to use the findere trick to decrease false positives.

You can read more about findere here: https://github.com/lrobidou/findere

With a description of your dataset, I will be better able to suggest the right pipeline. You can send me any useful information at teo[dot]lemane[at]proton[dot]me.

Téo

smehringer commented 1 year ago

Hi Pierre, hi Téo,

thanks a lot for your quick replies! This answers all of my questions.

We were on the right track then but wanted to make sure to have a fair comparison (without errors on our side using the tools). We are working on a similar data structure that supports AMQs and want to compare ourselves to you.

I will try running kmtricks and kmindex now and report back with any feedback that comes up.

Best, Svenja

EDIT: We plan to test the tools on RefSeq (all complete genomes) and part of the 40k RNA Seq Files from the most recent Mantis paper.

smehringer commented 1 year ago

I already stumbled over the first issue:

In the example one should build the index after kmtricks pipeline with

kmtricks index ...

But in version v1.2.1 the subcommand index does not exist (only [pipeline|dump|aggregate|infos]). The subcommand query also does not seem to exist.

The example is probably outdated. Should I work on a former version of kmtricks?

eseiler commented 1 year ago

But in version v1.2.1 the subcommand index does not exist (only [pipeline|dump|aggregate|infos]). The subcommand query also does not seem to exist.

I didn't set some options when building, I updated the binary.

Now it's:

kmtricks [pipeline|repart|superk|count|merge|format|filter|dump|aggregate|index|query|infos]
pierrepeterlongo commented 1 year ago

Exactly. We should make this building option (-w) clearer in the doc.

tlemane commented 1 year ago

EDIT: We plan to test the tools on RefSeq (all complete genomes) and part of the 40k RNA Seq Files from the most recent Mantis paper.

The use of kmtricks to generate indexes was originally intended for collections (hundreds or thousands) of large sequencing samples (like Tara metagenomes). Unfortunately, there is a known issue when the number of samples to index is very large, as in your case of genome indexing. I think you will encounter a problem related to the number of simultaneously opened files. I have a plan to fix this but haven't found the time to do it yet.

eseiler commented 1 year ago

The use of kmtricks to generate indexes was originally intended for collections (hundreds or thousands) of large sequencing samples (like Tara metagenomes). Unfortunately, there is a known issue when the number of samples to index is very large, as in your case of genome indexing. I think you will encounter a problem related to the number of simultaneously opened files. I have a plan to fix this but haven't found the time to do it yet.

Thank you for the heads-up! Luckily, I still have

$ulimit -Hn
1048576

from when I contacted our IT when trying to test some tools. So we should be able to work around that.

smehringer commented 1 year ago

Hi there,

so the kmtricks pipeline seems to have troubles. Data is ~100GB (uncompressed), 25'000 files, RefSeq genomes.

Build command: ``` # Build /project/archive-index-data/software/bin/kmtricks pipeline --file ${KM_DIR}/data.fof \ --run-dir ${KM_DIR}/kmtricks_index \ --kmer-size 32 \ --mode hash:bft:bin \ --hard-min 1 \ --soft-min 1 \ --share-min 1 \ --bloom-size 125488049 \ --bf-format howdesbt \ --threads 32 \ --cpr \ --skip-merge ```
kmtricks info log: ``` [2023-01-20 08:41:40.890] [info] Run with Kmer<64> - __uint128_t implementation [2023-01-20 08:41:41.331] [info] Compute configuration... [2023-01-20 08:41:41.331] [info] 25321 samples found (25321 read files). [2023-01-20 08:49:45.818] [info] Use 139 partitions. [2023-01-20 08:49:45.855] [info] Compute minimizer repartition... Compute SuperK [> ] [00m:00s] Count partitions [> ] [00:00s] terminate called after throwing an instance of 'std::runtime_error' terminate called recursively terminate called recursively terminate called recursively terminate called recursively what(): Unable to open /project/archive-index-data/smehringer//kmtricks_bench/kmtricks_index/superkmers/D6/skp.70 terminate called recursively terminate called recursively terminate called recursively terminate called recursively terminate called recursively [2023-01-20 09:56:29.578] [error] Killed after receive Aborted:SIGABRT(6) signal. Demangled backtrace dumped at ./kmtricks_backtrace.log. If the problem persists, please open an issue with the return of 'kmtricks infos' and the content of ./kmtricks_backtrace.log terminate called recursively terminate called recursively [2023-01-20 09:56:29.578] [error] Killed after receive Aborted:SIGABRT(6) signal. Demangled backtrace dumped at ./kmtricks_backtrace.log. If the problem persists, please open an issue with the return of 'kmtricks infos' and the content of ./kmtricks_backtrace.log [2023-01-20 09:56:29.578] [error] Killed after receive Aborted:SIGABRT(6) signal. Demangled backtrace dumped at ./kmtricks_backtrace.log. If the problem persists, please open an issue with the return of 'kmtricks infos' and the content of ./kmtricks_backtrace.log [2023-01-20 09:56:29.578] [error] Killed after receive Aborted:SIGABRT(6) signal. Demangled backtrace dumped at ./kmtricks_backtrace.log. If the problem persists, please open an issue with the return of 'kmtricks infos' and the content of ./kmtricks_backtrace.log [2023-01-20 09:56:29.578] [error] Killed after receive Aborted:SIGABRT(6) signal. Demangled backtrace dumped at ./kmtricks_backtrace.log. If the problem persists, please open an issue with the return of 'kmtricks infos' and the content of ./kmtricks_backtrace.log [2023-01-20 09:56:29.578] [error] Killed after receive Aborted:SIGABRT(6) signal. Demangled backtrace dumped at ./kmtricks_backtrace.log. If the problem persists, please open an issue with the return of 'kmtricks infos' and the content of ./kmtricks_backtrace.log [2023-01-20 09:56:29.578] [error] Killed after receive Aborted:SIGABRT(6) signal. Demangled backtrace dumped at ./kmtricks_backtrace.log. If the problem persists, please open an issue with the return of 'kmtricks infos' and the content of ./kmtricks_backtrace.log [2023-01-20 09:56:29.578] [error] Killed after receive Aborted:SIGABRT(6) signal. Demangled backtrace dumped at ./kmtricks_backtrace.log. If the problem persists, please open an issue with the return of 'kmtricks infos' and the content of ./kmtricks_backtrace.log Command exited with non-zero status 1 Command being timed: "./kmtricks_build.sh" User time (seconds): 4460.10 System time (seconds): 29.69 Percent of CPU this job got: 99% Elapsed (wall clock) time (h:mm:ss or m:ss): 1:14:51 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 55311744 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 6118703 Voluntary context switches: 12882 Involuntary context switches: 30427 Swaps: 0 File system inputs: 312 File system outputs: 42128 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 1 ```
kmtricks backtrace: ``` Backtrace: 1 0xa630202 0xa8297f3 0xa62fb24 0x420d335 0x41c4e06 0x98ec1a7 0x98ec858 0x98edd79 0x49117e10 0x4c833811 0x4c9ea312 0x42df0b13 0x4225dc14 0xa4ec6415 0xa4fef716 0x4253b1 ```

Any ideas? File limit should be fine. We have 1TB RAM at our expense and max resident size was only ~55 GB (see info) so that's not the problem.

Side question: Whats the difference between hash:bft:bin and hash:bf:bin? As the latter seems to be required by kmindex but the example I'm following recommended to use the former.

pierrepeterlongo commented 1 year ago

Hi

I think Téo will confirm, but my guess is that the issue comes from the file limit. You have 139 partitions * 25000 files. That is 3475000, which is higher than your (already high) ulimit.

About the difference between hash:bft:bin and hash:bf:bin (again wait for a formal confirmation by Téo).

tlemane commented 1 year ago

Hi,

The error seems to be related to the number of opened files. However, the first step (superk) should work with your configuration.

Recently I got some feedback from users who tried kmtricks to index genomes (tens of thousands of samples), leading to the identification of some issues in such a case:

  1. Max opened files. During the superk step, an approximation of the number of opened files is $t(1+p)$ where $p$ is the number of partitions and $t$ the number of threads (for you the problem occurs here, which is not expected with your configuration). A similar problem may occur during the merge step which opens about $t \times n$ files, where $n$ is the number of genomes. For the moment, the only workaround is to reduce the number of threads.
  2. Overestimation of the number of partitions. This does not impact the results but is problematic with respect to point 1. You can set the number of partitions with --nb-partitions.
  3. Unexpected memory usage. For a RefSeq dataset, the peak should never reach 55GB. However, this problem is probably not blocking for you.

I definitely have to fix that. Unfortunately, I don't know when I can do it.

In the meantime, I see two workarounds:

  1. Choose $p$ and $t$ values that work on your machine. Obviously not satisfactory.
  2. Build several sub-indexes and register them in a global index in kmindex, at the cost of an overhead at query time.

Sorry for the inconvenience.

Teo

smehringer commented 1 year ago

Hi there,

thanks for the response. Scaling down the number of threads from 32 to 16 worked for now.

What does

query "[name]40049279" contains no searchable smers

mean? (Query length is 250, kmer size 32)

Is the query not searchable at all?

pierrepeterlongo commented 1 year ago

Thats a strange behavior. Have you checked this particular query 40049279 ? Does it contain non ACGT characters?

smehringer commented 1 year ago

Sorry, my fault I think. I gave kmtricks a FASTQ file instead of FASTA (I noticed that only every 4th query did not have problems). Runnning it with FASTA again but it's taking quite some time. After about 2h I stopped the command because I noticed I haven't provided the threads option (ops), but it seems that 80 threads is the default. In htop it seemed that the whole time kmtricks was only using a single thread though.

Rerunning now with:

kmtricks query --run-dir ${KM_DIR}/kmtricks_index --query ${QUERY_FILE_FASTA} --threshold 0.7 --no-detail --threads 32 > ${KM_DIR}/kmtricks.result
[2023-01-27 12:05:30.478] [info] Run with Kmer<64> - __uint128_t implementation

Input is a 2.8 GB FASTA file with 10M queries of length 250.

Can you make an assumption on the expected runtime?

EDIT: Already an hour now with the above command, htop shows only a single thread being used and no output has been written yet. RAM usage is constantly at ~50G. Index size (full kmtrick_index directory) on disk is ~300G.