tlemane / kmtricks

modular k-mer count matrix and Bloom filter construction for large read collections
GNU Affero General Public License v3.0

kmtricks

kmtricks is a modular tool suite for counting kmers, and constructing Bloom filters or kmer matrices, for large collections of sequencing data.

Citation

Lemane, T., Medvedev, P., Chikhi, R., & Peterlongo, P. (2022). kmtricks: Efficient and flexible construction of Bloom filters for large sequencing data collections. Bioinformatics Advances.

Rationale

kmtricks is optimized for the analysis of multiple FASTA/FASTQ files (gzipped or not). It features:

Note: for counting kmers from a single file, kmtricks works but is slightly slower than a traditional k-mer counter (e.g. KMC). It is really optimized for merging count information across multiple samples, which traditional k-mer counters cannot do.

Overview

Input: a set of read sets in FASTA or FASTQ format, gzipped or not.

Final output is either:

Installation and usage

Instructions for installation and usage are provided in the wiki.

Limitations

kmtricks needs disk space to run. Disk usage varies with the data, the parameters and the output format. Based on our observations, the required space ranges from 20% of the total input size (gzipped) to 100% of the total input size (including outputs).
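The rule of thumb above can be turned into a quick pre-flight check before launching a run. The following is a minimal sketch and not part of kmtricks: the function names estimate_disk_range and has_enough_space are hypothetical, and it assumes the 20% and 100% bounds apply to the on-disk size of the input files as they are stored (gzipped or not).

```python
import os
import shutil
from typing import Iterable, Tuple


def estimate_disk_range(paths: Iterable[str],
                        low_pct: int = 20,
                        high_pct: int = 100) -> Tuple[int, int]:
    """Return (low, high) estimated disk usage in bytes.

    The default bounds (20% and 100% of the total input size) come from
    the observation stated in this README; they are a rough guide only.
    """
    total = sum(os.path.getsize(p) for p in paths)
    # Integer arithmetic avoids floating-point rounding on large sizes.
    return total * low_pct // 100, total * high_pct // 100


def has_enough_space(run_dir: str, paths: Iterable[str]) -> bool:
    """Check that the free space on the run_dir volume covers the upper estimate."""
    _, upper = estimate_disk_range(paths)
    return shutil.disk_usage(run_dir).free >= upper
```

Checking against the upper bound is the conservative choice, since actual usage depends on the data, the parameters and the output format.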

Reporting an issue

If you encounter a problem, please open an issue with a description of your run and the output of kmtricks infos. If you encounter a critical error such as a segmentation fault, kmtricks automatically dumps a file named kmtricks_backtrace.log in your current directory. This file is hard to read in release mode. If you can, compile kmtricks in debug mode, run it again and attach the contents of this file to your issue. If you cannot compile kmtricks directly on your system, the conda package provides a kmtricks-debug binary for this case.
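Gathering this information can be scripted. The sketch below is only an illustration and not something shipped with kmtricks: the helper name collect_issue_report is hypothetical. It assumes that kmtricks infos can be run with no further arguments and that kmtricks_backtrace.log, if present, sits in the given working directory.

```python
import shutil
import subprocess
from pathlib import Path


def collect_issue_report(workdir: str = ".", binary: str = "kmtricks") -> str:
    """Build a text block with the `infos` output and the backtrace log, if any."""
    sections = []

    # Output of `<binary> infos`, or a note if the binary is not installed.
    exe = shutil.which(binary)
    if exe is None:
        sections.append(f"{binary} infos: binary not found on PATH")
    else:
        result = subprocess.run([exe, "infos"], capture_output=True, text=True)
        sections.append(f"{binary} infos:\n{result.stdout}{result.stderr}")

    # Backtrace file dumped by kmtricks after a critical error.
    log = Path(workdir) / "kmtricks_backtrace.log"
    if log.is_file():
        sections.append(f"{log.name}:\n{log.read_text(errors='replace')}")
    else:
        sections.append(f"{log.name}: not found in {workdir}")

    return "\n\n".join(sections)
```

Passing binary="kmtricks-debug" would collect the same information from the debug binary mentioned above, provided it accepts the same infos subcommand.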

Reference

T. Lemane, P. Medvedev, R. Chikhi and P. Peterlongo, "kmtricks: Efficient and flexible construction of Bloom filters for large sequencing data collections." Bioinformatics Advances, 2022, doi:10.1093/bioadv/vbac029.

@article{kmtricks,
    author = {Lemane, Téo and Medvedev, Paul and Chikhi, Rayan and Peterlongo, Pierre},
    title = "{kmtricks: Efficient and flexible construction of Bloom filters for large sequencing data collections}",
    journal = {Bioinformatics Advances},
    year = {2022},
    doi = {10.1093/bioadv/vbac029},
    url = {https://doi.org/10.1093/bioadv/vbac029},
}

Contacts

Téo Lemane: teo[dot]lemane[at]proton[dot]me
Rayan Chikhi: rayan[dot]chikhi[at]pasteur[dot]fr
Pierre Peterlongo: pierre[dot]peterlongo[at]inria[dot]fr