refresh-bio / KMC

Fast and frugal disk based k-mer counter

v3.2.4 stalls on very large dataset but v3.2.1 does not #238

Open peter-kanvas opened 3 months ago

peter-kanvas commented 3 months ago

I have a workflow which uses kmc to count all k-mers in an extremely large dataset of about 250,000 fasta files. The workflow was originally built with v3.2.1 of kmc, but it stalled when I updated to v3.2.4. Unfortunately, it doesn't exit or report an error. Here are the details I can provide:

KMC call:

kmc -fm -ci0 -cx100000000000 -t94 -k75 -m745 @reference_list database databse_dir

Result:

The program spends some time printing * characters, and then it prints Stage 1: 0% before stalling. There are 511 bin files in the workdir. Htop shows no processor activity, but the commands are still listed.

Before changing versions, I spent time trying to make sure that none of the fasta.gz files were corrupted.
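
For anyone who wants to repeat that kind of check, here is a minimal sketch of one way to do it, not necessarily how it was done for this report. It assumes the list file contains one path per line, and the helper names `is_gzip_intact` and `find_corrupt` are made up for illustration. It only confirms that each gzip stream decompresses to the end; it does not validate the FASTA content.

```python
import gzip
import zlib
from pathlib import Path


def is_gzip_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream at `path` decompresses cleanly.

    Hypothetical helper: a missing or unreadable file also counts as a failure.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end so truncation and CRC errors are detected.
            while handle.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # BadGzipFile is a subclass of OSError; truncation raises EOFError.
        return False


def find_corrupt(list_file):
    """Return the paths named in `list_file` that fail the integrity check.

    Assumes one path per line; blank lines are skipped.
    """
    lines = Path(list_file).read_text().splitlines()
    paths = [line.strip() for line in lines]
    return [p for p in paths if p and not is_gzip_intact(p)]
```

Calling `find_corrupt("reference_list")` would return the entries that could not be read to the end, under the assumption that `reference_list` is the same one-path-per-line file passed to kmc with `@`.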

marekkokot commented 3 months ago

Hello,

this sounds bad. Is this data anyhow downloadable, such that I could try to reproduce this bug?

Some ideas you could try to narrow this down:

I would really like to fix it because it seems to be quite a serious bug, but without reproducing this, it may be really challenging.

peter-kanvas commented 3 months ago

The data is publicly available. It consists of all the genomes I could collect from the GTDB database via NCBI. I'm attaching two lists. One contains the FTP links I used to download all the genomes; they may or may not still be valid. The other is the subset of genomes I was using when I encountered the error. You'll need about 351 GB of space to download all the genomes, and the final database ends up being about 4 TB. I'm working on an AWS EC2 instance (r5a.24xlarge) running AWS Linux 2023. KMC was installed using mamba, and the call was made from within a snakemake pipeline which I cannot share.

I've already moved past the problem and have to get on with the downstream analysis. I'll try the changes you suggested the next time I run this pipeline (likely in a few weeks).

reference_genome_list_kan002_v3.txt.gz gtdb_in_genbank_ftp_links.txt.gz