refresh-bio / KMC

Fast and frugal disk based k-mer counter

munmap_chunk(): invalid pointer on ubam file - KMC crashes on BAM with long reads #230

Open tbenavi1 opened 4 months ago

tbenavi1 commented 4 months ago

Hello,

I am running the following KMC command:

kmc -k31 -t42 -m50 -sm -ci1 -cs100000000 -fbam data.u.bam data tmp

and I receive the following error at the end of stage 1:

Stage 1: 100%
munmap_chunk(): invalid pointer

I was wondering if you knew how to fix this? Is there an email where I can send the example file that causes this issue? I don't want to share the data publicly. Thank you.

marekkokot commented 4 months ago

This bug is a little nasty because it is a consequence of an assumption we made when developing the first versions of KMC, namely that reads are short. Later, we added support for long reads, but we have not yet added it for the BAM file format. Because this bug is quite complex, I don't know how quickly we can fix it. So, for now, the best option is to convert the BAM files to FASTQ with samtools and use the result as KMC input.
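A minimal sketch of that workaround, assuming samtools and kmc are both on PATH. The function name bam_to_fastq_then_kmc and the name of the intermediate file are made up for illustration; the KMC options are the ones from the failing command, with -fbam swapped for -fq.

```shell
# Hypothetical helper: convert a BAM to FASTQ, then count k-mers from the FASTQ.
bam_to_fastq_then_kmc() {
    bam=$1       # input BAM, e.g. data.u.bam
    out=$2       # KMC output database prefix, e.g. data
    workdir=$3   # existing working directory for KMC, e.g. tmp
    fastq="${bam%.bam}.fastq"   # illustrative name for the intermediate file

    # With no output-file options, samtools fastq writes the reads to stdout.
    samtools fastq -@ 42 "$bam" > "$fastq" || return 1

    # Same options as the failing run, with -fq (FASTQ input) instead of -fbam.
    kmc -k31 -t42 -m50 -sm -ci1 -cs100000000 -fq "$fastq" "$out" "$workdir"
}
```

It would be called as bam_to_fastq_then_kmc data.u.bam data tmp. The intermediate FASTQ stays on disk until it is removed by hand.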

marekkokot commented 4 months ago

For future me:

It seems there is a buffer overflow in skipSingleBGZFBlock.

tbenavi1 commented 2 months ago

Hello, I have a follow-up issue/question. We are trying to save space on our cluster, so if possible I would prefer not to save the output when converting BAM files to FASTA/FASTQ. So I tried to run KMC with bash process substitution. For example:

kmc -k31 -t42 -m50 -sm -ci1 -cs100000000 -fm <(samtools fasta -@ 42 file.ubam) db tmp

However, I get the error:

Error: Error: /dev/fd/63 is not a file

which I believe comes from https://github.com/refresh-bio/KMC/blob/65bff733bc6487e33f04ff134da50e6b7cb3031f/kmc_core/binary_reader.h#L352

Is there any way to update KMC to allow it to take process substitution as input? Thanks for any information.
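The message is consistent with the reader accepting only regular files, although the exact check in binary_reader.h is not reproduced here. Independent of KMC, the shell's own file tests show how a regular file differs from a pipe-like path. This is only a sketch: describe_path is a hypothetical helper, and a named FIFO stands in for the /dev/fd/63 path that process substitution produces.

```shell
# Hypothetical helper: report what kind of filesystem object a path is.
describe_path() {
    if [ -f "$1" ]; then
        echo "regular file"
    elif [ -p "$1" ]; then
        echo "pipe"
    else
        echo "other"
    fi
}

demo_dir=$(mktemp -d)
printf 'ACGT\n' > "$demo_dir/reads.txt"   # an ordinary file
mkfifo "$demo_dir/reads.pipe"             # a named pipe (FIFO)

describe_path "$demo_dir/reads.txt"    # prints: regular file
describe_path "$demo_dir/reads.pipe"   # prints: pipe

rm -r "$demo_dir"
```

In bash on Linux, describe_path <(echo hi) is expected to report a pipe as well, which would explain why a program that insists on a regular file rejects it.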

marekkokot commented 2 months ago

This is a little more complex. KMC reads input files twice. The first time, it reads only a very small portion of the file to make adjustments for better balancing. After this, the file is closed and reopened for the real processing. This makes KMC unsuitable for streaming/pipe processing :( We know this is quite a limitation, and we will do our best to make KMC work in pipe mode in the future.
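Given that two-pass behaviour, one possible stopgap is to spool the stream into a temporary file that is deleted as soon as the consumer exits, so the converted reads occupy disk only for the duration of the run. The sketch below is not a KMC feature: with_spooled_input is a hypothetical helper, and the '{}' placeholder is a convention chosen here for illustration. The spool file lands wherever mktemp puts it, so that location needs enough free space.

```shell
# Hypothetical helper: save stdin to a temp file, run a command with every
# '{}' argument replaced by that file's path, then delete the file.
with_spooled_input() {
    spool=$(mktemp) || return 1
    cat > "$spool" || { rm -f "$spool"; return 1; }

    # Rotate through the arguments once, swapping '{}' for the spool path.
    n=$#
    while [ "$n" -gt 0 ]; do
        arg=$1
        shift
        [ "$arg" = "{}" ] && arg=$spool
        set -- "$@" "$arg"
        n=$((n - 1))
    done

    "$@"            # the command can open the spool file as often as it likes
    status=$?
    rm -f "$spool"
    return "$status"
}
```

With the command from the question, usage might look like: samtools fasta -@ 42 file.ubam | with_spooled_input kmc -k31 -t42 -m50 -sm -ci1 -cs100000000 -fm '{}' db tmp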

We have an unstable branch here: https://github.com/refresh-bio/KMC/tree/experimental/stbm. In this branch there is a parameter -sss; if you set it to -sssmin_hash, it should work. For example, this seems to work:

bin/kmc -sssmin_hash -k27 <(cat in.fq) o .

Keep in mind that this branch is not production-ready. We use it for testing and experiments, and it may disappear at some point. I'm also not sure about its performance, etc. I think it should just work, so you may try it. I am not sure if the strict memory mode (the -sm that you use) works fine on this branch, but 50 GB (-m50) should be fine without this parameter. If you spot any issues, let me know, although I am not sure when we will fix them. Let me know if you try it :)

Edit: Also, on this branch, kmc_tools may not work if you use -sssmin_hash. I don't remember the details now.