haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License

Assertion `num_minimizers <= static_cast<size_t>(INT_MAX)' failed #131

Open minjeongjj opened 1 year ago

minjeongjj commented 1 year ago

Hello,

I want to run Chromap on my own genome file, but it crashed with a core dump.

Here are the command and the log file.

Command:
$chromap -i -r Combined_pseudohap.phased.filtered.0.arcs.fasta -o chromap.index -t 100 >chromap.index.log 2>chromap.index.log2

Log file:
Build index for the reference.
Kmer length: 17, window size: 7
Reference file: Combined_pseudohap.phased.filtered.0.arcs.fasta
Output file: chromap.index
Loaded all sequences successfully in 156.47s, number of sequences: 41577, number of bases: 19811410511.
Collecting minimizers.
Collected 4958576388 minimizers.
Sorting minimizers.
Sorted all minimizers.
chromap: src/index.cc:33: void chromap::Index::Construct(uint32_t, const chromap::SequenceBatch&): Assertion `num_minimizers <= static_cast<size_t>(INT_MAX)' failed.

Do you have any suggestions for how to resolve this?

Best wishes,

haowenz commented 1 year ago

What is your reference genome? Why are there so many sequences, and why is the total length so long? It seems that the genome is too large for Chromap to handle.

HMPNK commented 1 year ago

Same problem here with a 9Gb genome. What is the limit of Chromap? Could some parameters be changed to improve this?

PS: Had no problem with a 5GB genome before ...

mourisl commented 1 year ago

@HMPNK What is the longest chromosome of the 9GB genome?

HMPNK commented 1 year ago

@mourisl
Total: 8834612447
Count: 2159
Average: 4091992.80
Median: 73976
N00: 123690798 1
N10: 78885008 10
N20: 58174874 23
N30: 48773516 40
N40: 37269988 61
N50: 29511668 87
N60: 23255327 122
N70: 17203678 165
N80: 11905554 227
N90: 6556291 320
N100: 4315 2159

haowenz commented 1 year ago

Did you get the same error message? Your genome is large, and I guess it has more than 2^32 - 1 minimizers. If that is the case, supporting very large genomes will require some code changes.

HMPNK commented 1 year ago

The error message was slightly different:

chromap: src/index.cc:33: void chromap::Index::Construct(uint32_t, const chromap::SequenceBatch&): Assertion `num_minimizers <= static_cast<size_t>(0x7fffffff)' failed.

It collected 3,350,432,716 minimizers, which is less than the maximum of 2^32 - 1.

haowenz commented 1 year ago

I just checked. The max number of minimizers currently supported by Chromap is 2^31 - 1 instead of 2^32 - 1. So it would require some code change before Chromap can support large genomes like what you have.
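To make the limit concrete, here is a small Python sketch that compares the minimizer counts reported so far in this thread against 2^31 - 1 (the `INT_MAX` / `0x7fffffff` in the assertion) and 2^32 - 1. The function name `fits_in_index` and the dictionary labels are illustrative and are not part of Chromap.

```python
INT_MAX = 2**31 - 1      # 2147483647, the bound checked by the assertion
UINT32_MAX = 2**32 - 1   # 4294967295, the bound first guessed above

# Minimizer counts copied from the logs posted earlier in this thread.
reported_counts = {
    "minjeongjj (19.8 Gb reference)": 4_958_576_388,
    "HMPNK (8.8 Gb reference)": 3_350_432_716,
}


def fits_in_index(num_minimizers, limit=INT_MAX):
    """Return True if the count would pass a `<= limit` check (hypothetical helper)."""
    return 0 < num_minimizers <= limit


for label, count in reported_counts.items():
    print(
        f"{label}: {count:,} minimizers, "
        f"within 2^31-1: {fits_in_index(count)}, "
        f"within 2^32-1: {fits_in_index(count, UINT32_MAX)}"
    )
```

Both counts exceed 2^31 - 1, which is why the assertion fires even for the run that stays below 2^32 - 1.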

HMPNK commented 1 year ago

Could you provide those changes? Are there other possibilities, such as changing the k-mer size and window size?

HMPNK commented 1 year ago

I just did a test, and using "-w 13" worked. Increasing -w effectively reduces the number of minimizers. But I guess increasing "-w" will reduce mapping sensitivity? What do you think?

mourisl commented 1 year ago

> I just did a test, and using "-w 13" worked. Increasing -w effectively reduces the number of minimizers. But I guess increasing "-w" will reduce mapping sensitivity? What do you think?

What is your read length? If your reads are long, increasing w probably won't affect accuracy much.

HMPNK commented 1 year ago

It is 2 × 150 bp.

mourisl commented 1 year ago

I think 150 bp reads should be fine with "-w 13". Since your genome is large, you can increase "-k" a little to make each minimizer more likely to be unique in the genome, maybe -k 23 -w 17. Then you will have 3 non-overlapping windows in which to locate seeds. The default parameters were selected for 50 bp scATAC-seq data.

@haowenz Is this reasonable?
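The estimate of three non-overlapping windows can be reproduced with a short Python sketch. It assumes the usual minimizer definition, in which one window of w consecutive k-mers spans w + k - 1 bases; whether Chromap counts windows in exactly this way is an assumption, and the function names are illustrative.

```python
def window_span(k, w):
    """Bases covered by one window of w consecutive k-mers (assumed definition)."""
    return w + k - 1


def non_overlapping_windows(read_length, k, w):
    """How many such windows fit end to end in a read of the given length."""
    return read_length // window_span(k, w)


# (read length, k, w) combinations mentioned in this thread.
for read_length, k, w in [(150, 23, 17), (150, 17, 13), (50, 17, 7)]:
    print(
        f"read={read_length} bp, k={k}, w={w}: "
        f"span={window_span(k, w)} bp, "
        f"windows={non_overlapping_windows(read_length, k, w)}"
    )
```

Under this assumption, -k 23 -w 17 gives a 39 bp span and three full windows in a 150 bp read, while the defaults (-k 17 -w 7) give two full windows in a 50 bp read.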

haowenz commented 1 year ago

The fragment size can still be short, though. Currently, increasing w is probably the only way to use a large genome. It may affect sensitivity, but probably not much, as you only increase it by 3 and the k-mer size doesn't change. In the long term, we should support a larger number of minimizers.

suragnair commented 1 year ago

I'm getting the same issue with the axolotl genome, which is even bigger, at around 27G. Do you think it will be possible to address this any time soon?

Any suggestions for the -k and -w parameters? I have bulk ATAC-seq data with 50 bp R1 and 60 bp R2 reads. Setting -w 13 still fails.

haowenz commented 1 year ago

You may try keeping the k-mer length at 17 (-k 17) and increasing the window size to 13 (-w 13) or even larger to see if it works.
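As a rough guide for where to start, the following Python sketch estimates the smallest window size that keeps the minimizer count under 2^31 - 1. It relies on the textbook approximation that minimizers of random sequence have a density of about 2 / (w + 1) per base. This is an assumption about typical behavior, not something taken from the Chromap source; real genomes can deviate from it, so the result is only a starting point for trial runs. The function names are illustrative.

```python
INT_MAX = 2**31 - 1  # limit enforced by the assertion in src/index.cc


def estimated_minimizers(num_bases, w):
    """Approximate minimizer count, assuming a density of 2 / (w + 1) per base."""
    return (2 * num_bases) // (w + 1)


def smallest_window(num_bases, limit=INT_MAX):
    """Smallest w whose estimated minimizer count is at or below the limit."""
    w = 1
    while estimated_minimizers(num_bases, w) > limit:
        w += 1
    return w


# Reference sizes (in bases) copied from the logs in this thread.
for label, num_bases in [
    ("19.8 Gb reference", 19_811_410_511),
    ("8.8 Gb reference", 8_786_216_834),
]:
    print(
        f"{label}: about {estimated_minimizers(num_bases, 7):,} minimizers at w=7, "
        f"estimated smallest usable w={smallest_window(num_bases)}"
    )
```

For the first log in this thread (19,811,410,511 bases, w = 7) the approximation gives about 4.95 billion minimizers, close to the 4,958,576,388 that Chromap reported.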

suragnair commented 1 year ago

-w 24 seems to be the smallest window size that works for this genome, and I'm getting an alignment rate of around 50% with that. That likely suggests that 24 is too large? Will probably need to benchmark against another aligner.

haowenz commented 1 year ago

> -w 24 seems to be the smallest window size that works for this genome, and I'm getting an alignment rate of around 50% with that. That likely suggests that 24 is too large? Will probably need to benchmark against another aligner.

That's possible. Can you post more numbers here? It is also possible that the genome is repetitive and lots of multi-mappings are filtered out.

suragnair commented 1 year ago

Number of reads: 1204478674.
Number of mapped reads: 824244368.
Number of uniquely mapped reads: 694616494.
Number of reads have multi-mappings: 129627874.
Number of candidates: 12259974313.
Number of mappings: 824244368.
Number of uni-mappings: 694616494.
Number of multi-mappings: 129627874.

So it is closer to 70% with multi-mappers. The number of uniquely mapped read pairs (lines in the output file) is 288M, so closer to 48%.
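The percentages quoted above follow directly from the posted counts. The Python sketch below redoes the arithmetic. It assumes that "Number of reads" counts both mates of each pair, so the number of pairs is half of it, and it uses 288,000,000 as a rounded stand-in for the "288M" output lines; the variable names are illustrative.

```python
# Counts copied from the Chromap summary posted above.
num_reads = 1_204_478_674
mapped_reads = 824_244_368
unique_reads = 694_616_494
multi_reads = 129_627_874

# "288M" lines in the output file; the exact figure was not posted.
unique_pairs = 288_000_000

# Assumption: reads are paired-end, so pairs = reads / 2.
num_pairs = num_reads / 2

mapped_rate = mapped_reads / num_reads
unique_rate = unique_reads / num_reads
multi_rate = multi_reads / num_reads
unique_pair_rate = unique_pairs / num_pairs

print(f"mapped reads (incl. multi-mappers): {mapped_rate:.1%}")
print(f"uniquely mapped reads:              {unique_rate:.1%}")
print(f"multi-mapped reads:                 {multi_rate:.1%}")
print(f"uniquely mapped read pairs:         {unique_pair_rate:.1%}")
```

The mapped-read rate comes out at roughly 68% and the unique-pair rate at roughly 48%, in line with the figures quoted above.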

I'll check bowtie2. The index is taking a long time to prepare.

haowenz commented 1 year ago

Thanks for the numbers. You may try bowtie2, but building an FM-index should be even slower.

Biscuite-wzy commented 1 year ago

I have the same problem with a genome of about 9 Gb when mapping Hi-C short reads; the log is below. How can I solve it?

Build index for the reference.
Kmer length: 17, window size: 7
Reference file: ref.fasta
Output file: ref.index
Loaded all sequences successfully in 194.21s, number of sequences: 5791, number of bases: 8786216834.
Collecting minimizers.
Collected 2217946008 minimizers.
Sorting minimizers.
Sorted all minimizers.
chromap: src/index.cc:33: void chromap::Index::Construct(uint32_t, const chromap::SequenceBatch&): Assertion `num_minimizers <= static_cast<size_t>(0x7fffffff)' failed

mourisl commented 1 year ago

@Biscuite-wzy You can increase k-mer length (-k) and window size (-w) values a bit to see whether it works. How long is the longest chromosome in your genome?

Biscuite-wzy commented 1 year ago

Hi, I increased the window size (-w) value and it works well.

lskfs commented 10 months ago

Is there any fix for this problem? I ran into the same issue with the axolotl genome and had to set a larger window size (-w 31) to build the genome index.

However, I suspect that such a large window size would not be suitable for me, as I have datasets from other species' genomes that were generated with the default parameters. So I just want to know if there is any update.

haowenz commented 9 months ago

Besides tuning the parameters, there is no easy fix on top of the current Chromap codebase to support very large genomes. We plan to see if this is possible in the near future.

afiyachida commented 1 week ago

Hi, was this issue fixed in the recent version of Chromap (0.2.6)? I have genomes of 18Gb and 26Gb.

Build index for the reference.
Kmer length: 17, window size: 7
Collected 3812725164 minimizers.
Sorted minimizers.
chromap: src/index.cc:178: void chromap::Index::Construct(uint32_t, const chromap::SequenceBatch&): Assertion `num_minimizers != 0 && num_minimizers <= 0x7fffffff' failed.
.command.sh: line 7: 239015 Aborted

What would be the best -k and -w for genomes of this size?

Thank you.

mourisl commented 1 week ago

This has not been fixed. I think the best way to handle this is to use a standard value for k, but increase the value for -w.

afiyachida commented 1 week ago

> This has been fixed. I think the best way to handle this is to use a standard value for k, but increase the value for -w.

Hi, thank you for the reply. I am currently using version 0.2.1, so I guess updating the version would help to resolve this. Can I use -w 20 for a genome of this size?

mourisl commented 1 week ago

Sorry, I made a typo. It has NOT been fixed.

afiyachida commented 1 week ago

> Sorry, I made a typo. It has NOT been fixed.

Oh! Then using 0.2.6 probably won't solve it. I will try increasing the value of -w and check.