Open minjeongjj opened 1 year ago
What is your reference genome? Why are there so many sequences, and why is the total length so long? It seems that the genome is too large for Chromap to handle.
Same problem here with a 9Gb genome. What is the limit of Chromap? Could some parameters be changed to improve this?
PS: Had no problem with a 5GB genome before ...
@HMPNK What is the longest chromosome of the 9GB genome?
@mourisl
Total: 8834612447
Count: 2159
Average: 4091992.80
Median: 73976
N00: 123690798 1
N10: 78885008 10
N20: 58174874 23
N30: 48773516 40
N40: 37269988 61
N50: 29511668 87
N60: 23255327 122
N70: 17203678 165
N80: 11905554 227
N90: 6556291 320
N100: 4315 2159
Did you get the same error message? Your genome is large, and I guess it has more than 2^32-1 minimizers. If this is the case, supporting such a large genome will require some code changes.
The error message was slightly different:
chromap: src/index.cc:33: void chromap::Index::Construct(uint32_t, const chromap::SequenceBatch&): Assertion `num_minimizers <= static_cast<size_t>(INT_MAX)' failed.
It collected 3,350,432,716 minimizers, which is less than the maximum of 2^32-1.
I just checked. The max number of minimizers currently supported by Chromap is 2^31 - 1 instead of 2^32 - 1. So it would require some code change before Chromap can support large genomes like what you have.
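To make the two limits concrete, here is a small Python sketch of the comparison. It only mirrors the `<= INT_MAX` check named in the assertion message; the function name `fits_in_index` is made up for illustration and is not part of Chromap.

```python
# Signed 32-bit limit (INT_MAX, as named in the assertion) vs. the unsigned limit.
INT_MAX = 2**31 - 1      # 2,147,483,647
UINT32_MAX = 2**32 - 1   # 4,294,967,295

def fits_in_index(num_minimizers: int) -> bool:
    """Hypothetical helper: True if the count passes a `<= INT_MAX` check."""
    return num_minimizers <= INT_MAX

collected = 3_350_432_716  # count reported above for the 9 Gb genome

print("below 2^32 - 1:", collected <= UINT32_MAX)
print("below 2^31 - 1:", fits_in_index(collected))
```

The reported count sits between the two limits, which is why it looks acceptable against 2^32 - 1 but still trips the assertion.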
Could you provide those changes? Are there other options, such as changing the k-mer size and window size?
I just did a test and using "-w13" worked. Increasing -w effectively reduces the number of minimizers. But I guess increasing "-w" will reduce the sensitivity of mapping? What do you think?
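For anyone wondering why a larger -w gives fewer minimizers, here is a toy Python sketch. It is not Chromap's implementation: Python's built-in `hash` stands in for the real hash function, strands are ignored, and the sequence is random, so treat the numbers as illustrative only. The comparison against 2/(w+1) is the textbook expected density for random sequences, included here as an assumption.

```python
import random

def count_minimizers(seq: str, k: int, w: int) -> int:
    """Count distinct minimizer positions: in every window of w consecutive
    k-mers, pick the k-mer with the smallest hash (leftmost on ties)."""
    hashes = [hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        picked.add(start + window.index(min(window)))
    return len(picked)

random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(20000))

for w in (7, 13, 24):
    n = count_minimizers(seq, k=17, w=w)
    # Columns: w, minimizers found, observed density, 2/(w+1) for comparison
    print(w, n, round(n / len(seq), 3), round(2 / (w + 1), 3))
```

The count drops as w grows because each window still contributes only one minimizer while covering more k-mers.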
What is your read length? If your reads are long, increasing -w probably won't affect the accuracy much.
It is 2 x 150 bp.
I think 150bp should be fine to handle "-w 13". Since your genome is large, you can increase "-k" a little bit to ensure each minimizer is unique enough on the genome, maybe -k 23 -w 17. Then you will have 3 non-overlapping windows to locate seeds. The default parameters were selected for 50bp scATAC-seq data.
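The "3 non-overlapping windows" figure can be reproduced with a little arithmetic. The sketch below assumes a window of w consecutive k-mers spans w + k - 1 bases; the helper names are made up for illustration, and the defaults (k=17, w=7) are taken from the index logs in this thread.

```python
def window_span(k: int, w: int) -> int:
    """Bases covered by one window of w consecutive k-mers (assumed definition)."""
    return w + k - 1

def non_overlapping_windows(read_len: int, k: int, w: int) -> int:
    """How many whole, non-overlapping windows fit in a read of read_len bases."""
    return read_len // window_span(k, w)

# Suggested parameters for 150 bp reads
print(non_overlapping_windows(150, k=23, w=17))
# Default parameters with 50 bp reads, for comparison
print(non_overlapping_windows(50, k=17, w=7))
```

With -k 23 -w 17 each window spans 39 bases, so a 150 bp read holds three of them.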
@haowenz Is this reasonable?
The fragment size can still be short though. Currently, increasing w is probably the only way to use a large genome. It may affect sensitivity, but probably not much, as you only increase it by 3 and the k-mer size doesn't change. In the long term, we should support a larger number of minimizers.
Getting the same issue for the axolotl genome, which is even bigger, around 27G. Do you think it will be possible to address this any time soon?
Any suggestions for -k and -w parameters? I have R1 50bp and R2 60bp bulk ATAC-seq. Setting -w 13 still fails.
You may try keeping the k-mer length at 17 (-k 17) and increasing the window size to 13 (-w 13) or even larger to see if it works.
-w 24 seems to be the smallest window size that works for this genome, and I'm getting an alignment rate of around 50% with that. That likely suggests that 24 is too large? Will probably need to benchmark against another aligner.
That's possible. Can you post more numbers here? It is also possible that the genome is repetitive and lots of multi-mappings are filtered out.
Number of reads: 1204478674. Number of mapped reads: 824244368. Number of uniquely mapped reads: 694616494. Number of reads have multi-mappings: 129627874. Number of candidates: 12259974313. Number of mappings: 824244368. Number of uni-mappings: 694616494. Number of multi-mappings: 129627874.
Closer to 70% with multi-mappers. The number of uniquely mapped read-pairs (lines in the output file) is 288M, so closer to 48%.
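For reference, the percentages can be recomputed from the posted counts. This sketch assumes "Number of reads" counts both mates, so the number of read pairs is half of it, and it uses the rounded 288M pair count quoted above.

```python
reads = 1_204_478_674          # "Number of reads" (assumed to count both mates)
mapped = 824_244_368           # "Number of mapped reads"
unique = 694_616_494           # "Number of uniquely mapped reads"
multi = 129_627_874            # "Number of reads have multi-mappings"
unique_pairs = 288_000_000     # approximate line count of the output file

# Sanity check: unique and multi-mapped reads add up to all mapped reads
assert unique + multi == mapped

mapped_rate = mapped / reads
unique_rate = unique / reads
pair_rate = unique_pairs / (reads / 2)

print(f"mapped (incl. multi-mappers): {mapped_rate:.1%}")
print(f"uniquely mapped reads:        {unique_rate:.1%}")
print(f"uniquely mapped read pairs:   {pair_rate:.1%}")
```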
I'll check bowtie2. The index is taking a long time to prepare.
Thanks for the numbers. You may try bowtie2, but building an FM-index should be even slower.
Same problem here with a genome of about 9Gb when mapping Hi-C short reads; the log is below. How can I solve it?
Build index for the reference.
Kmer length: 17, window size: 7
Reference file: ref.fasta
Output file: ref.index
Loaded all sequences successfully in 194.21s, number of sequences: 5791, number of bases: 8786216834.
Collecting minimizers.
Collected 2217946008 minimizers.
Sorting minimizers.
Sorted all minimizers.
chromap: src/index.cc:33: void chromap::Index::Construct(uint32_t, const chromap::SequenceBatch&): Assertion `num_minimizers <= static_cast<size_t>(INT_MAX)' failed.
@Biscuite-wzy You can increase k-mer length (-k) and window size (-w) values a bit to see whether it works. How long is the longest chromosome in your genome?
Hi, increasing the window size (-w) worked well.
Is there any fix for this problem? I ran into the same issue with the axolotl genome and had to set a larger window size (-w 31) to build the genome index.
However, I suspect that such a large window size would not be suitable for me, as I have datasets from other species that were processed with the default parameters. So I just want to know whether there is any update.
Besides tuning the parameters, there is no easy fix on top of the current Chromap codebase to support very large genomes. We plan to see if this is possible in the near future.
Hi, was this issue fixed in the recent version of Chromap (0.2.6)? I have genomes of 18Gb and 26Gb.
Build index for the reference.
Kmer length: 17, window size: 7
Collected 3812725164 minimizers.
Sorted minimizers.
chromap: src/index.cc:178: void chromap::Index::Construct(uint32_t, const chromap::SequenceBatch&): Assertion `num_minimizers != 0 && num_minimizers <= 0x7fffffff' failed.
.command.sh: line 7: 239015 Aborted
What would be the best -k and -w for genomes of this size?
Thank you.
This has been fixed. I think the best way to handle this is to use a standard value for k, but increase the value for -w.
Hi, thank you for the reply. I am currently using version 0.2.1, so I guess updating the version would help resolve this. Can I use -w 20 for a genome of this size?
Sorry, I made a typo..it has NOT been fixed..
Oh! Then using 0.2.6 probably won't solve it. I will try increasing the value of -w and check.
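As a rough way to pick a starting -w, here is a back-of-the-envelope Python sketch. It assumes the expected number of minimizers is about 2 * G / (w + 1) for a genome of G bases, which is the textbook density for random sequences and not something Chromap guarantees. The two index logs in this thread that report base counts with w=7 show roughly a quarter as many minimizers as bases, which is in line with that assumption. The helper names are made up for illustration.

```python
INT_MAX = 2**31 - 1  # limit named in the assertion message

def estimated_minimizers(genome_bases: int, w: int) -> int:
    """Assumed estimate: about 2 / (w + 1) minimizers per base."""
    return 2 * genome_bases // (w + 1)

def smallest_window(genome_bases: int, limit: int = INT_MAX) -> int:
    """Smallest w for which the estimate above stays at or below the limit."""
    # w + 1 >= ceil(2G / limit), using integer ceiling division
    return max(1, -(-2 * genome_bases // limit) - 1)

# Base counts taken from index logs posted in this thread
for bases in (8_786_216_834, 19_811_410_511):
    w = smallest_window(bases)
    print(bases, "-> try -w", w, "estimated minimizers:", estimated_minimizers(bases, w))
```

Treat the result as a lower bound for where to start, not a recommendation. Real genomes are not random sequences, so actual counts may differ from the estimate, and it is probably worth leaving some margin above the computed value.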
Hello,
I want to run Chromap on my genome file, but it crashed with a core dump.
Here are the command and the log file.
Command:
$chromap -i -r Combined_pseudohap.phased.filtered.0.arcs.fasta -o chromap.index -t 100 >chromap.index.log 2>chromap.index.log2
Log file:
Build index for the reference.
Kmer length: 17, window size: 7
Reference file: Combined_pseudohap.phased.filtered.0.arcs.fasta
Output file: chromap.index
Loaded all sequences successfully in 156.47s, number of sequences: 41577, number of bases: 19811410511.
Collecting minimizers.
Collected 4958576388 minimizers.
Sorting minimizers.
Sorted all minimizers.
chromap: src/index.cc:33: void chromap::Index::Construct(uint32_t, const chromap::SequenceBatch&): Assertion `num_minimizers <= static_cast<size_t>(INT_MAX)' failed.
Do you have any suggestions for resolving this?
Best wishes,