haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License

Wheat genome create index failed #9

Open fei0810 opened 3 years ago

fei0810 commented 3 years ago

Hi, I got the following error when creating the wheat genome index:

Build index for the reference.
Kmer length: 17, window size: 7
Reference file: 161010_Chinese_Spring_v1.0_pseudomolecules.fasta
Output file: wheat_chromap_chipseq.index
Loaded all sequences successfully in 59.10s, number of sequences: 22, number of bases: 14547261565.
Collected 3566471558 minimizers.
Sorted minimizers.
chromap: src/index.cc:134: void chromap::Index::Construct(uint32_t, const chromap::SequenceBatch&): Assertion `num_minimizers != 0 && num_minimizers <= INT_MAX' failed

Because the wheat reference genome is very large (~17 Gb), tools that speed up mapping are valuable for wheat. The most likely cause of the error is that the reference genome is too large. This problem is also common when running other software on wheat genomes. However, BWA builds its index without reporting errors.

Wheat genome download site: http://plants.ensembl.org/Triticum_aestivum/Info/Index

haowenz commented 3 years ago

Can you tell us which downstream analyses or applications you are targeting? For now, Chromap is mainly optimized for chromatin profiling data, so most use cases are expected to be on human or mouse genomes.

fei0810 commented 3 years ago

ChIP-seq, ATAC-seq, and Hi-C experiments in wheat have now generated a lot of data as well, and epigenetic studies in wheat are also very important.

mourisl commented 3 years ago

One possible solution is to use larger -k and -w values. With a larger -w, fewer minimizers are created for the index. Maybe we can try something like -k 27 -w 14. With the default -k/-w, Chromap collected 3,566,471,558 minimizers, which is about 1.7 times INT_MAX. Maybe -w 14 could reduce the number below the threshold. Using a larger -k could make each minimizer more unique in this large genome, which could help in downstream alignment.

The alignment accuracy might be suboptimal depending on your read length, but the index building might work.
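
As a rough sanity check on these numbers, here is a minimal sketch in Python. It assumes the common approximation that, on random sequence, about 2/(w+1) of positions are picked as minimizers; Chromap's actual sampling on a repetitive genome like wheat may differ, so treat the estimates as ballpark figures only. The base count and minimizer count are taken from the log above; the helper names are made up for illustration.

```python
# INT_MAX is the limit checked by the failed assertion in src/index.cc.
INT_MAX = 2**31 - 1  # 2,147,483,647

# Values copied from the index-building log in this issue.
NUM_BASES = 14_547_261_565
OBSERVED_MINIMIZERS_W7 = 3_566_471_558


def expected_minimizers(num_bases: int, w: int) -> int:
    """Ballpark minimizer count, ASSUMING a density of about 2 / (w + 1).

    This is the usual random-sequence approximation, not Chromap's
    exact behaviour.
    """
    return num_bases * 2 // (w + 1)


def passes_assertion(num_minimizers: int) -> bool:
    """Mirror the condition from the assertion message in the log."""
    return num_minimizers != 0 and num_minimizers <= INT_MAX


ratio_to_int_max = OBSERVED_MINIMIZERS_W7 / INT_MAX
estimate_w7 = expected_minimizers(NUM_BASES, 7)
estimate_w14 = expected_minimizers(NUM_BASES, 14)

if __name__ == "__main__":
    print(f"observed (w=7) / INT_MAX: {ratio_to_int_max:.2f}")
    print(f"estimate for w=7:  {estimate_w7:,} passes={passes_assertion(estimate_w7)}")
    print(f"estimate for w=14: {estimate_w14:,} passes={passes_assertion(estimate_w14)}")
```

Under this assumption the estimate for -w 7 lands close to the observed count, and the estimate for -w 14 falls below INT_MAX, though without much headroom, so a slightly larger -w may be worth trying if -w 14 still trips the assertion.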

fei0810 commented 3 years ago

Thanks for the suggestion. I'll test it with a larger k value.