lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2

Different Run Times Contigs vs. Chromosomes #146

Open malonge opened 6 years ago

malonge commented 6 years ago

Hi,

I have been using minimap2 for many genome-to-genome alignments and I have noticed an odd pattern.

If I take a large genome like a human genome and align (paf output) the whole chromosomes against the GRCh38 reference, it takes a few days.

But if I take that same human genome and break it down into smaller contigs, breaking the chromosomes at gaps for example, the alignment of those contigs against the reference only takes about 5 minutes.
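For reference, breaking chromosomes at gaps can be sketched roughly as follows. This is only a minimal illustration, assuming a gap is any run of at least `min_gap` "N" characters; the function name `split_at_gaps`, the `min_gap` parameter, and the `_contigN` naming scheme are hypothetical and not part of minimap2.

```python
import re

def split_at_gaps(name, seq, min_gap=1):
    """Yield (contig_name, contig_seq) for each gap-free stretch of seq.

    Assumption: a "gap" is a run of at least min_gap N/n characters.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    start = 0   # start of the current gap-free stretch
    index = 0   # running contig counter, used only for naming
    for m in gap.finditer(seq):
        if m.start() > start:          # skip empty stretches (e.g. leading gap)
            index += 1
            yield f"{name}_contig{index}", seq[start:m.start()]
        start = m.end()
    if start < len(seq):               # trailing stretch after the last gap
        index += 1
        yield f"{name}_contig{index}", seq[start:]
```

With `min_gap=2`, a single stray "N" stays inside its contig, so only longer runs are treated as breakpoints.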

This is all with the k-mer size and window size both set to 19, everything else default.

So my question is: is there any reason you can think of why a chromosome-scale assembly would take much longer to map than the same assembly at contig level?

Thanks

lh3 commented 6 years ago

Minimap2 has the best performance when contigs are around 10 kb in length. However, a difference of days vs. 5 minutes is unexpected. What "human genome" are you aligning?

malonge commented 6 years ago

So the one I am looking at currently is a little different than I described.

First, I align these MHAP human genome contigs to the GRCh38 reference and that takes about 5 minutes.

Then I use those mappings to tile the contigs into chromosomes (with 100 bp of "N" padding between contigs). I take those tiled chromosomes and align them back to the reference, and that takes a few days.
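To make the tiling step concrete, here is a rough sketch of one way it could be done from PAF output. It is not necessarily the procedure used here: it assumes each contig is placed by its single longest alignment (PAF column 11, alignment block length), ordered by target start (column 8), and reverse-complemented when the strand (column 5) is "-". The function names `best_hits` and `tile` are hypothetical.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def revcomp(seq):
    """Reverse-complement a nucleotide string."""
    return seq.translate(COMPLEMENT)[::-1]

def best_hits(paf_lines):
    """Return {contig: (target, target_start, strand)} from PAF lines.

    Assumption: a contig is placed by its longest alignment block.
    """
    best = {}
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        contig, strand, target = f[0], f[4], f[5]
        tstart, block_len = int(f[7]), int(f[10])
        if contig not in best or block_len > best[contig][3]:
            best[contig] = (target, tstart, strand, block_len)
    return {c: v[:3] for c, v in best.items()}

def tile(contig_seqs, hits, pad_len=100):
    """Return {target: tiled_sequence}, contigs joined by pad_len N's."""
    by_target = defaultdict(list)
    for contig, (target, tstart, strand) in hits.items():
        seq = contig_seqs[contig]
        if strand == "-":
            seq = revcomp(seq)
        by_target[target].append((tstart, seq))
    pad = "N" * pad_len
    # Sort by target start so contigs appear in reference order.
    return {t: pad.join(s for _, s in sorted(placed))
            for t, placed in by_target.items()}
```

Because overlapping or redundant contigs are simply concatenated, a tiled chromosome built this way can end up longer than the reference chromosome it was ordered against.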

For example, the tiled chromosome 1 is about 275 Mbp long and the reference chromosome 1 is about 250 Mbp long. So let's say that I am mapping 23 chromosomes (I think it's a female sample) against 24 reference chromosomes.

lh3 commented 6 years ago

That's weird. I have aligned chimp/macaque chromosomes to human and haven't observed such a significant slowdown. One thing to note is that aligning whole chromosomes takes a lot more memory. If you don't have enough memory, your system will spend most of its time swapping.

BTW, for CHM1, you should use the Falcon assembly. It is better.

malonge commented 6 years ago

Interesting, I will take a look at the memory usage. Thanks for the help.
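One possible way to check memory usage is to run the alignment from a small wrapper and read the peak resident set size afterwards. The sketch below assumes a Unix-like system (Python's `resource` module is not available on Windows); the helper name `peak_child_rss` is hypothetical. Note that `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS, and that it covers the largest child process this script has waited for so far.

```python
import resource
import subprocess
import sys

def peak_child_rss(cmd):
    """Run cmd and return (exit_code, ru_maxrss) for waited-for children.

    stdout is discarded here, so the command should write its own output
    file if the results are needed.
    """
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return result.returncode, usage.ru_maxrss

if __name__ == "__main__" and len(sys.argv) > 1:
    code, rss = peak_child_rss(sys.argv[1:])
    print(f"exit code {code}, peak RSS {rss} (KB on Linux, bytes on macOS)")
```

Saved as, say, `peakmem.py` (a placeholder name), it could be invoked as `python peakmem.py minimap2 -k 19 -w 19 -o out.paf ref.fa query.fa`, where the file names are also placeholders. If the reported peak is close to the machine's physical RAM, swapping becomes a plausible explanation for the slowdown.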

And thanks for the advice about the other CHM1 assembly.

soungalo commented 2 years ago

I am also experiencing very slow runs with chromosome-to-chromosome mapping. I see this is a rather old issue. Are there any updates on it? I don't see memory/swapping problems on my machine. Are there any suggestions other than breaking the chromosomes into contigs?