mummer4 / mummer

Mummer alignment tool
Artistic License 2.0

Nucmer issue: taking forever or alignment very fragmented #214

Open BioJules opened 2 months ago

BioJules commented 2 months ago

Hi people,

I hope this is the right place to ask my question. If not, then please let me know an alternative. :)

I am kind of new to the world of whole-genome alignments. I wanted to create a simple circos plot for my poster. However, I already failed to generate a pairwise genome alignment with nucmer. Both mammalian genomes have a size of ~2.4 Gbp. I use mummer3.9.4alpha. I tried different things:

1) nucmer default settings (and -G): 240 scaffolds (non-repeat-masked) to each Chromosome separately -- took forever or stopped without an error message

2) nucmer default settings (and -G): 34 largest scaffolds (non-repeat-masked) to each Chromosome separately -- took forever or stopped without an error message

3) nucmer default settings (and -G): largest scaffold (non-repeat-masked) to each Chromosome separately -- took forever

4) nucmer with default settings (and -G): 241 repeat-masked scaffolds to each Chromosome separately -- fast, BUT only short alignments are listed in the delta file. Could this be due to the masked repeats (i.e., runs of Ns) in the input?

5) I am playing with some nucmer settings right now, but I am not familiar with them, so it feels more like fishing in murky waters.

a) nucmer -t 24 -G --mum -b 1000 -l 50 with the largest 17 scaffolds of the query and one chromosome as reference --> no results (empty delta file), and qstat reported that the job was cancelled

b) nucmer -t 24 -G --mum -b 1000 -l 50 with the 241 repeat-masked scaffolds of the query and one chromosome as reference --> still a very fragmented alignment, which leads to a very scattered circos plot
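For what it's worth, a common starting point for large, diverged genome pairs is to align the unmasked sequences with stricter seed/cluster thresholds and then post-filter the delta file down to 1-to-1 alignments, which usually de-fragments the circos input. The sketch below uses real nucmer/delta-filter/show-coords options, but the file names and the specific threshold values (`-l 100 -c 500`, `-l 1000 -i 90`) are illustrative assumptions, not tuned recommendations:

```shell
# ref.fa / qry.fa are placeholder paths for the chromosome and scaffold FASTAs.
# -l raises the minimum exact-match (seed) length, -c the minimum cluster length,
# so repeat-driven micro-alignments are dropped before extension.
nucmer -t 24 -l 100 -c 500 --prefix ref_vs_qry ref.fa qry.fa

# Keep only 1-to-1 best alignments of at least 1 kbp and >=90% identity.
delta-filter -1 -l 1000 -i 90 ref_vs_qry.delta > ref_vs_qry.filtered.delta

# Tab-delimited coordinates, suitable as circos link input after reformatting.
show-coords -rclT ref_vs_qry.filtered.delta > ref_vs_qry.coords
```

The `-1` filter in particular tends to matter for mammalian pairs: without it, every repeat copy produces a cross-mapping link and the plot looks scattered even when the underlying synteny is clean.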

I hope that someone can help me here. I appreciate any recommendation. BioJules

Prunoideae commented 3 weeks ago

If your nucmer is running for too long, you can try setting a lower batch size, e.g. --batch 1073741824, which reads in only ~1 Gbp per batch.

I noticed a strange performance degradation when nucmer loads very large genomes (in my case, a metagenome of ~15 Gbp), and limiting the batch size effectively mitigates the problem, since only a portion of the query is loaded at a time.
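Note that --batch takes a size in bytes, so the value above is 1 GiB spelled out. A small sketch to make the arithmetic auditable (the nucmer invocation at the end is a hypothetical example, with placeholder file names, and is left commented out):

```shell
# 1 GiB = 1024^3 bytes; this is the value passed to --batch above.
ONE_GIB=$((1024 * 1024 * 1024))
echo "$ONE_GIB"   # prints 1073741824

# Hypothetical invocation: cap each nucmer batch at ~1 GiB of sequence.
# nucmer --batch "$ONE_GIB" -t 24 --prefix ref_vs_qry ref.fa qry.fa
```

Halving or quartering this value trades longer wall-clock time for a smaller peak memory footprint, which can also help when the job is killed by a scheduler's memory limit rather than genuinely hanging.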