lh3 / minigraph

Sequence-to-graph mapper and graph generator
https://lh3.github.io/minigraph
MIT License
396 stars 38 forks source link

[morecore] insufficient memory #7

Open fbemm opened 4 years ago

fbemm commented 4 years ago

Dear Heng,

I am trying to build a rgfa from two large genomes (>10Gb). They are both finished down to chromosomes. I get the following output:

[M::main::23.960*1.00] loaded the graph from "reference.fasta"
[M::mg_index::380.806*1.79] indexed the graph
[M::mg_opt_update::393.356*1.76] occ_weight=20, occ_max1=2182; 95 percentile: 7
[M::ggen_map::426.277*1.71] loaded file "query.fasta"
[morecore] insufficient memory

If I use contigs instead of chromosomes as query it works. Some chromosomes are >800Mb. Is that happening because minigraph reads all query bases in the graph construction mode?

I rerun the chromosome vs. chromosome attempt with --no-kalloc, as expected minigraph slows down but it does not break any more, at least until now.

Thx, Felix

fbemm commented 4 years ago

minimap2 shows the same behaviour when running chromosomes vs. chromosomes with one of the "asmX" presets. Just starts to consume memory after it reports on distinct minimizer. Same happens if I align two 600Mb sequences against each other.

pjm43 commented 4 years ago

Has there been any update on this issue? I'm using minimap2 for purge_dups which requires a self-self alignment - but I also get a core dump (see below) similar to that reported here. It's a large genome (12Gb) with large contigs:

minimap2 -I 12G -x asm5 -DP -t 128 $assembly.split $assembly.split

Slurm output:

[M::mm_idx_gen::159.610*1.42] collected minimizers [M::mm_idx_gen::178.178*3.12] sorted minimizers [M::main::178.178*3.12] loaded/built the index for 35023 target sequence(s) [M::mm_mapopt_update::185.626*3.04] mid_occ = 2283 [M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 35023 [M::mm_idx_stat::189.135*3.00] distinct minimizers: 251026274 (67.18% are singletons); average occurrences: 4.737; average spacing: 9.989 [morecore] insufficient memory

Thanks, Jeff

pjm43 commented 4 years ago

I should have noted in my question above that I am running this on 2Tb node and I have tried it with differing number of threads from 2 -128. They all result in a core dump. I'm running on minimap2 2.17-r941.

Jeff

ekg commented 3 years ago

If this is still an issue, you might explore using the mashmap-based edyeet and follow with seqwish and then smoothxg for the graph induction and postprocessing. The graph will have different properties than a minigraph one (it will include small variants and collapsed regions larger than the smoothing window size / local path depth). It might not be good for your application, but it may be worth considering if you are stuck.

If minigraph could read in approximate (without cigar) PAF from an external source then edyeet might be way to guide the progressive assembly in minigraph. That might require significant changes to minigraph.

edyeet started as a workaround for issues that I encountered when making whole genome alignments with minimap2 that appear to be very similar to what you're describing. It appears that long repetitive sequences can cause high memory consumption during alignment and very long alignment times. I saw this when working with all-to-all alignment of ~50mb set of sequences from a human satellite array. My impression is that the minimizer seed chaining DAG gets very large in these cases, but this could be incorrect as I have not profiled. The mashmap algorithm does not have these issues. In any case, I would be curious about how to mitigate them if @lh3 has any suggestions.

lh3 commented 3 years ago

Minigraph is slow here mostly due to the high k-mer occurrences. Minimap2 would be slow as well. The solution is to reduce the k-mer threshold for example by adding -f.1. For typical datasets, this effectively asks minigraph to use k-mers occurring less than 100 times. That said, minigraph wouldn't work well with satellites anyway. TandemMapper/Winnomap like strategies are needed.

Minigraph aims to map through long segdups. In my view, that is the point of pangenome graphs. Minimap2 is failing here. That is why I am experimenting new chaining algorithms in minigraph. I am not sure how mashmap2 works. In addition, I wouldn't recommend to use edlib for graph construction because edit-distance based alignment is not evolutionarily meaningful.

ekg commented 3 years ago

Minigraph aims to map through long segdups. In my view, that is the point of pangenome graphs. Minimap2 is failing here. That is why I am experimenting new chaining algorithms in minigraph.

This makes sense, but I think it should be evaluated with applications. I still don't feel I understand the tradeoffs.

I am curious if minigraph is able to distinguish between similar copies of a repeat. It would seem difficult to obtain an optimal alignment between segdups if the rate of divergence between copies of the CNV is low. If I align a longer copy of a CNV to a shorter copy, how does minigraph decide which specific allele was relatively inserted? Would the chaining run through until the end of the CNV, marking the last part of the longer CNV as a relative insertion? Or would it be able to find the specific copies of the CNV that are inserted?

I am not sure how mashmap2 works.

At a high level, it's using mash distances to find approximate alignment candidates of a given minimum length and identity threshold between a query and target. I think it would be very interesting to use something like this as input to progressive construction with minigraph. The complexity of computing the approximate mapping doesn't blow up in the case of repeats as with the chaining DAG model.

In addition, I wouldn't recommend to use edlib for graph construction because edit-distance based alignment is not evolutionarily meaningful.

The lack of affine gap model is a problem, but how is this an issue? I understand that any pairwise alignment could suffer from a lack of evolutionary significance, or in other words would produce somewhat arbitrary alignments that don't reflect the parsimonious mutational processes that we'd expect in a given context.

The smoothing step I've mentioned is an experimental way to address this by running local bundles of sequences that are collinear in the raw graph through an MSA. Right now it's using a relatively crude MSA (simple POA with affine gaps), but this could in principle be improved with another more expensive model.

lh3 commented 3 years ago

If I align a longer copy of a CNV to a shorter copy, how does minigraph decide which specific allele was relatively inserted? Would the chaining run through until the end of the CNV, marking the last part of the longer CNV as a relative insertion? Or would it be able to find the specific copies of the CNV that are inserted?

You can align through a segdup if there are long enough flanking sequences. At least in common cases, different copies of duplications can be identified by basepair differences between copies –– this is how HiCanu/hifiasm works. There is one true alignment. If the basepair difference is large enough, minigraph can find the right one.

I know the basic ideas of mashmap2. What I am not sure is how it handles very long indels. I guess it can't align through. Either way, fragmented alignments like minimap2 is a bad fit for me.

No scoring schemes can always produce evolutionally meaningful alignment. Edit distance is just the worst. In comparison to concave gap costs used in minimap2 and ngmlr, edit distance much more often fragments one long event into short pieces, making downstream annotation and interpretation harder. It is pretty much useless for SV calling.

ekg commented 3 years ago

How wide of a bandwidth can you run in minimap2? Wouldn't you have the problem of aligning through larger SVs with the adaptive banded alignment? My impression was that you often tend to get two separate mappings on either side of an SV because the chains won't connect. This is a bit afield for this repo so I can repost some of these questions on the minimap2 repo if you prefer.

FWIW, I'm not seeing major issues with edlib alignments and larger indels. Do you have some cases that you'd suggest trying?

ekg commented 3 years ago

So for edyeet/mashmap2, you could specify a segment length (say 10kb) and an identity limit. If it's 75%, then you could find approximately 2.5kb of indel with that single mapping. This is approximate but fast. In effect you can get somewhat ragged borders to the alleles. But if the sequences are all accessible in the output variation graph then it's easy enough to run an MSA locally over the graph to clean up the breakpoints (that's smoothxg).

lh3 commented 3 years ago

Aligning large gaps is very difficult. My understanding is that mashmap2 seeks the boundaries of alignments. I doubt it can align through large gaps. In practice, mashmap2 does produce more fragmented mappings than minigraph. Whether MSA can fix the problem caused by edit distance depends on the scoring scheme of MSA and the way you trigger MSA. I have give you the assemblies and the regions. Try those.