Hi Erik
I'm also getting a std::bad_alloc error, and I'm running on a 64GB machine.
This was running on a GFA from two chromosomes that had been produced by edyeet and seqwish, with segment length 50,000, p75, a75, n20 and k16 (I'm going to also try with s100,000 and k11).
I'll try running it on another machine with more RAM (252GB, as our 500GB machine is currently down with problems). However, since I'm getting these memory issues with only two chromosomes, I have my doubts about it scaling well up to multiple genomes, even with 500GB RAM.
I'll also try again with fewer threads, as your comment suggests; I was using the full 16 cores. I'll try two cores and see how it goes before having to transfer the files to another machine with more RAM.
topological sort 118986178 of 118986178 ~ 100.0000%
[smoothxg::smoothable_blocks] computing blocks 100.00%
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
srun: error: node-14: task 0: Aborted (core dumped)
Just for reference, my GFA file was 3.8GB in size and its PAF file was 87MB.
I also ran edyeet with s=10000, p=85, a=90, k=16, and n=10, resulting in a smaller 37MB PAF file and a 1.5GB GFA file.
Smoothxg ran successfully with the smaller GFA file; however, the resulting graph layout was a mess after running odgi viz and odgi draw with the following commands:
odgi build -g Morex_v1_5H_vs_Morex_v2_5H.smooth.gfa -o Morex_v1_5H_vs_Morex_v2_5H.og
odgi viz -i Morex_v1_5H_vs_Morex_v2_5H.og -o Morex_v1_5H_vs_Morex_v2_5H.viz.png -x 1500 -y 500 -P 5
odgi chop -i Morex_v1_5H_vs_Morex_v2_5H.og -c 100 -o Morex_v1_5H_vs_Morex_v2_5H.chop.og
odgi layout -i Morex_v1_5H_vs_Morex_v2_5H.chop.og -o Morex_v1_5H_vs_Morex_v2_5H.chop.og.lay -t 16 -P
odgi draw -i Morex_v1_5H_vs_Morex_v2_5H.chop.og -c Morex_v1_5H_vs_Morex_v2_5H.chop.og.lay -p Morex_v1_5H_vs_Morex_v2_5H.chop.og.lay.png -H 1000 -t 16
My graph with s=10000, p=85, a=90, k=16, and n=10:
[image: Morex_v1_5H_vs_Morex_v2_5H_s10000_p85_a90_n10 chop og lay_cropped] https://user-images.githubusercontent.com/8529807/93557214-f8ad0080-f9ac-11ea-9ff0-24f21e64aae7.png
I suspect this is specifically about the scalability of a single alignment problem, and not the total process. I have a test case locally which causes the same issue with two even smaller graphs. Tracing it down, it appears to occur deterministically in the spoa code, and not due to an out-of-memory condition. The allocator tries to get some enormous amount of memory and that throws the error. Today, I'll check what input is generating this problem. Accidentally feeding empty strings to spoa caused this in the past.
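As a trivial guard against the empty-string case, something like this sketch (my naming, not smoothxg's actual code) would keep such inputs away from spoa:

#include <string>
#include <vector>

// Hypothetical pre-filter: drop empty sequences before a block is handed
// to spoa, since empty inputs have triggered this allocator error before.
std::vector<std::string> drop_empty(const std::vector<std::string>& seqs) {
  std::vector<std::string> kept;
  kept.reserve(seqs.size());
  for (const auto& s : seqs) {
    if (!s.empty()) {
      kept.push_back(s);
    }
  }
  return kept;
}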
This is unrelated, but to get a cleaner graph you will probably need to greatly increase -s to edyeet. I would test at 100kb.
I've found the problem with a quick check in gdb. It's the usual suspect: putting a really long sequence into spoa requires quadratic memory.
#12 0x00005555556b2033 in spoa::SimdAlignmentEngine<(spoa::Arch)3>::affine<spoa::InstructionSet<(spoa::Arch)3, int> > (this=0x7fff8011b720,
sequence=0x7fff80141740 "ATTCTT"..., sequence_size=139289, graph=std::unique_ptr<class spoa::Graph> = {...}, score=0x7fffedde58e0)
The sequence is 139kb long. There should probably be a check to break these up somehow.
@rvaser do you have any suggestions about how to break them? I wonder if the graph will break apart if I consistently break the input sequences at the same places.
@brettChapman given where the error is happening, I'd guess you'll see the same thing if you run under gdb and pull out a backtrace.
@jmonlong this is probably the problem you saw too. Do you have a test you can run in gdb?
Like this:
gdb smoothxg -ex 'r <command line>'
where <command line> is your usual set of smoothxg arguments.
Then, when it breaks, type bt and hit enter to get the backtrace. You'll see something like what I have above. Would be good to double check this is the problem for you too.
@ekg, yes spoa has quadratic memory complexity and you can "catch" the allocation overflow before calling the Align function by calculating c * 4 * sequence.size() * graph.nodes().size(), where c equals 1, 3 or 5 for linear, affine and convex gaps, respectively. Or just put a try/catch block around the Align function.
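A minimal sketch of that pre-check, assuming spoa's v4-style C++ API (spoa::Graph, its nodes() accessor) and an arbitrary 8 GiB cap of my own choosing:

#include <cstdint>
#include <string>
#include "spoa/spoa.hpp"

// c = 1 (linear), 3 (affine) or 5 (convex), matching the gap model in use.
bool would_overflow(const std::string& sequence, const spoa::Graph& graph,
                    std::uint64_t c) {
  // Illustrative cap (assumption); set it to what the machine can serve.
  const std::uint64_t kMaxBytes = 8ULL << 30;  // 8 GiB
  const std::uint64_t needed = c * 4 *
      static_cast<std::uint64_t>(sequence.size()) * graph.nodes().size();
  return needed > kMaxBytes;
}

Plugging in the backtrace above: with affine gaps (c = 3), a 139,289 bp sequence against a graph with a comparable number of nodes needs roughly 3 * 4 * 139,289 * 139,289 ≈ 233 GB, which is why even large-memory machines throw std::bad_alloc.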
For graph breaks I am not sure, as I do not know the use case in smoothxg. Could you please elaborate a bit on how exactly you are using spoa here?
Thanks for the tip about how to catch this. In smoothxg the goal is to build a POA for each local block of a variation graph. The block boundaries are determined using a greedy heuristic that tries to keep the problem size for each block under some number of input base pairs.
Breaking apart the sequence and aligning each piece to the graph could introduce artifacts, but it also might be approximately correct. I think I'll probably do this.
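A hedged sketch of that splitting idea, again assuming spoa's v4-style API; the function and the fixed piece size are illustrative, not smoothxg code:

#include <cstddef>
#include <string>
#include "spoa/spoa.hpp"

// Align an over-long sequence piecewise. The fixed, arbitrary cut points
// are exactly where the artifacts mentioned above could come from.
void align_in_pieces(spoa::AlignmentEngine& engine, spoa::Graph& graph,
                     const std::string& seq, std::size_t max_piece) {
  for (std::size_t i = 0; i < seq.size(); i += max_piece) {
    const std::string piece = seq.substr(i, max_piece);
    auto aln = engine.Align(piece, graph);
    graph.AddAlignment(aln, piece);
  }
}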
How is selection of linear, affine, and convex gaps done? I thought it was just based on the relative scores that are provided?
I found what's causing the higher-than-expected input sequence length. It happens in VNTRs. We collect the region to run spoa on, and if we pick up a VNTR loop in the graph, the subsequences that traverse it all get concatenated. This makes them very long, more than the parameter setting of -w / path_depth would have us expect.
This graph represents a locus that causes this problem.
It's built from these sequences: oom_smoothxg_block_1436_oom.fa.txt
I can't get it to run with spoa:
spoa oom_smoothxg_block_1436_oom.fa
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
But, I don't get the failure if I re-run the graph construction on just these sequences.
pggb -i oom_smoothxg_block_1436_oom.fa -s 50000 -p 70 -a 70 -n 10 -t 16 -v -l -k 4
Which lets us look at their relationship.
The paths all cover the full graph (odgi viz -b -w 5 -y 100):
But their depth is very high in the VNTR (odgi viz -b -w 5 -m -y 100) (blue means high depth, black = 1):
You can see that in the Bandage view. odgi is annotating the nodes with their coverage, and Bandage plots this as line width.
Also, using the odgi layout -N parameter will give you a Gaussian 0,0-centered layout, which might make it more legible.
@ekg, yes the gap model depends on the alignment parameters as follows (g/e are the first gap open/extend penalties, q/c the second pair):
linear if g >= e
affine if g <= q or e >= c
convex otherwise
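Tying this back to the memory pre-check above, a small sketch (my naming, not spoa's API) that derives the factor c in the estimate from those same comparisons:

int gap_model_factor(int g, int e, int q, int c) {
  // g/e = first gap open/extend, q/c = second gap open/extend,
  // per the conditions above.
  if (g >= e) return 1;            // linear
  if (g <= q || e >= c) return 3;  // affine
  return 5;                        // convex
}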
@ekg Thanks for the updates. I've generated some new alignments now with s100K, and I'll also try odgi layout with -N to see if it improves the layout of the graph. I've also been thinking it may be worthwhile aligning particular gene regions (1kbp upstream/downstream) pulled from each genome, to test parameters on a smaller scale and improve the throughput of results. It would also help my work colleagues here answer some of their burning questions sooner for particular genes of interest. Apart from the edyeet -s segment length parameter, would most parameters that work well at a maximum sequence length of around 100kbp for single gene regions (the largest gene is around 130kbp) also scale well to the whole genome?
It is definitely worth focusing on small regions. I think that is essential to get a sense of what the different parameters can do. That said, in big graphs you can get different patterns of collapse, so keep that in mind when testing with small graphs. The runtime of alignment can become limiting in some cases as well. It's a balance: you want to test small to see what's going on, but not all the parameters scale perfectly due to the presence of additional matching material in the whole pangenome graph.
This is the prep.gfa graph (obtained with -K to keep the temporary files around).
There is a really intense tangle that arises because of the low -k parameter. This makes it basically impossible for the heuristics that try to break the input sequences to spoa at SV boundaries to work. The sorting puts this tangle in the same part of the graph, but the jumps between successive path steps in the order tend to be pretty short, and loopy (I think). Will try to see if heuristics are enough. If not, I may need to run homology analysis on the sequences to decide how to break them.
Plotting the traversals of this tangle (as seen from smoothxg's perspective) shows a clear periodicity.
Fewer cycles are present in haplotypes that have a shorter VNTR allele length.
I'd like to break these sequences into pieces so that each repeat unit is a separate input sequence to spoa. This should be controlled by some parameter which defines the size of SDs/VNTRs to maintain as cyclic in the output.
Because we are not looping "cleanly" through the locus in the prep graph (making the loops clean is in fact why we run smoothxg), it's not clear that we can just count how many times we've seen a given node to decide when to break the sequence. There are loops within loops, and many smaller nodes get huge numbers of steps on them. We can't simply break whenever the self step count of a path increases.
It will be very difficult to accurately determine the period without using statistical analysis. That said, autocorrelation plots make it pretty clear that we can use such an analysis to find a repeat length. But finding the starts and ends of those repeats in the sequences, so as to break them, still seems tricky to me.
The structure of the repeats is very apparent in autocorrelation. However, the exact repeat length is not stable, and so finding the start and end is not easy unless we use alignment. The only reasonable solution appears to be to run the self alignment of sequences in these loci and break them at obvious repeat and inversion starts.
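For reference, the kind of period estimate being described, as a naive sketch: score each lag by the fraction of positions where the sequence matches itself shifted by that lag, and take the best-scoring lag as the candidate period. Purely illustrative, and consistent with the caveat above: it finds a length, not the repeat boundaries.

#include <cstddef>
#include <string>

// Naive O(n^2) autocorrelation over a DNA string. The best lag is a
// candidate repeat period (or a multiple of one).
std::size_t estimate_period(const std::string& s, std::size_t max_lag) {
  std::size_t best_lag = 0;
  double best_score = 0.0;
  for (std::size_t lag = 1; lag <= max_lag && lag < s.size(); ++lag) {
    std::size_t matches = 0;
    for (std::size_t i = 0; i + lag < s.size(); ++i) {
      matches += (s[i] == s[i + lag]);
    }
    const double score = static_cast<double>(matches) / (s.size() - lag);
    if (score > best_score) {
      best_score = score;
      best_lag = lag;
    }
  }
  return best_lag;
}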
I'm running into problems like this (I have only 16GB of RAM on my test machine, so it's pretty easy to run out of memory).
It looks to me that I'm usually running out of memory during the spoa generation step. This can be mitigated somewhat by running with fewer threads. But it seems reasonable to be able to tell when to bail out on a particular alignment problem and reformulate it. One option would be to detect memory allocation problems and then break the spoa block into pieces. The other would be to calculate how big the problems will be during the block determination phase and prevent them from ever getting too big. This should ideally be user-configurable, so that the results are reproducible. (It's worth bearing in mind that the prep step is stochastic and won't always yield the same graph, so smoothxg is inherently inconsistent.)
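A hedged sketch of the first option, again assuming spoa's v4-style API and hypothetical naming: catch the failed allocation and recursively halve the offending sequence.

#include <cstddef>
#include <new>
#include <string>
#include "spoa/spoa.hpp"

// Try the whole sequence; on std::bad_alloc, halve it and retry each part.
// Where the halving lands depends on how much memory happens to be free,
// which is the reproducibility concern raised above.
void align_or_split(spoa::AlignmentEngine& engine, spoa::Graph& graph,
                    const std::string& seq) {
  try {
    auto aln = engine.Align(seq, graph);
    graph.AddAlignment(aln, seq);
  } catch (const std::bad_alloc&) {
    if (seq.size() < 2) throw;  // nothing left to split
    const std::size_t half = seq.size() / 2;
    align_or_split(engine, graph, seq.substr(0, half));
    align_or_split(engine, graph, seq.substr(half));
  }
}

Because the split points depend on available memory, the second option (capping block size up front with a user-visible parameter) is the one that keeps results reproducible across machines.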