vgteam / GetBlunted

For bluntifying overlapped GFAs
12 stars 0 forks source link

Segmentation fault on compacted DBG #52

Open Wenhai-Zhang opened 11 months ago

Wenhai-Zhang commented 11 months ago

Hi, I try to use get_blunted(pre-compile) on graphs which are compacted de Bruijn graph from Bifrost. My system is centos7 with gcc 9.4.

I test all the graphs which construct with 2 or 10 or 100 bacteria genomes and they report error segmentation fault. Here I show the error ran on the graph constructed with 10 genomes.

[get_blunted] Executing command: get_blunted -V -i bifrost_10_graph.gfa
[get_blunted : 0.0 s elapsed] Reading GFA...
[get_blunted : 0.0 s elapsed] Computing adjacency components...
[get_blunted : 0.0 s elapsed] Total adjacency components: 22273
[get_blunted : 0.0 s elapsed] Computing biclique covers...
Segmentation fault (core dumped)

The coredump file detail

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-119.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/wenhai/application/GetBlunted/get_blunted...done.
[New LWP 16181]
Core was generated by `get_blunted -V -i bifrost_10_graph.gfa'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004ac160 in bluntifier::ReducedDualGraph::convert_to_biclique_cover(std::vector<unsigned long, std::allocator<unsigned long> >&) const ()

Could you please help me out? Thank you in advance!

jeizenga commented 11 months ago

Definitely looks like a bug. Are you able to provide the input data so I can reproduce it?

Wenhai-Zhang commented 11 months ago

bifrost_10_graph.gfa.gz Oh, sure, I upload the gfa file (~12M).

jeizenga commented 11 months ago

I've implemented a solution in the master branch, so if you rebuild you should be able to run without crashing. The solution is a bit of a hack though. Probably a better way to avoid this problem is to rebuild the DBG with an odd-length overlap. This is a common technique that prevents overlaps from being DNA palindromes (because the middle base always changes during reverse complementation). These palindromic overlaps were the fundamental cause of the bug.

Wenhai-Zhang commented 10 months ago

Thank you very much for your help, I spent some time resolving the error during the installation of GetBlunted and I'm sorry to say this issue is not solved yet. @jeizenga

First, when I ran compacted DBG(cDBG) built with 10 genomes, it does work, but when I ran cDBG built with 100 genomes, it still reports this error. And probably I will run cDBG built with thousand of bacterial genomes. The data (~130M) cDBG with 100 genomes: https://drive.google.com/file/d/1Vg2Sq0Zja-xdrhscZCOYRxqRV5_-DiuR/view?usp=sharing. You can have a try.

Second, I'm very confused about building DBG with odd-length overlaps. Normally, DBG is built with odd kmer to avoid palindromes, so the overlap should be the kmer length minus 1, whose length should be an even number. I mean, in general, palindromes appear when kmer is even. For cDBG, due to compression, nodes or say untigs become arbitrary lengths, palindromic situations are inevitable. In this case, using an even-length kmer(odd-length overlap) would be very complicated and perhaps become a big problem.

Third, assuming you are right, although I do not agree. I built a cDBG with kmer 30(that is overlap length 29) using 100 genomes. It still report a error, but it is not the same as the previous.

[get_blunted] Executing command: get_blunted -V -i bifrost_n100k30_graph.gfa
[get_blunted : 0.0 s elapsed] Reading GFA...
WARNING: skipping overlap that reverses back onto the same sequence (in de Bruijn graphs, these can be avoided by choosing an odd-numbered overlap length): 17840
WARNING: skipping overlap that reverses back onto the same sequence (in de Bruijn graphs, these can be avoided by choosing an odd-numbered overlap length): 27402
WARNING: skipping overlap that reverses back onto the same sequence (in de Bruijn graphs, these can be avoided by choosing an odd-numbered overlap length): 18872
WARNING: skipping overlap that reverses back onto the same sequence (in de Bruijn graphs, these can be avoided by choosing an odd-numbered overlap length): 8998
WARNING: skipping overlap that reverses back onto the same sequence (in de Bruijn graphs, these can be avoided by choosing an odd-numbered overlap length): 16946
WARNING: skipping overlap that reverses back onto the same sequence (in de Bruijn graphs, these can be avoided by choosing an odd-numbered overlap length): 13537
WARNING: skipping overlap that reverses back onto the same sequence (in de Bruijn graphs, these can be avoided by choosing an odd-numbered overlap length): 36227
WARNING: skipping overlap that reverses back onto the same sequence (in de Bruijn graphs, these can be avoided by choosing an odd-numbered overlap length): 38086
WARNING: skipping overlap that reverses back onto the same sequence (in de Bruijn graphs, these can be avoided by choosing an odd-numbered overlap length): 38303
WARNING: skipping overlap that reverses back onto the same sequence (in de Bruijn graphs, these can be avoided by choosing an odd-numbered overlap length): 38418
suppressing further warnings
[get_blunted : 21.0 s elapsed] Computing adjacency components...
[get_blunted : 25.0 s elapsed] Total adjacency components: 1303846
[get_blunted : 25.0 s elapsed] Computing biclique covers...
[get_blunted : 37.0 s elapsed] Total biclique covers: 1303164
[get_blunted : 38.0 s elapsed] Duplicating node termini...
[get_blunted : 1.2 m elapsed] Harmonizing biclique edge orientations...
[get_blunted : 1.2 m elapsed] Aligning overlaps...
[get_blunted : 1.5 m elapsed] Splicing 1303164 subgraphs...
[get_blunted : 2.0 m elapsed] Splicing overlapping overlap nodes...
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 18446744073709551589) > this->size() (which is 1)
Aborted(core dump)
rlorigro commented 10 months ago

Hi Wenhai, we are still looking into this. However, I am wondering what your intended usage of the bluntified graph is. Getblunted works by duplicating sequence to ensure that no new adjacencies are produced. In a de Bruijn graph with 1000 samples, you are going to have a large amount of variation/error and densely overlapping nodes. Bluntification is still going to result in a very fragmented and sequence-redundant graph.

Here is how your graph looks when attempting to do a 2d layout in Bandage:

image

Methods that are intended for conventional sequence graphs are not exactly going to thrive in this environment.

The bug you are seeing in the latest output is a result of attempting to stitch overlapping overlaps, however I cannot reproduce your result because you haven't shared the k30 graph. If you share it, I can take a look at that error. My guess is that there are chains of overlapping overlaps, which are a bit outside of the scope I tested for.

Wenhai-Zhang commented 10 months ago

Oh, thank you very much for your sharing layout in Bandage. Perhaps due to server configuration issues, I can't get the layout in Bandage.

I am wondering what your intended usage of the bluntified graph is.

I'm going to map short reads to the bluntified graph with vg giraffe. This is just my initial idea and I choose to use compacted de bruijn because now in terms of memory usage and runtime it perform well in building with a few thousands of metagenomes . But when I ran vg giraffe, it can only input bluntifed graph and it said GetBlunted can process my file into bluntified graph. So I have a try.

In a de Bruijn graph with 1000 samples,

Let me correct it, my graph is built with 100 genomes.

Methods that are intended for conventional sequence graphs are not exactly going to thrive in this environment.

Do you mean that compacted de bruijn is not suitable to blunt?

If you share it, I can take a look at that error.

I'm more than happy to share it. https://drive.google.com/file/d/1Akb9te6dDzSWIb43LGrPareJisGuXEik/view?usp=sharing

I have read your paper, and I try to run gimbricate/seqwish pipline and stark. The stark also report segmentation fault. The gimbricate/seqwish pipline can work well but it add many P lines which are not actual genome path in gfa file and these path probably is not what I want.

jeizenga commented 10 months ago

We're working on resolving this bug. However, I should caution you that vg giraffe is not really meant for this kind of graph. It's been designed some assumptions of structural simplicity in the graph. Also, it works best when you have known haplotypes/genomes that cover the entire graph (expressed as W lines in the GFA). For those simple graphs, vg giraffe is very fast. For complex graphs like this one, it can be impractically slow.

For a similar project, we found that we got better results converting the GFA nodes into a FASTA and mapping with bwa mem. If your graph typically has nodes that are much longer than reads, this works pretty well. With your data, you may want to consider this approach as well.

Wenhai-Zhang commented 10 months ago

Thank you very much for this good advice. I will make it as a new approach to try.