vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.09k stars 193 forks source link

VG surject with cactus graph #2812

Open RenzoTale88 opened 4 years ago

RenzoTale88 commented 4 years ago

Good morning, I've aligned ATAC-seq data to a five-genomes graph generate through cactus. The alignment was performed through VG 1.23.0, and then I've tried to surject the reads to one of the genomes using v1.24.0 with the command:

vg surject -t 1 -b -x my.xg mygam.gam > mygam.bam

However, I get the following error:

Crash report for vg v1.24.0 "Montieri"
Stack trace (most recent call last):
#19   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x4ddbb9, in _start
#18   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x1c41438, in __libc_start_main
#17   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x40a627, in main
#16   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xa33cf7, in vg::subcommand::Subcommand::operator()(int, char**) const
#15   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x9bd670, in main_surject(int, char**)
#14   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xaf486c, in vg::get_input_file(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (std::istream&)>)
#13   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x9bb754, in std::_Function_handler<void (std::istream&), main_surject(int, char**)::{lambda(std::istream&)#4}>::_M_invoke(std::_Any_data const&, std::istream&)
#12   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x1c2734e, in GOMP_parallel
#11   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x9bf639, in void vg::io::for_each_parallel_impl<vg::Alignment>(std::istream&, std::function<void (vg::Alignment&, vg::Alignment&)> const&, std::function<void (vg::Alignment&)> const&, std::function<bool ()> const&, unsigned long) [clone ._omp_fn.0]
#10   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x1c2c144, in GOMP_task
#9    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x9bef2d, in void vg::io::for_each_parallel_impl<vg::Alignment>(std::istream&, std::function<void (vg::Alignment&, vg::Alignment&)> const&, std::function<void (vg::Alignment&)> const&, std::function<bool ()> const&, unsigned long) [clone ._omp_fn.1]
#8    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x68a5ab, in std::_Function_handler<void (vg::Alignment&, vg::Alignment&), void vg::io::for_each_parallel<vg::Alignment>(std::istream&, std::function<void (vg::Alignment&)> const&, unsigned long)::{lambda(vg::Alignment&, vg::Alignment&)#1}>::_M_invoke(std::_Any_data const&, vg::Alignment&, vg::Alignment&)
#7    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x9bfad9, in main_surject(int, char**)::{lambda(std::istream&)#4}::operator()(std::istream&) const::{lambda(vg::Alignment&)#2}::operator()(vg::Alignment) const
#6    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xdf988f, in vg::Surjector::surject(vg::Alignment const&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool, bool) const
#5    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xdf8e1e, in vg::Surjector::surject(vg::Alignment const&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, long&, bool&, bool, bool) const
#4    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xb382f2, in vg::MultipathAlignmentGraph::align(vg::Alignment const&, handlegraph::HandleGraph const&, vg::GSSWAligner const*, bool, unsigned long, bool, unsigned long, double, unsigned long, vg::MultipathAlignment&, bool)
#3    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xb35cb7, in vg::MultipathAlignmentGraph::align(vg::Alignment const&, handlegraph::HandleGraph const&, vg::GSSWAligner const*, bool, unsigned long, bool, unsigned long, double, std::function<unsigned long (vg::Alignment const&, handlegraph::HandleGraph const&)>, vg::MultipathAlignment&, bool)
#2    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xbd43f1, in vg::Aligner::align_global_banded_multi(vg::Alignment&, std::vector<vg::Alignment, std::allocator<vg::Alignment> >&, handlegraph::HandleGraph const&, int, int, bool) const
#1    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xe3f6b0, in vg::BandedGlobalAligner<int>::align(signed char*, signed char*, signed char, signed char)
#0    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xe2607c, in vg::BandedGlobalAligner<int>::BAMatrix::fill_matrix(handlegraph::HandleGraph const&, signed char*, signed char*, signed char, signed char, bool, int)

I get a similar error when I specify a list of paths using -F or a single path using -p:

Crash report for vg v1.24.0 "Montieri"
Stack trace (most recent call last):
#19   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x4ddbb9, in _start
#18   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x1c41438, in __libc_start_main
#17   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x40a627, in main
#16   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xa33cf7, in vg::subcommand::Subcommand::operator()(int, char**) const
#15   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x9bd670, in main_surject(int, char**)
#14   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xaf486c, in vg::get_input_file(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (std::istream&)>)
#13   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x9bb754, in std::_Function_handler<void (std::istream&), main_surject(int, char**)::{lambda(std::istream&)#4}>::_M_invoke(std::_Any_data const&, std::istream&)
#12   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x1c2734e, in GOMP_parallel
#11   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x9bf639, in void vg::io::for_each_parallel_impl<vg::Alignment>(std::istream&, std::function<void (vg::Alignment&, vg::Alignment&)> const&, std::function<void (vg::Alignment&)> const&, std::function<bool ()> const&, unsigned long) [clone ._omp_fn.0]
#10   Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x1c2c144, in GOMP_task
#9    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x9bef2d, in void vg::io::for_each_parallel_impl<vg::Alignment>(std::istream&, std::function<void (vg::Alignment&, vg::Alignment&)> const&, std::function<void (vg::Alignment&)> const&, std::function<bool ()> const&, unsigned long) [clone ._omp_fn.1]
#8    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x68a5ab, in std::_Function_handler<void (vg::Alignment&, vg::Alignment&), void vg::io::for_each_parallel<vg::Alignment>(std::istream&, std::function<void (vg::Alignment&)> const&, unsigned long)::{lambda(vg::Alignment&, vg::Alignment&)#1}>::_M_invoke(std::_Any_data const&, vg::Alignment&, vg::Alignment&)
#7    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0x9bfad9, in main_surject(int, char**)::{lambda(std::istream&)#4}::operator()(std::istream&) const::{lambda(vg::Alignment&)#2}::operator()(vg::Alignment) const
#6    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xdf988f, in vg::Surjector::surject(vg::Alignment const&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool, bool) const
#5    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xdf8e1e, in vg::Surjector::surject(vg::Alignment const&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, long&, bool&, bool, bool) const
#4    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xb382f2, in vg::MultipathAlignmentGraph::align(vg::Alignment const&, handlegraph::HandleGraph const&, vg::GSSWAligner const*, bool, unsigned long, bool, unsigned long, double, unsigned long, vg::MultipathAlignment&, bool)
#3    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xb35cb7, in vg::MultipathAlignmentGraph::align(vg::Alignment const&, handlegraph::HandleGraph const&, vg::GSSWAligner const*, bool, unsigned long, bool, unsigned long, double, std::function<unsigned long (vg::Alignment const&, handlegraph::HandleGraph const&)>, vg::MultipathAlignment&, bool)
#2    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xbd43f1, in vg::Aligner::align_global_banded_multi(vg::Alignment&, std::vector<vg::Alignment, std::allocator<vg::Alignment> >&, handlegraph::HandleGraph const&, int, int, bool) const
#1    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xe3f6b0, in vg::BandedGlobalAligner<int>::align(signed char*, signed char*, signed char, signed char)
#0    Object "/PATH/TO/Andrea/vg/vg-1.24.0", at 0xe2609f, in vg::BandedGlobalAligner<int>::BAMatrix::fill_matrix(handlegraph::HandleGraph const&, signed char*, signed char*, signed char, signed char, bool, int)

Not sure how to proceed. Is it something caused by the use of a cactus graph as input? Also, I have generated the indexes with a previous version of vg, is that going to affect this stage? Thank you in advance for the help Andrea

jeizenga commented 4 years ago

Hi Andrea, Surject really shouldn't crash with a cactus graph, although some parts of its core algorithm might be slow. Can you share the data to reproduce this bug?

RenzoTale88 commented 4 years ago

Hi @jeizenga thanks for the quick reply! I'm not sure if I can manage it. The files are large (tens of Gb), both for the genomes and the reads. I'll check what I can do and come back as soon as I can. Sorry for the delay Andrea

RenzoTale88 commented 4 years ago

Hi @jeizenga, I've got a question regarding vg surject. If I try to run it with the flag -l to get local instead of global, the software seems to be generating the bam file just fine. Do you think the resulting bam will be dramatically different from the one obtained without the same flag? Thank you in advance Andrea

jeizenga commented 4 years ago

It's probably quite different in regions around large genomic variations. For instance, if a read aligns over a large deletion, then without -l the alignment will always preserve the deletion. With -l, it will probably soft-clip one side of the deletion in order to avoid having a negative alignment score.

RenzoTale88 commented 4 years ago

@jeizenga thank you for your explanation, very clear! Ok I might be able to share a subset of reads and a chunk of the graph. I've generated these chunks with vg chunk, separating by component. I'll share the data through onedrive if you can provide me with an email.

jeizenga commented 4 years ago

Sure, you can send the link to joeizeng@gmail.com

jeizenga commented 4 years ago

Okay, I've had a chance to look into this a little bit. In a certain sense, the algorithm is working correctly. However, the projection down to one of the paths requires it to realign 16,655,225 bp of reference sequence to 1 bp of read sequence (probably around some large deletion in one of the genomes). As currently implemented, this seems to require too much memory and it crashes. I'd like to look into some alternate alignment strategies for cases like this one, but if you have a high-memory machine available you could also use that in the meantime -- although it will be slow.

RenzoTale88 commented 4 years ago

@jeizenga thanks for the quick reply! Maybe it can be a good choice ot try with more stringent trimming options, and then I'll try using more memory while surjecting. What can be a reasonable estimate in your opinion? Thank again Andrea

jeizenga commented 4 years ago

I believe you mean trimming the reads, right? If so, I think that probably won't be an effective workaround. This has more to do with the structure of the graph than the quality of the reads. In the area where I found the problem, the graph looks something like this:

           /-(2: 1  bp)-\
(1: 100 bp)              (4: 100 bp)
           \-(3: 1M bp)-/

The read follows the path 1 -> 2 -> 4, and the reference you're projecting onto follows the path 1 -> 3 -> 4. The alignment of the read gets anchored to the reference on nodes 1 and 4, and then the algorithm sees that there is 1 base of sequence that it needs to realign to the reference path. Unfortunately, the algorithm isn't written to be aware of the fact that the read sequence aligned to 2 and the sequence of 3 are such different lengths, which is where we run into problems.

Upshot: any read aligned to this motif is going to have similar problems, regardless of the quality of the read.

ekg commented 4 years ago

Could the surjection be driven by a banded alignment run from each anchoring position on the reference outward? I'm thinking of dozeu.

On Sat, Jun 6, 2020, 20:46 Jordan Eizenga notifications@github.com wrote:

I believe you mean trimming the reads, right? If so, I think that probably won't be an effective workaround. This has more to do with the structure of the graph than the quality of the reads. In the area where I found the problem, the graph looks something like this:

       /-(2: 1  bp)-\

(1: 100 bp) (4: 100 bp) -(3: 1M bp)-/

The read follows the path 1 -> 2 -> 4, and the reference you're projecting onto follows the path 1 -> 3 -> 4. The alignment of the read gets anchored to the reference on nodes 1 and 4, and then the algorithm sees that there is 1 base of sequence that it needs to realign to the reference path. Unfortunately, the algorithm isn't written to be aware of the fact that the read sequence aligned to 2 and the sequence of 3 are such difference lengths, which is where we run into problems.

Upshot: any read aligned to this motif is going to have similar problems, regardless of the quality of the read.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2812#issuecomment-640102373, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEI2FFYJPP3NK47OMGTRVKFITANCNFSM4NJJOKSA .