vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

short read giraffe alignment crashes most of the time, works on a few random samples - Signal 11 #4290

Closed jespindel01 closed 1 month ago

jespindel01 commented 1 month ago

1. What were you trying to do? I'm trying to align paired-end short reads (FASTQ) to a single-chromosome pangenome using vg giraffe. While troubleshooting, I reduced the number of threads to 1 to make sure I was not running out of memory, and still got the error. Watching memory usage also confirmed that the crash occurred before even 10% of the memory was used. Here is an example call that caused the crash:

./vg giraffe -p -Z myPG.gbz -d index.dist -m index.min -f reads_fetched/lib_reads_R1.fastq -f reads_fetched/lib_reads_R2.fastq -t 1 > ./giraffe_output/lib.gam

2. What did you want to happen? I wanted to get the read alignments to the pangenome. A small number of commands in which I passed only a single FASTQ file ran to completion, but for the vast majority of samples, vg crashes with the stack trace below. Every time I try to pass two FASTQ files for paired-end data, I get the same vg crash message.

3. What actually happened? vg crashed

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

Crash report for vg v1.56.0 "Collalto"
Stack trace (most recent call last):
#18   Object "/opt/notebooks/vg", at 0x61c2f4, in _start
#17   Object "/opt/notebooks/vg", at 0x20df726, in __libc_start_main
#16   Object "/opt/notebooks/vg", at 0x20dde89, in __libc_start_call_main
#15   Object "/opt/notebooks/vg", at 0xe0e50b, in vg::subcommand::Subcommand::operator()(int, char**) const
#14   Object "/opt/notebooks/vg", at 0xd316ea, in main_giraffe(int, char**)
#13   Object "/opt/notebooks/vg", at 0xdc8d7f, in std::_Function_handler<void (std::function<void ()> const&), vg::subcommand::TickChainLink::get_iterator()::{lambda(std::function<void ()> const&)#1}>::_M_invoke(std::_Any_data const&, std::function<void ()> const&)
#12   Object "/opt/notebooks/vg", at 0xd2cccc, in main_giraffe(int, char**)::{lambda()#1}::operator()() const
#11   Object "/opt/notebooks/vg", at 0xec3a9f, in vg::fastq_paired_two_files_for_each_parallel_after_wait(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (vg::Alignment&, vg::Alignment&)>, std::function<bool ()>, unsigned long)
#10   Object "/opt/notebooks/vg", at 0x20bb645, in GOMP_parallel
#9    Object "/opt/notebooks/vg", at 0xecca15, in unsigned long vg::io::paired_for_each_parallel_after_wait<vg::Alignment>(std::function<bool (vg::Alignment&, vg::Alignment&)>, std::function<void (vg::Alignment&, vg::Alignment&)>, std::function<bool ()>, unsigned long) [clone ._omp_fn.0]
#8    Object "/opt/notebooks/vg", at 0xd2d541, in std::_Function_handler<void (vg::Alignment&, vg::Alignment&), main_giraffe(int, char**)::{lambda()#1}::operator()() const::{lambda(vg::Alignment&, vg::Alignment&)#6}>::_M_invoke(std::_Any_data const&, vg::Alignment&, vg::Alignment&)
#7    Object "/opt/notebooks/vg", at 0x119154c, in vg::MinimizerMapper::map_paired(vg::Alignment&, vg::Alignment&, std::vector<std::pair<vg::Alignment, vg::Alignment>, std::allocator<std::pair<vg::Alignment, vg::Alignment> > >&)
#6    Object "/opt/notebooks/vg", at 0x11864ca, in vg::MinimizerMapper::map_from_extensions(vg::Alignment&)
#5    Object "/opt/notebooks/vg", at 0x1183fa9, in void vg::MinimizerMapper::process_until_threshold_c<double>(unsigned long, std::function<double (unsigned long)> const&, std::function<bool (unsigned long, unsigned long)> const&, double, unsigned long, unsigned long, vg::LazyRNG&, std::function<bool (unsigned long)> const&, std::function<void (unsigned long)> const&, std::function<void (unsigned long)> const&) const [clone .isra.0]
#4    Object "/opt/notebooks/vg", at 0x118077a, in std::_Function_handler<bool (unsigned long), vg::MinimizerMapper::map_from_extensions(vg::Alignment&)::{lambda(unsigned long)#4}>::_M_invoke(std::_Any_data const&, unsigned long&&)
#3    Object "/opt/notebooks/vg", at 0x117f767, in vg::MinimizerMapper::extend_cluster(vg::SnarlDistanceIndexClusterer::Cluster const&, unsigned long, vg::VectorView<vg::MinimizerMapper::Minimizer> const&, std::vector<vg::SnarlDistanceIndexClusterer::Seed, std::allocator<vg::SnarlDistanceIndexClusterer::Seed> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > >&, vg::Funnel&) const
#2    Object "/opt/notebooks/vg", at 0x100a141, in vg::GaplessExtender::extend(vg::pair_hash_set<std::pair<handlegraph::handle_t, long> >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, gbwtgraph::CachedGBWTGraph const*, unsigned long, double) const
#1    Object "/opt/notebooks/vg", at 0x1001b83, in vg::match_initial(vg::GaplessExtension&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::pair<char const*, unsigned long>)
#0    Object "/opt/notebooks/vg", at 0x2168690, in __memcpy_avx_unaligned_erms_rtm
ERROR: Signal 11 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.

5. What data and command can the vg dev team use to make the problem happen? To get the inputs to vg, I started with a .gfa file output by pggb and ran the following to get the inputs for vg giraffe:

./vg autoindex --workflow giraffe -g myPG.gfa -t 127

# get GBWT format
./vg gbwt -o myPG.gbwt -G myPG.gfa

# get GBZ format
./vg gbwt --gbz-format -g myPG.gbz -G myPG.gfa

The FASTQ inputs are simple paired-end short-read files that have been filtered, so they do not contain every read off the sequencer. They adhere to the standard FASTQ format.

6. What does running vg version say?

vg version v1.56.0 "Collalto"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by anovak@courtyard.gi.ucsc.edu

jltsiren commented 1 month ago

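The GBZ graph you built manually is not the same graph as the one you indexed using vg autoindex. For various practical reasons, long nodes must be chopped into shorter fragments before they can be used in vg. vg autoindex chops the nodes to 32 bp, while the vg gbwt default is 1024 bp. The distance and minimizer indexes (index.dist and index.min) therefore describe a differently chopped graph than myPG.gbz, and Giraffe needs all three files to refer to the same graph.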

You should have a GBZ graph (probably index.giraffe.gbz) from vg autoindex which you can use. However, because you built the graph with PGGB, its structure could be too complex and Giraffe might be slow.

Additionally, your two vg gbwt commands run the same (potentially expensive) algorithm with different outputs. If you need a separate GBWT file, you can extract it much faster from the GBZ with vg gbwt -o graph.gbwt -Z graph.gbz.
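
As a rough sketch of the advice above, the mapping call would take its graph and both indexes from the same vg autoindex run. The GBZ name index.giraffe.gbz is an assumption (the "probably" above), the output name index.gbwt is made up for illustration, and the read paths are the ones from the report. The script only prints the commands so the file names can be checked before anything is run.

```shell
#!/bin/sh
# Sketch only: file names are assumptions based on this thread.
PREFIX=index                  # prefix of the index.dist / index.min files in the report
GBZ="$PREFIX.giraffe.gbz"     # assumed autoindex GBZ name ("probably index.giraffe.gbz")

# Paired-end mapping with the GBZ, distance index, and minimizer index
# that all come from the same vg autoindex run.
GIRAFFE_CMD="./vg giraffe -p -Z $GBZ -d $PREFIX.dist -m $PREFIX.min -f reads_fetched/lib_reads_R1.fastq -f reads_fetched/lib_reads_R2.fastq -t 1"

# Optional: extract a standalone GBWT from that GBZ instead of rebuilding it from the GFA.
# The output name index.gbwt is hypothetical.
GBWT_CMD="./vg gbwt -o $PREFIX.gbwt -Z $GBZ"

# Print the commands for review; run them by hand once the file names check out.
echo "$GIRAFFE_CMD > ./giraffe_output/lib.gam"
echo "$GBWT_CMD"
```

If vg autoindex was run with a different output prefix, change PREFIX accordingly and confirm that the GBZ file it produced actually exists under that name.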

jespindel01 commented 1 month ago

Thank you