vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.1k stars 194 forks source link

SLLS haplotype scoring can crash with out-of-range paths #1582

Open adamnovak opened 6 years ago

adamnovak commented 6 years ago

In /cluster/home/anovak/hive/trash/broken-slls you can run:

vg mpmap --linear-index 1kg_hg19-CHR21_filter.vcf.gz.slls --linear-path 21 -x snp1kg-CHR21_filter.xg -g snp1kg-CHR21_filter.gcsa --snarls snp1kg-CHR21_filter.snarls --fastq sim.fq.gz -i -S -t 32 >/dev/null

You should get (about 2.9 GB through the output):

terminate called after throwing an instance of 'std::runtime_error'
  what():  haplotype queried is outside of reference range

That probably shouldn't happen. It may be that the SLLS file was generated with the wrong VCF; is that supported? Should that produce an error earlier instead of after a long time?

yoheirosen commented 6 years ago

Ok something weird is going on

full chr21 slls file: start pos 0 end pos 48119739 # of sites 1058338 1 haplotype chr21 slls file: start pos 0 end pos 48119739 # of sites 1058219

Loaded the slls in gdb (I'm working in my "profiler" application since it starts by loading it into memory) and I get the same structure reflected in the haplotypeCohort and siteIndex objects actually created from the slls index. So I don't think there's anything weird going on with the index.

I then tried running

/bin/profiler speed /cluster/home/anovak/hive/trash/broken-slls/1kg_hg19-CHR21_filter.vcf.gz.slls experiment_4.expt

in /cluster/home/yrosen/normalslls/sublinear-Li-Stephens/. What this does is load the slls index, randomly generate 10 20000 bp long "reads" from it, score them (and print a whole bunch of statistics). These look normal and don't throw any errors on construction (example below:)

(gdb) break profiler.cpp:409 (gdb) run speed /cluster/home/anovak/hive/trash/broken-slls/1kg_hg19-CHR21_filter.vcf.gz.slls experiment_4.expt (gdb) print *query_ih $1 = {reference = 0x6ce170, alleles = std::vector of length 6, capacity 6 = {T, A, C, C, A, A}, novel_SNVs = std::vector of length 6, capacity 6 = {0, 0, 0, 0, 0, 0}, absolute_start_pos = 9937363, absolute_end_pos = 9951617, start_offset_wrt_ref = 9937363, start_site = 0, end_site = 4, left_tail_length = 0, right_tail_length = 2206, has_no_sites = false, invalid = false}

Note that these are however built from the same slls index getting passed in for the scoring process. In haplotypes.hpp these inputHaplotypes are instead getting built from the xg, which finds the start and end positions of the read calling a function which, after checking a bunch of things, returns

xg_index.position_in_path(node_id, xg_ref_rank).at(0);

Is there any way that this could be outside the range 0, 48119739?

I guess the thing to do is run the vg mpmap command you specified above within gdb, break at haplotypes.hpp and see what's wrong. Can I get a minimal test version of this which throws the same error? I'm going to assume that the command runs too big a process to be reasonable for gdb

It think to start I'd want to see a read which throws the error, the node id of its start and end nodes and then the values of

xg_index.position_in_path(node_id, xg_ref_rank).at(0);

for node_id = start node, end node

On Wed, Mar 28, 2018 at 12:26 PM, Adam Novak notifications@github.com wrote:

Assigned #1582 https://github.com/vgteam/vg/issues/1582 to @yoheirosen https://github.com/yoheirosen.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1582#event-1546394416, or mute the thread https://github.com/notifications/unsubscribe-auth/AHmcLp6rkqwNcoIAHmiOQesHJ4-RfReUks5ti-P3gaJpZM4S_OKd .