Open adamnovak opened 6 years ago
OK, something weird is going on.
full chr21 slls file:        start pos 0, end pos 48119739, # of sites 1058338
1-haplotype chr21 slls file: start pos 0, end pos 48119739, # of sites 1058219
I loaded the slls in gdb (I'm working in my "profiler" application, since it starts by loading the index into memory), and I see the same structure reflected in the haplotypeCohort and siteIndex objects actually created from the slls index. So I don't think anything weird is going on with the index itself.
I then tried running
/bin/profiler speed /cluster/home/anovak/hive/trash/broken-slls/1kg_hg19-CHR21_filter.vcf.gz.slls experiment_4.expt
in /cluster/home/yrosen/normalslls/sublinear-Li-Stephens/. This loads the slls index, randomly generates ten 20000 bp "reads" from it, scores them, and prints a whole bunch of statistics. These look normal and don't throw any errors on construction (example below):
(gdb) break profiler.cpp:409
(gdb) run speed /cluster/home/anovak/hive/trash/broken-slls/1kg_hg19-CHR21_filter.vcf.gz.slls experiment_4.expt
(gdb) print *query_ih
$1 = {reference = 0x6ce170, alleles = std::vector of length 6, capacity 6 = {T, A, C, C, A, A},
      novel_SNVs = std::vector of length 6, capacity 6 = {0, 0, 0, 0, 0, 0},
      absolute_start_pos = 9937363, absolute_end_pos = 9951617, start_offset_wrt_ref = 9937363,
      start_site = 0, end_site = 4, left_tail_length = 0, right_tail_length = 2206,
      has_no_sites = false, invalid = false}
Note, however, that these are built from the same slls index that gets passed in for the scoring process. In haplotypes.hpp, these inputHaplotypes are instead built from the xg, which finds the start and end positions of the read by calling a function which, after checking a bunch of things, returns
xg_index.position_in_path(node_id, xg_ref_rank).at(0);
Is there any way that this could be outside the range [0, 48119739]?
I guess the thing to do is to run the vg mpmap command you specified above within gdb, break in haplotypes.hpp, and see what's wrong. Can I get a minimal test version of this which throws the same error? I'm assuming the command runs too big a process to be reasonable under gdb.
I think to start I'd want to see a read which throws the error, the node IDs of its start and end nodes, and then the values of
xg_index.position_in_path(node_id, xg_ref_rank).at(0);
for node_id = start node and node_id = end node.
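Concretely, that inspection might look like the following gdb session sketch; the breakpoint location and the mpmap arguments are placeholders (the real breakpoint would go wherever haplotypes.hpp makes the position_in_path call):

```gdb
# Hypothetical session; line number and arguments are placeholders.
(gdb) break haplotypes.hpp:<line of the position_in_path call>
(gdb) run mpmap <arguments from the failing command>
# At the breakpoint, for the read's start node and end node:
(gdb) print node_id
(gdb) print xg_index.position_in_path(node_id, xg_ref_rank).at(0)
```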
On Wed, Mar 28, 2018 at 12:26 PM, Adam Novak notifications@github.com wrote:
Assigned #1582 https://github.com/vgteam/vg/issues/1582 to @yoheirosen https://github.com/yoheirosen.
In
/cluster/home/anovak/hive/trash/broken-slls
you can run:

You should get (about 2.9 GB through the output):
That probably shouldn't happen. It may be that the SLLS file was generated with the wrong VCF; is that supported? Should that produce an error earlier instead of after a long time?