vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.12k stars 194 forks source link

Understanding vg call output #469

Open hgibling opened 8 years ago

hgibling commented 8 years ago

I've mapped a bam file to my MHC graph and used vg call to produce a vcf file of variants, but I'm not sure how to interpret it. Many positions have multiple entries, and some of the IDs are duplicated. For example, position 30052145 has 23 entries (the first column is the number of line occurrences from grep 30052145 file.vcf | uniq -c):

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE
7 . 30052145    837884_837817_837813_1060355_837806_837804_837803_837801_837800_837799_837798_837796    TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTAGGATTGCAACCCCT TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTGGGATTGCAACCCCT 3.50629e-12 .   DP=10;XSEE=19303:0,19305:0,19307:0,19308:0,19310:0,19312:0,19313:0,19316:0,19324:0,19329:0,19421:0,321501:0,321502:0,321504:0,321505:0,321529:0,337671:0,357757:0,691962:0,691963:0,691963:18;XREF  GT:DP:AD:SB:XAAD:AL 1/1:10:0,10:0,0,0,10:10:-1e+100,-1.49947
1 . 30052145    837884_837817_837813_1060355_837806_837804_837803_837802_837800_837799_837798_837796    TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTAGGATTGCAACCCCT .   -0  .   DP=10;XSEE=19303:0,19305:0,19307:0,19308:0,19311:0,19312:0,19313:0,19316:0,19324:0,19329:0,19421:0,321501:0,321502:0,321504:0,321505:0,321529:0,337671:0,357757:0,691962:0,691963:0,691963:18;XREF  GT:DP:AD:SB:XAAD:AL 0/0:10:0,10:0,0,0,10:10:-1e+100,-1.49947
6 . 30052145    837884_837817_837813_837811_1048822_1048819 TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTA .   -0  .   DP=11;XSEE=19322:0,19324:0,19329:0,19421:0,321509:0,321514:0,321529:0,691962:0,691963:0,691963:18;XREF  GT:DP:AD:SB:XAAD:AL 0/0:11:0,11:0,0,0,11:11:-1e+100,-0.649482
1 . 30052145    837884_837817_837813_837811_837810_1048847_1048846_1048845_837801_837800_837799_837798_837796   TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTAGGATTGCAACCCCT TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTGGGATTGCAACCCCT -0  .   DP=2;XSEE=19303:0,19305:0,19307:0,19308:0,19310:0,19321:0,19322:0,19324:0,19329:0,19421:0,321501:0,321502:0,321504:0,321505:0,321529:0,321540:0,321541:0,321542:0,357757:0,691962:0,691963:0,691963:18;XREF GT:DP:AD:SB:XAAD:AL ./.:2:0,2:0,0,0,2:2:-1e+100,-1.49947
3 . 30052145    837884_837817_837813_837811_837810_1048847_1048846_837908_837803_837801_837800_837799_837798_837796 TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTAGGATTGCAACCCCT TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTGGGATTGCAACCCCT -0  .   DP=0;XSEE=19303:0,19305:0,19307:0,19308:0,19310:0,19312:0,19321:0,19322:0,19324:0,19329:0,19421:0,19452:0,321501:0,321502:0,321504:0,321505:0,321529:0,321541:0,321542:0,357757:0,691962:0,691963:0,691963:18;XREF  GT:DP:AD:SB:XAAD:AL ./.:0:0,0:0,0,0,0:0:-1e+100,-1.49947
2 . 30052145    837884_837817_837813_837811_837810_1048847_1048846_837908_837907_837905_837904_837799_837798_837796 TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTAGGATTGCAACCCCT TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGAGTAGGATTGCAACCCCT -0  .   DP=8;XSEE=19303:0,19305:0,19307:0,19321:0,19322:0,19324:0,19329:0,19421:0,19448:0,19449:0,19451:0,19452:0,321501:0,321502:0,321504:0,321505:0,321529:0,321541:0,321542:0,357757:0,691962:0,691963:0,691963:18;XREF  GT:DP:AD:SB:XAAD:AL ./.:8:0,8:0,0,0,8:8:-1e+100,-1.49947
1 . 30052145    837884_837817_837813_837811_837810_1048847_1048846_837908_837907_837906_837904_837799_837798_837796 TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTAGGATTGCAACCCCT .   -0  .   DP=5;XSEE=19303:0,19305:0,19307:0,19321:0,19322:0,19324:0,19329:0,19421:0,19448:0,19450:0,19451:0,19452:0,321501:0,321502:0,321504:0,321505:0,321529:0,321541:0,321542:0,357757:0,691962:0,691963:0,691963:18;XREF  GT:DP:AD:SB:XAAD:AL ./.:5:0,5:0,0,0,5:5:-1e+100,-1.49947
1 . 30052145    837884_837817_837813_837811_837810_837804_837803_837801_837800_837799_837798_837796 TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTAGGATTGCAACCCCT TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTGGGATTGCAACCCCT 1.52995e-11 .   DP=11;XSEE=19303:0,19305:0,19307:0,19308:0,19310:0,19312:0,19313:0,19321:0,19322:0,19324:0,19329:0,19421:0,321501:0,321502:0,321504:0,321505:0,321529:0,357757:0,691962:0,691963:0,691963:18;XREF   GT:DP:AD:SB:XAAD:AL 1/1:11:0,11:0,0,0,11:11:-1e+100,-1.49947
1 . 30052145    837884_837817_837816_837811_1048822_1048819 TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTA .   -0  .   DP=27;XSEE=19322:0,19328:0,19329:0,19421:0,321509:0,321514:0,321529:0,691962:0,691963:0,691963:18;XREF  GT:DP:AD:SB:XAAD:AL 0/0:27:0,27:0,0,0,27:27:-1e+100,-1.60631

First, what are the IDs (837884_837817_837813_837811_1048822_1048819)? I initially thought they were node IDs, but they don't seem to exist when I try vg find -n 837884 -c 2 -x graph.xg | vg view - (I get an Assertionidx < this->size()' failed.` error, which I've gotten before when searching for a non-existent node).

Second, should the same position, ID, ref and alt be listed more than once? I haven't yet managed to get decomposing block subs or normalization through vt to work yet, so I'm not sure if this would clear up some of the duplications.

(Relatedly, the readme says vt is found at http://varianttools.sourceforge.net/Association/HomePage, but I'm assuming this should be http://genome.sph.umich.edu/wiki/Vt ?)

glennhickey commented 8 years ago

Hi Heather,

Which version are you using? The duplicate nodes ought not to be an issue with the latest master (as of about 1 week ago). If you're on the latest, and can share your data and command line, please do!

The VCF ids are indeed node ids. The catch is that they are ids in the "augmented graph". vg call creates this intermediate graph by adding (and smoothing) novel stuff from the pileup into your input graph, splitting up nodes and changing most ids in the process. The non-reference parths of the augmented graph that meet the given thresholds of read support are then translated to VCF entries. You can get this graph with the -A option to vg call if you want to view it -- it will contain all the ids from the VCF.

Oh boy the readme link is indeed wrong. you have the right one! Sorry, will fix on my next PR.

cheers -Glenn

On Wed, Sep 7, 2016 at 5:03 PM, Heather Gibling notifications@github.com wrote:

I've mapped a bam file to my MHC graph and used vg call to produce a vcf file of variants, but I'm not sure how to interpret it. Many positions have multiple entries, and some of the IDs are duplicated. For example, position 30052145 has 23 entries (the first column is the number of line occurrences from grep 30052145 file.vcf | uniq -c):

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE

7 . 30052145 837884_837817_837813_1060355_837806_837804_837803_837801_837800_837799_837798_837796 TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTAGGATTGCAACCCCT TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTGGGATTGCAACCCCT 3.50629e-12 . DP=10;XSEE=19303:0,19305:0,19307:0,19308:0,19310:0,19312:0,19313:0,19316:0,19324:0,19329:0,19421:0,321501:0,321502:0,321504:0,321505:0,321529:0,337671:0,357757:0,691962:0,691963:0,691963:18;XREF GT:DP:AD:SB:XAAD:AL 1/1:10:0,10:0,0,0,10:10:-1e+100,-1.49947 1 . 30052145 837884_837817_837813_1060355_837806_837804_837803_837802_837800_837799_837798_837796 TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTAGGATTGCAACCCCT . -0 . DP=10;XSEE=19303:0,19305:0,19307:0,19308:0,19311:0,19312:0,19313:0,19316:0,19324:0,19329:0,19421:0,321501:0,321502:0,321504:0,321505:0,321529:0,337671:0,357757:0,691962:0,691963:0,691963:18;XREF GT:DP:AD:SB:XAAD:AL 0/0:10:0,10:0,0,0,10:10:-1e+100,-1.49947 6 . 30052145 837884_837817_837813_837811_1048822_1048819 TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTA . -0 . DP=11;XSEE=19322:0,19324:0,19329:0,19421:0,321509:0,321514:0,321529:0,691962:0,691963:0,691963:18;XREF GT:DP:AD:SB:XAAD:AL 0/0:11:0,11:0,0,0,11:11:-1e+100,-0.649482 1 . 30052145 837884_837817_837813_837811_837810_1048847_1048846_1048845_837801_837800_837799_837798_837796 TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTAGGATTGCAACCCCT TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTGGGATTGCAACCCCT -0 . DP=2;XSEE=19303:0,19305:0,19307:0,19308:0,19310:0,19321:0,19322:0,19324:0,19329:0,19421:0,321501:0,321502:0,321504:0,321505:0,321529:0,321540:0,321541:0,321542:0,357757:0,691962:0,691963:0,691963:18;XREF GT:DP:AD:SB:XAAD:AL ./.:2:0,2:0,0,0,2:2:-1e+100,-1.49947 3 . 30052145 837884_837817_837813_837811_837810_1048847_1048846_837908_837803_837801_837800_837799_837798_837796 TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTAGGATTGCAACCCCT TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTGGGATTGCAACCCCT -0 . DP=0;XSEE=19303:0,19305:0,19307:0,19308:0,19310:0,19312:0,19321:0,19322:0,19324:0,19329:0,19421:0,19452:0,321501:0,321502:0,321504:0,321505:0,321529:0,321541:0,321542:0,357757:0,691962:0,691963:0,691963:18;XREF GT:DP:AD:SB:XAAD:AL ./.:0:0,0:0,0,0,0:0:-1e+100,-1.49947 2 . 30052145 837884_837817_837813_837811_837810_1048847_1048846_837908_837907_837905_837904_837799_837798_837796 TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTAGGATTGCAACCCCT TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGAGTAGGATTGCAACCCCT -0 . DP=8;XSEE=19303:0,19305:0,19307:0,19321:0,19322:0,19324:0,19329:0,19421:0,19448:0,19449:0,19451:0,19452:0,321501:0,321502:0,321504:0,321505:0,321529:0,321541:0,321542:0,357757:0,691962:0,691963:0,691963:18;XREF GT:DP:AD:SB:XAAD:AL ./.:8:0,8:0,0,0,8:8:-1e+100,-1.49947 1 . 30052145 837884_837817_837813_837811_837810_1048847_1048846_837908_837907_837906_837904_837799_837798_837796 TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTAGGATTGCAACCCCT . -0 . DP=5;XSEE=19303:0,19305:0,19307:0,19321:0,19322:0,19324:0,19329:0,19421:0,19448:0,19450:0,19451:0,19452:0,321501:0,321502:0,321504:0,321505:0,321529:0,321541:0,321542:0,357757:0,691962:0,691963:0,691963:18;XREF GT:DP:AD:SB:XAAD:AL ./.:5:0,5:0,0,0,5:5:-1e+100,-1.49947 1 . 30052145 837884_837817_837813_837811_837810_837804_837803_837801_837800_837799_837798_837796 TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTAGGATTGCAACCCCT TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTATCAGAGACTGGGATTGCAACCCCT 1.52995e-11 . DP=11;XSEE=19303:0,19305:0,19307:0,19308:0,19310:0,19312:0,19313:0,19321:0,19322:0,19324:0,19329:0,19421:0,321501:0,321502:0,321504:0,321505:0,321529:0,357757:0,691962:0,691963:0,691963:18;XREF GT:DP:AD:SB:XAAD:AL 1/1:11:0,11:0,0,0,11:11:-1e+100,-1.49947 1 . 30052145 837884_837817_837816_837811_1048822_1048819 TTGTCTCTTTTGATCTTTGTTGGTTTAAAGTCTGTTTTA . -0 . DP=27;XSEE=19322:0,19328:0,19329:0,19421:0,321509:0,321514:0,321529:0,691962:0,691963:0,691963:18;XREF GT:DP:AD:SB:XAAD:AL 0/0:27:0,27:0,0,0,27:27:-1e+100,-1.60631

First, what are the IDs (837884_837817_837813_837811_1048822_1048819)? I initially thought they were node IDs, but they don't seem to exist when I try vg find -n 837884 -c 2 -x graph.xg | vg view - (I get an Assertionidx < this->size()' failed.` error, which I've gotten before when searching for a non-existent node https://github.com/vgteam/vg/issues/364).

Second, should the same position, ID, ref and alt be listed more than once? I haven't yet managed to get decomposing block subs or normalization through vt to work yet, so I'm not sure if this would clear up some of the duplications.

(Relatedly, the readme says vt is found at http://varianttools. sourceforge.net/Association/HomePage, but I'm assuming this should be http://genome.sph.umich.edu/wiki/Vt ?)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7qmv7RlYZdBLjs4nuFkW5j5Vp_2dks5qnyakgaJpZM4J3Wq2 .

hgibling commented 8 years ago

Hi Glenn,

Thanks for the clarification on the IDs--I see them now.

I was using 1.4.0-1130-gc77e33c. I updated to 1.4.0-1269-g2095926, but now cannot get call to complete:

Node 1237919 is on ref path but not traced in site 1237918 fwd to 1237921 fwd
terminate called after throwing an instance of 'std::runtime_error'
  what():  Extra ref node found

Node 1237919 can be found in the alternate graph but not the original, and I'm left with a truncated vcf (2373 entries compared to 44411 when I use v.1.4.0-1130). What's happening here? (I can share files if needed, but they're too big to attach here.)

glennhickey commented 8 years ago

This may be another known issue (this time in the mapper). But the fix is in a PR that hasn't made it to master yet https://github.com/vgteam/vg/pull/476

@adamnovak do you think that's what's going on?

We (UCSC) are using this frozen version of vg, which I should have shared in my last message

https://github.com/adamnovak/vg/tree/experiment-freeze

for the time being to avoid these moving parts-related issues as we finish up results for a paper. If you want to somehow share your data and command lines, I'd be happy to try it out with this version.

On Wed, Sep 14, 2016 at 4:34 PM, Heather Gibling notifications@github.com wrote:

Hi Glenn,

Thanks for the clarification on the IDs--I see them now.

I was using 1.4.0-1130-gc77e33c. I updated to 1.4.0-1269-g2095926, but now cannot get call to complete:

Node 1237919 is on ref path but not traced in site 1237918 fwd to 1237921 fwd terminate called after throwing an instance of 'std::runtime_error' what(): Extra ref node found

Node 1237919 can be found in the alternate graph but not the original, and I'm left with a truncated vcf (2373 entries compared to 44411 when I use v.1.4.0-1130). What's happening here? (I can share files if needed, but they're too big to attach here.)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-247144685, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7u6EyhQeh9nayFKZHHVR7Ia4Qbirks5qqFpbgaJpZM4J3Wq2 .

adamnovak commented 8 years ago

I don't know why it's happening, and I don't know if #476 (which I just merged in) will fix it, but it sounds like you have a site where the reference path is somehow hitting it twice? Or where maybe part of the reference is being skipped when the reference path is traced out?

Can you pull out the augmented graph with the -A aug.vg option, index it with vg index -x aug.xg aug.vg, and then plot the region around the problem node with vg find -n 1237919 -c 10 -x aug.xg | vg view -dn - | dot -Tsvg -o problem.svg, and then show us the plot? It ought to show the weird extra reference node, the 10-hop region around it, and where the reference path falls in that region, and ought to help make the problem clearer.

hgibling commented 8 years ago

1237919 is there and looks like the ref hits it once (ref is 🗟):

problem.svg.gz

(with paths:) problem-paths.svg.gz

glennhickey commented 8 years ago

Thanks. The icons don't work for me but as far as I can see from that image, path "ref" goes through 1237920 and not 1237919. So it looks like something's up with this path (either branching out or cycling). Maybe vg validate (on both your original and augmented graph) will catch it. Still happy to try reproducing if you can somehow share the input to vg call.

On Thu, Sep 15, 2016 at 5:26 PM, Heather Gibling notifications@github.com wrote:

1237919 is there and looks like the ref hits it once (ref is 🗟):

problem.svg.gz https://github.com/vgteam/vg/files/475618/problem.svg.gz

(with paths:) problem-paths.svg.gz https://github.com/vgteam/vg/files/475623/problem-paths.svg.gz

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-247460288, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7kcYNHhg0nGgxRcvq8MEWnv-ldchks5qqbghgaJpZM4J3Wq2 .

hgibling commented 8 years ago

Whoops, here's a partial screenshot. ref is the... circled classified ad on a newspaper? It looks like it hits both 1237920 and 1237919.

screenshot from 2016-09-16 11-12-57

Both the original and augmented graphs validated fine. That would be great if you could take a look. I've emailed you a link to a tarball with the pileup and original graph, original graph index, and the augmented graph I got from vg call -A 1269-aug.vg -o 28477796 MHC-ref-haps-n.vg 1269.pileup > 1269.vcf

glennhickey commented 8 years ago

Thanks! I got the file and can reproduce your error. Will take a look...

On Fri, Sep 16, 2016 at 4:10 PM, Heather Gibling notifications@github.com wrote:

Whoops, here's a partial screenshot. ref is the... circled classified ad on a newspaper? It looks like it hits both 1237920 and 1237919.

[image: screenshot from 2016-09-16 11-12-57] https://cloud.githubusercontent.com/assets/7570351/18599163/ea91ccba-7c24-11e6-82e8-2a4bb8fc48e6.png

Both the original and augmented graphs validated fine. That would be great if you could take a look. I've emailed you a link to a tarball with the pileup and original graph, original graph index, and the augmented graph I got from vg call -A 1269-aug.vg -o 28477796 MHC-ref-haps-n.vg 1269.pileup > 1269.vcf

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-247696556, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7q04g-rWwiYt0276gZx9dwYPK8NRks5qqvewgaJpZM4J3Wq2 .

glennhickey commented 8 years ago

Both nodes are indeed in the ref path, which seems to take a big loop through that area twice

vg view 1269-aug.vg -j | jq .path | jq 'map(select(.name == "ref"))' | grep node | grep -A 3 -B 3 -n '1237919|1237920'

717632- "node_id": 1237916 717633- "node_id": 1237917 717634- "node_id": 1237918 _717635: "nodeid": 1237920 717636- "node_id": 1237921 717637- "node_id": 1237923

717638- "node_id": 349035

726903- "node_id": 1237916 726904- "node_id": 1237917 726905- "node_id": 1237918 _726906: "nodeid": 1237919 726907- "node_id": 1237921 726908- "node_id": 1237923 726909- "node_id": 1237924

On Fri, Sep 16, 2016 at 4:30 PM, Glenn Hickey glenn.hickey@gmail.com wrote:

Thanks! I got the file and can reproduce your error. Will take a look...

On Fri, Sep 16, 2016 at 4:10 PM, Heather Gibling <notifications@github.com

wrote:

Whoops, here's a partial screenshot. ref is the... circled classified ad on a newspaper? It looks like it hits both 1237920 and 1237919.

[image: screenshot from 2016-09-16 11-12-57] https://cloud.githubusercontent.com/assets/7570351/18599163/ea91ccba-7c24-11e6-82e8-2a4bb8fc48e6.png

Both the original and augmented graphs validated fine. That would be great if you could take a look. I've emailed you a link to a tarball with the pileup and original graph, original graph index, and the augmented graph I got from vg call -A 1269-aug.vg -o 28477796 MHC-ref-haps-n.vg 1269.pileup > 1269.vcf

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-247696556, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7q04g-rWwiYt0276gZx9dwYPK8NRks5qqvewgaJpZM4J3Wq2 .

glennhickey commented 8 years ago

Will set a few more tests going but probably won't be able to say much before monday...

On Fri, Sep 16, 2016 at 4:48 PM, Glenn Hickey glenn.hickey@gmail.com wrote:

Both nodes are indeed in the ref path, which seems to take a big loop through that area twice

vg view 1269-aug.vg -j | jq .path | jq 'map(select(.name == "ref"))' | grep node | grep -A 3 -B 3 -n '1237919|1237920'

717632- "node_id": 1237916 717633- "node_id": 1237917 717634- "node_id": 1237918 _717635: "nodeid": 1237920 717636- "node_id": 1237921 717637- "node_id": 1237923

717638- "node_id": 349035

726903- "node_id": 1237916 726904- "node_id": 1237917 726905- "node_id": 1237918 _726906: "nodeid": 1237919 726907- "node_id": 1237921 726908- "node_id": 1237923 726909- "node_id": 1237924

On Fri, Sep 16, 2016 at 4:30 PM, Glenn Hickey glenn.hickey@gmail.com wrote:

Thanks! I got the file and can reproduce your error. Will take a look...

On Fri, Sep 16, 2016 at 4:10 PM, Heather Gibling < notifications@github.com> wrote:

Whoops, here's a partial screenshot. ref is the... circled classified ad on a newspaper? It looks like it hits both 1237920 and 1237919.

[image: screenshot from 2016-09-16 11-12-57] https://cloud.githubusercontent.com/assets/7570351/18599163/ea91ccba-7c24-11e6-82e8-2a4bb8fc48e6.png

Both the original and augmented graphs validated fine. That would be great if you could take a look. I've emailed you a link to a tarball with the pileup and original graph, original graph index, and the augmented graph I got from vg call -A 1269-aug.vg -o 28477796 MHC-ref-haps-n.vg 1269.pileup > 1269.vcf

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-247696556, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7q04g-rWwiYt0276gZx9dwYPK8NRks5qqvewgaJpZM4J3Wq2 .

hgibling commented 8 years ago

No worries, thanks for taking the time to look!

glennhickey commented 8 years ago

Just taking a look at this again, and pretty sure it's the cycles that are throwing off vg call. Specifically, vg call operates under the assumption that the reference path is acyclic. So any graph made with vg construct should work. Mind if I ask how you made your graph?

Adding support for cycles here shouldn't be too big of an undertaking, but it will have to wait until we get a few other things fixed up.

On Fri, Sep 16, 2016 at 4:54 PM, Heather Gibling notifications@github.com wrote:

No worries, thanks for taking the time to look!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-247706764, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7gEy0HCoakUrO18F1iFyzjXhK9wTks5qqwIegaJpZM4J3Wq2 .

ekg commented 8 years ago

vg mod can unroll the graph so it becomes acyclic with sequences in cycles preserved up to some length (k).

On Mon, Sep 19, 2016, 21:46 Glenn Hickey notifications@github.com wrote:

Just taking a look at this again, and pretty sure it's the cycles that are throwing off vg call. Specifically, vg call operates under the assumption that the reference path is acyclic. So any graph made with vg construct should work. Mind if I ask how you made your graph?

Adding support for cycles here shouldn't be too big of an undertaking, but it will have to wait until we get a few other things fixed up.

On Fri, Sep 16, 2016 at 4:54 PM, Heather Gibling <notifications@github.com

wrote:

No worries, thanks for taking the time to look!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-247706764, or mute the thread < https://github.com/notifications/unsubscribe-auth/AA2_7gEy0HCoakUrO18F1iFyzjXhK9wTks5qqwIegaJpZM4J3Wq2

.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-248101901, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EXzQOMCjDO8IGPrYpFK5rKBKQBmPks5qruZ9gaJpZM4J3Wq2 .

adamnovak commented 8 years ago

Does it unspool embedded paths in a reasonable way?

On Mon, Sep 19, 2016 at 2:00 PM, Erik Garrison notifications@github.com wrote:

vg mod can unroll the graph so it becomes acyclic with sequences in cycles preserved up to some length (k).

On Mon, Sep 19, 2016, 21:46 Glenn Hickey notifications@github.com wrote:

Just taking a look at this again, and pretty sure it's the cycles that are throwing off vg call. Specifically, vg call operates under the assumption that the reference path is acyclic. So any graph made with vg construct should work. Mind if I ask how you made your graph?

Adding support for cycles here shouldn't be too big of an undertaking, but it will have to wait until we get a few other things fixed up.

On Fri, Sep 16, 2016 at 4:54 PM, Heather Gibling < notifications@github.com

wrote:

No worries, thanks for taking the time to look!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-247706764, or mute the thread < https://github.com/notifications/unsubscribe-auth/AA2_ 7gEy0HCoakUrO18F1iFyzjXhK9wTks5qqwIegaJpZM4J3Wq2

.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-248101901, or mute the thread https://github.com/notifications/unsubscribe-auth/ AAI4EXzQOMCjDO8IGPrYpFK5rKBKQBmPks5qruZ9gaJpZM4J3Wq2

.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-248124971, or mute the thread https://github.com/notifications/unsubscribe-auth/AE0_Xxl9jyZ1bASJ9bBy6O0vdmJ1CudPks5qrvfUgaJpZM4J3Wq2 .

ekg commented 8 years ago

It could if we forced it to unroll all loops in any paths.

Right now we would need to realign against the graph unrolled to the maximum cycle length in any of the paths. This is tricky.

On Mon, Sep 19, 2016, 23:02 adamnovak notifications@github.com wrote:

Does it unspool embedded paths in a reasonable way?

On Mon, Sep 19, 2016 at 2:00 PM, Erik Garrison notifications@github.com wrote:

vg mod can unroll the graph so it becomes acyclic with sequences in cycles preserved up to some length (k).

On Mon, Sep 19, 2016, 21:46 Glenn Hickey notifications@github.com wrote:

Just taking a look at this again, and pretty sure it's the cycles that are throwing off vg call. Specifically, vg call operates under the assumption that the reference path is acyclic. So any graph made with vg construct should work. Mind if I ask how you made your graph?

Adding support for cycles here shouldn't be too big of an undertaking, but it will have to wait until we get a few other things fixed up.

On Fri, Sep 16, 2016 at 4:54 PM, Heather Gibling < notifications@github.com

wrote:

No worries, thanks for taking the time to look!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-247706764, or mute the thread < https://github.com/notifications/unsubscribe-auth/AA2_ 7gEy0HCoakUrO18F1iFyzjXhK9wTks5qqwIegaJpZM4J3Wq2

.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-248101901, or mute the thread https://github.com/notifications/unsubscribe-auth/ AAI4EXzQOMCjDO8IGPrYpFK5rKBKQBmPks5qruZ9gaJpZM4J3Wq2

.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-248124971, or mute the thread < https://github.com/notifications/unsubscribe-auth/AE0_Xxl9jyZ1bASJ9bBy6O0vdmJ1CudPks5qrvfUgaJpZM4J3Wq2

.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-248125542, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EQDwpt9dVjFPVcwSh79MeOum7JI4ks5qrvhagaJpZM4J3Wq2 .

ekg commented 8 years ago

Code will have to be written to get the path maintenance through the unroll right.

On Mon, Sep 19, 2016, 23:21 Erik Garrison erik.garrison@gmail.com wrote:

It could if we forced it to unroll all loops in any paths.

Right now we would need to realign against the graph unrolled to the maximum cycle length in any of the paths. This is tricky.

On Mon, Sep 19, 2016, 23:02 adamnovak notifications@github.com wrote:

Does it unspool embedded paths in a reasonable way?

On Mon, Sep 19, 2016 at 2:00 PM, Erik Garrison notifications@github.com wrote:

vg mod can unroll the graph so it becomes acyclic with sequences in cycles preserved up to some length (k).

On Mon, Sep 19, 2016, 21:46 Glenn Hickey notifications@github.com wrote:

Just taking a look at this again, and pretty sure it's the cycles that are throwing off vg call. Specifically, vg call operates under the assumption that the reference path is acyclic. So any graph made with vg construct should work. Mind if I ask how you made your graph?

Adding support for cycles here shouldn't be too big of an undertaking, but it will have to wait until we get a few other things fixed up.

On Fri, Sep 16, 2016 at 4:54 PM, Heather Gibling < notifications@github.com

wrote:

No worries, thanks for taking the time to look!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-247706764, or mute the thread < https://github.com/notifications/unsubscribe-auth/AA2_ 7gEy0HCoakUrO18F1iFyzjXhK9wTks5qqwIegaJpZM4J3Wq2

.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-248101901, or mute the thread https://github.com/notifications/unsubscribe-auth/ AAI4EXzQOMCjDO8IGPrYpFK5rKBKQBmPks5qruZ9gaJpZM4J3Wq2

.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-248124971, or mute the thread < https://github.com/notifications/unsubscribe-auth/AE0_Xxl9jyZ1bASJ9bBy6O0vdmJ1CudPks5qrvfUgaJpZM4J3Wq2

.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-248125542, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EQDwpt9dVjFPVcwSh79MeOum7JI4ks5qrvhagaJpZM4J3Wq2 .

hgibling commented 8 years ago

So any graph made with vg construct should work. Mind if I ask how you made your graph?

With construct and msga:

vg construct -R 6:28510120-33480577 -t 16 -m 32 -v SNPs.vcf -r ref-seq.fa > MHC-ref.vg
vg mod -D MHC-ref.vg > MHC-ref-nopath.vg
vg msga -g MHC-ref-nopath.vg -f MHC-haps.fa -B 128 -K 11 -X 2 -E 3 -G -S 0.95 -H 5 -z -D -t 16 > MHC-ref-haps.vg
vg mod -n MHC-ref-haps.vg > MHC-ref-haps-n.vg

vg call operates under the assumption that the reference path is acyclic.

This is good to know. (I built a cyclic graph for another project, but I don't think I'll be calling--just mapping.)

ekg commented 8 years ago

You could try to control the amount of cycling using changes to the way that msga runs. If you are on a new version msga will use the "MEM threader" which clusters then walks through the MEMs using the positional index provided by the paths to make clusters. This can generally tolerate larger bandwidth -B setting. Setting -B higher will decrease the likelihood of cyclic structures in the graph, up to the limit of -B being the length of the input sequences. At that point the output is equivalent to an old fashioned MSA, and will be a DAG.

It's good to know that you have been able to use a cyclic graph for mapping. Variant calling on them should be possible but we have this problem about the unroll not respecting the paths.

On Tue, Sep 20, 2016, 05:36 Heather Gibling notifications@github.com wrote:

So any graph made with vg construct should work. Mind if I ask how you made your graph?

With construct and msga:

vg construct -R 6:28510120-33480577 -t 16 -m 32 -v SNPs.vcf -r ref-seq.fa > MHC-ref.vg vg mod -D MHC-ref.vg > MHC-ref-nopath.vg vg msga -g MHC-ref-nopath.vg -f MHC-haps.fa -B 128 -K 11 -X 2 -E 3 -G -S 0.95 -H 5 -z -D -t 16 > MHC-ref-haps.vg vg mod -n MHC-ref-haps.vg > MHC-ref-haps-n.vg

vg call operates under the assumption that the reference path is acyclic.

This is good to know. (I built a cyclic graph for another project, but I don't think I'll be calling--just mapping.)

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-248192771, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EaqIb1FIjwKVgMoHw-qfhE6AuJzVks5qr1SdgaJpZM4J3Wq2 .

glennhickey commented 8 years ago

Without being too familiar with vg msga, is removing the vg mod -D step an option?

That way you'd keep the reference path from ref-seq.fa built in vg construct which, presumably, you want any output VCF based on anyway. Then so long as that path isn't folded up by vg msga, you should be good to go. I imagine this would entail removing the reference path from MHC-haps.fa as well.

On Tue, Sep 20, 2016 at 1:37 AM, Erik Garrison notifications@github.com wrote:

You could try to control the amount of cycling using changes to the way that msga runs. If you are on a new version msga will use the "MEM threader" which clusters then walks through the MEMs using the positional index provided by the paths to make clusters. This can generally tolerate larger bandwidth -B setting. Setting -B higher will decrease the likelihood of cyclic structures in the graph, up to the limit of -B being the length of the input sequences. At that point the output is equivalent to an old fashioned MSA, and will be a DAG.

It's good to know that you have been able to use a cyclic graph for mapping. Variant calling on them should be possible but we have this problem about the unroll not respecting the paths.

On Tue, Sep 20, 2016, 05:36 Heather Gibling notifications@github.com wrote:

So any graph made with vg construct should work. Mind if I ask how you made your graph?

With construct and msga:

vg construct -R 6:28510120-33480577 -t 16 -m 32 -v SNPs.vcf -r ref-seq.fa > MHC-ref.vg vg mod -D MHC-ref.vg > MHC-ref-nopath.vg vg msga -g MHC-ref-nopath.vg -f MHC-haps.fa -B 128 -K 11 -X 2 -E 3 -G -S 0.95 -H 5 -z -D -t 16 > MHC-ref-haps.vg vg mod -n MHC-ref-haps.vg > MHC-ref-haps-n.vg

vg call operates under the assumption that the reference path is acyclic.

This is good to know. (I built a cyclic graph for another project, but I don't think I'll be calling--just mapping.)

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-248192771, or mute the thread https://github.com/notifications/unsubscribe- auth/AAI4EaqIb1FIjwKVgMoHw-qfhE6AuJzVks5qr1SdgaJpZM4J3Wq2 .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-248206289, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7uobSB_k7wSbYIprT1T3_ST1O6BFks5qr3EdgaJpZM4J3Wq2 .

ekg commented 8 years ago

Here's a suggested process that should remove the cycles and make a more-linear graph:

vg construct -R 6:28510120-33480577 -t 16 -m 32 -v SNPs.vcf -r ref-seq.fa > MHC-ref.vg
vg mod -D MHC-ref.vg > MHC-ref-nopath.vg
vg msga -g MHC-ref-nopath.vg -f MHC-haps.fa -B 10000 -K 11 -X 2 -D -t 16 > MHC-ref-haps.vg
vg mod -b MHC-ref-haps.vg | vg mod -U - > MHC-ref-haps-n.vg

Changes. You could try one or both of these:

  1. Use 10k bands for the mapper in MSGA (larger is now possible as with the MEM-walking aligner we don't have to try to fill out the whole alignment matrix). This should cut down on weird looping alignments.
  2. Cut cycles (and inversions) in the graph. As Adam noted, I think this is going to make some paths invalid, but it's worth a try.

There are a few changes to the mapper mode used in MSGA ("mem-threading") that are about to land. Bugfix stuff, so test this today with caution. I'll ping back when those are in master.

@hgibling thanks for all the feedback and help. Let's get this to work.

hgibling commented 7 years ago

Sorry for the delay (had to focus on a side project for a bit).

I'm trying Erik's suggested process with newer versions of vg, but cannot get the msga to work:

...
LCPArray::LCPArray(): Construction: 0.593201 seconds, 15.1799 GB
LCPArray::LCPArray(): 19442735 values at 6 levels (branching factor 64)
gi224183341|gb|GL000251.1|: adding to graph1
gi224183341|gb|GL000251.1|: aligning sequence of 4795370bp against 891887 nodes
no alignments returned from walk match with I_2TCTGGCCTGGGAGTCACAGCT 24 1+12
vg: src/mapper.cpp:2142: vg::Alignment vg::Mapper::walk_match(const string&, vg::pos_t): Assertion `false' failed.

^I get this using both v1.4.0-1414-g72c1c30 and v1.4.0-1318-gf273f81. Are there issues with the mem-threading? I also tried with -B 1000 but received the same error.

Using the original arguments I had to build the graph also fails:

LCPArray::LCPArray(): Construction: 0.475991 seconds, 6.84095 GB
LCPArray::LCPArray(): 17017201 values at 5 levels (branching factor 64)
gi224183341|gb|GL000251.1|: adding to graph1
gi224183341|gb|GL000251.1|: aligning sequence of 4795370bp against 891956 nodes
no alignments returned from walk match with I_2TCTGGCCTGGGAGTCACAG 22 1+12
vg: src/mapper.cpp:2142: vg::Alignment vg::Mapper::walk_match(const string&, vg::pos_t): Assertion `false' failed.

Ideas on what's happening?

ekg commented 7 years ago

There is a problem with the mem threading in master. I am trying to sync my branch into master today but meeting heavy resistance from git.

On Mon, Nov 14, 2016, 20:59 Heather Gibling notifications@github.com wrote:

Sorry for the delay (had to focus on a side project for a bit).

I'm trying Erik's suggested process with newer versions of vg, but cannot get the msga to work:

... LCPArray::LCPArray(): Construction: 0.593201 seconds, 15.1799 GB LCPArray::LCPArray(): 19442735 values at 6 levels (branching factor 64) gi224183341|gb|GL000251.1|: adding to graph1 gi224183341|gb|GL000251.1|: aligning sequence of 4795370bp against 891887 nodes no alignments returned from walk match with I_2TCTGGCCTGGGAGTCACAGCT 24 1+12 vg: src/mapper.cpp:2142: vg::Alignment vg::Mapper::walk_match(const string&, vg::pos_t): Assertion `false' failed.

^I get this using both v1.4.0-1414-g72c1c30 and v1.4.0-1318-gf273f81. Are there issues with the mem-threading? I also tried with -B 1000 but received the same error.

Using the original arguments I had to build the graph also fails:

LCPArray::LCPArray(): Construction: 0.475991 seconds, 6.84095 GB LCPArray::LCPArray(): 17017201 values at 5 levels (branching factor 64) gi224183341|gb|GL000251.1|: adding to graph1 gi224183341|gb|GL000251.1|: aligning sequence of 4795370bp against 891956 nodes no alignments returned from walk match with I_2TCTGGCCTGGGAGTCACAG 22 1+12 vg: src/mapper.cpp:2142: vg::Alignment vg::Mapper::walk_match(const string&, vg::pos_t): Assertion `false' failed.

Ideas on what's happening?

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-260445100, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4ETvsiitMaKLjPKfecQt9Ku551Ibyks5q-L2fgaJpZM4J3Wq2 .

adamnovak commented 7 years ago

@ekg has merged #493 and so I think the MEM threader is in now.

ekg commented 7 years ago

It is not set by default yet. We are still working on some updates to gssw and the alignment pinning that are required to exceed the performance of the default mode even with high error rates.

On Fri, Nov 25, 2016, 19:21 Adam Novak notifications@github.com wrote:

@ekg https://github.com/ekg has merged #493 https://github.com/vgteam/vg/pull/493 and so I think the MEM threader is in now.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-263007746, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4Eecrnevk7LEmMuTNS9KOzEIFKKzhks5rBydGgaJpZM4J3Wq2 .

hgibling commented 7 years ago

It's been a while, but I'm trying this again with v1.4.0-2212:

vg construct -R 6:28477797-33448354 -t 8 -m 32 -v chr6.vcf.gz -r ref-seq.fa -f > MHC-ref.vg
vg mod -D MHC-ref.vg > MHC-ref-nopath.vg
vg msga -g MHC-ref-nopath.vg -f MHC-haps.fa -B 1000 -K 11 -X 2 -D -t 8 > MHC-ref-haps.vg

I tried with -B 10000 but that was using more memory than our cluster could handle. Not sure what to make of the error:

preparing initial graph
building xg index
building GCSA2 index
InputGraph::InputGraph(): 15565700 kmers in 1 file(s)
InputGraph::read(): Read 15565700 11-mers from .vg-kmers-tmp-XXcBpgjR
InputGraph::readKeys(): 3082364 unique keys
InputGraph::read(): Read 15565700 11-mers from .vg-kmers-tmp-XXcBpgjR
InputGraph::readFrom(): 10294918 unique from nodes
InputGraph::read(): Read 15565700 11-mers from .vg-kmers-tmp-XXcBpgjR
PathGraph::PathGraph(): 15565700 paths with 31131400 ranks
PathGraph::PathGraph(): 0.463894 GB in 1 file(s)
GCSA::GCSA(): Preprocessing: 10.4081 seconds, 1.93951 GB
GCSA::GCSA(): Prefix-doubling from path length 11
GCSA::GCSA(): Step 1 (path length 11 -> 22)
PathGraph::prune(): 15565700 -> 15511623 paths (3076855 ranges)
PathGraph::prune(): 794024 unique, 0 redundant, 14717599 unsorted, 0 nondeterministic paths
PathGraph::prune(): 0.462282 GB in 1 file(s)
PathGraph::read(): File 0: Read 15511623 order-11 paths
PathGraph::extend(): File 0: Created 28208516 order-22 paths
PathGraph::read(): File 0: Read 28208516 order-22 paths
PathGraphBuilder::sort(): File 0: Sorted 28208516 paths
PathGraph::extend(): 15511623 -> 28208516 paths (83831524 ranks)
PathGraph::extend(): 0.942806 GB in 1 file(s)
GCSA::GCSA(): Step 2 (path length 22 -> 44)
PathGraph::prune(): 28208516 -> 17759252 paths (15577856 ranges)
PathGraph::prune(): 14756138 unique, 0 redundant, 2992751 unsorted, 10363 nondeterministic paths
PathGraph::prune(): 0.592467 GB in 1 file(s)
PathGraph::read(): File 0: Read 17759252 order-22 paths
PathGraph::extend(): File 0: Created 22817358 order-44 paths
PathGraph::read(): File 0: Read 22817358 order-44 paths
PathGraphBuilder::sort(): File 0: Sorted 22817358 paths
PathGraph::extend(): 17759252 -> 22817358 paths (83540499 ranks)
PathGraph::extend(): 0.82122 GB in 1 file(s)
GCSA::GCSA(): Prefix-doubling: 47.4414 seconds, 2.52268 GB
GCSA::GCSA(): Merging the paths
MergedGraph::MergedGraph(): 18761721 paths with 62861414 ranks and 837433 additional from nodes
MergedGraph::MergedGraph(): 0.683486 GB
GCSA::GCSA(): Merging: 11.5354 seconds, 2.52268 GB
GCSA::GCSA(): Building the index
GCSA::GCSA(): Construction: 26.3434 seconds, 2.52268 GB
GCSA::GCSA(): 18761721 paths, 19802970 edges
GCSA::GCSA(): 19599154 pointers (9304236 redundant)
GCSA::GCSA(): 3187639 samples at 2790108 positions
LCPArray::LCPArray(): Construction: 0.673589 seconds, 2.52268 GB
LCPArray::LCPArray(): 19059529 values at 6 levels (branching factor 64)
ref: adding to graph1
ref: aligning sequence of 4970557bp against 573837 nodes
ref: editing graph
ref: sorting and compacting ids
building xg index
building GCSA2 index
InputGraph::InputGraph(): 15573032 kmers in 1 file(s)
InputGraph::read(): Read 15573032 11-mers from .vg-kmers-tmp-XX0gfqTC
InputGraph::readKeys(): 3082757 unique keys
InputGraph::read(): Read 15573032 11-mers from .vg-kmers-tmp-XX0gfqTC
InputGraph::readFrom(): 10295018 unique from nodes
InputGraph::read(): Read 15573032 11-mers from .vg-kmers-tmp-XX0gfqTC
PathGraph::PathGraph(): 15573032 paths with 31146064 ranks
PathGraph::PathGraph(): 0.464113 GB in 1 file(s)
GCSA::GCSA(): Preprocessing: 10.0734 seconds, 4.05677 GB
GCSA::GCSA(): Prefix-doubling from path length 11
GCSA::GCSA(): Step 1 (path length 11 -> 22)
PathGraph::prune(): 15573032 -> 15518875 paths (3077248 ranges)
PathGraph::prune(): 793893 unique, 0 redundant, 14724982 unsorted, 0 nondeterministic paths
PathGraph::prune(): 0.462499 GB in 1 file(s)
PathGraph::read(): File 0: Read 15518875 order-11 paths
PathGraph::extend(): File 0: Created 28236257 order-22 paths
PathGraph::read(): File 0: Read 28236257 order-22 paths
PathGraphBuilder::sort(): File 0: Sorted 28236257 paths
PathGraph::extend(): 15518875 -> 28236257 paths (83914878 ranks)
PathGraph::extend(): 0.943737 GB in 1 file(s)
GCSA::GCSA(): Step 2 (path length 22 -> 44)
PathGraph::prune(): 28236257 -> 17773107 paths (15585140 ranges)
PathGraph::prune(): 14761744 unique, 0 redundant, 3000924 unsorted, 10439 nondeterministic paths
PathGraph::prune(): 0.592932 GB in 1 file(s)
PathGraph::read(): File 0: Read 17773107 order-22 paths
PathGraph::extend(): File 0: Created 22877942 order-44 paths
PathGraph::read(): File 0: Read 22877942 order-44 paths
PathGraphBuilder::sort(): File 0: Sorted 22877942 paths
PathGraph::extend(): 17773107 -> 22877942 paths (83831103 ranks)
PathGraph::extend(): 0.823657 GB in 1 file(s)
GCSA::GCSA(): Prefix-doubling: 46.9834 seconds, 4.09481 GB
GCSA::GCSA(): Merging the paths
MergedGraph::MergedGraph(): 18772860 paths with 62904019 ranks and 843401 additional from nodes
MergedGraph::MergedGraph(): 0.683993 GB
GCSA::GCSA(): Merging: 9.54995 seconds, 4.09481 GB
GCSA::GCSA(): Building the index
GCSA::GCSA(): Construction: 21.8281 seconds, 4.09481 GB
GCSA::GCSA(): 18772860 paths, 19816094 edges
GCSA::GCSA(): 19616261 pointers (9321243 redundant)
GCSA::GCSA(): 3594587 samples at 3165728 positions
LCPArray::LCPArray(): Construction: 0.677416 seconds, 4.09481 GB
LCPArray::LCPArray(): 19070845 values at 6 levels (branching factor 64)
gi224183346|gb|GL000256.1|: adding to graph1
gi224183346|gb|GL000256.1|: aligning sequence of 4928568bp against 705332 nodes
terminate called after throwing an instance of 'std::runtime_error'
  what():  Cannot simplify Mapping with no position; must update position when removing leading deletions
./msga-MHC-ref-nopath.sh: line 8: 27895 Aborted                 (core dumped)
adamnovak commented 7 years ago

Does your build include < https://github.com/vgteam/vg/commit/d0a0ceac6244551b512f28f63b3b9a11edfd4ae6>? That commit ought to have fixed this issue, which we used to get when chunks of long reads don't align to anything.

On Wed, Mar 29, 2017 at 10:59 AM, Heather Gibling notifications@github.com wrote:

It's been a while, but I'm trying this again with v1.4.0-2212:

vg construct -R 6:28477797-33448354 -t 8 -m 32 -v chr6.vcf.gz -r ref-seq.fa -f > MHC-ref.vg vg mod -D MHC-ref.vg > MHC-ref-nopath.vg vg msga -g MHC-ref-nopath.vg -f MHC-haps.fa -B 1000 -K 11 -X 2 -D -t 8 > MHC-ref-haps.vg

I tried with -B 10000 but that was using more memory than our cluster could handle. Not sure what to make of the error:

preparing initial graph building xg index building GCSA2 index InputGraph::InputGraph(): 15565700 kmers in 1 file(s) InputGraph::read(): Read 15565700 11-mers from .vg-kmers-tmp-XXcBpgjR InputGraph::readKeys(): 3082364 unique keys InputGraph::read(): Read 15565700 11-mers from .vg-kmers-tmp-XXcBpgjR InputGraph::readFrom(): 10294918 unique from nodes InputGraph::read(): Read 15565700 11-mers from .vg-kmers-tmp-XXcBpgjR PathGraph::PathGraph(): 15565700 paths with 31131400 ranks PathGraph::PathGraph(): 0.463894 GB in 1 file(s) GCSA::GCSA(): Preprocessing: 10.4081 seconds, 1.93951 GB GCSA::GCSA(): Prefix-doubling from path length 11 GCSA::GCSA(): Step 1 (path length 11 -> 22) PathGraph::prune(): 15565700 -> 15511623 paths (3076855 ranges) PathGraph::prune(): 794024 unique, 0 redundant, 14717599 unsorted, 0 nondeterministic paths PathGraph::prune(): 0.462282 GB in 1 file(s) PathGraph::read(): File 0: Read 15511623 order-11 paths PathGraph::extend(): File 0: Created 28208516 order-22 paths PathGraph::read(): File 0: Read 28208516 order-22 paths PathGraphBuilder::sort(): File 0: Sorted 28208516 paths PathGraph::extend(): 15511623 -> 28208516 paths (83831524 ranks) PathGraph::extend(): 0.942806 GB in 1 file(s) GCSA::GCSA(): Step 2 (path length 22 -> 44) PathGraph::prune(): 28208516 -> 17759252 paths (15577856 ranges) PathGraph::prune(): 14756138 unique, 0 redundant, 2992751 unsorted, 10363 nondeterministic paths PathGraph::prune(): 0.592467 GB in 1 file(s) PathGraph::read(): File 0: Read 17759252 order-22 paths PathGraph::extend(): File 0: Created 22817358 order-44 paths PathGraph::read(): File 0: Read 22817358 order-44 paths PathGraphBuilder::sort(): File 0: Sorted 22817358 paths PathGraph::extend(): 17759252 -> 22817358 paths (83540499 ranks) PathGraph::extend(): 0.82122 GB in 1 file(s) GCSA::GCSA(): Prefix-doubling: 47.4414 seconds, 2.52268 GB GCSA::GCSA(): Merging the paths MergedGraph::MergedGraph(): 18761721 paths with 62861414 ranks and 837433 additional from nodes MergedGraph::MergedGraph(): 0.683486 GB GCSA::GCSA(): Merging: 11.5354 seconds, 2.52268 GB GCSA::GCSA(): Building the index GCSA::GCSA(): Construction: 26.3434 seconds, 2.52268 GB GCSA::GCSA(): 18761721 paths, 19802970 edges GCSA::GCSA(): 19599154 pointers (9304236 redundant) GCSA::GCSA(): 3187639 samples at 2790108 positions LCPArray::LCPArray(): Construction: 0.673589 seconds, 2.52268 GB LCPArray::LCPArray(): 19059529 values at 6 levels (branching factor 64) ref: adding to graph1 ref: aligning sequence of 4970557bp against 573837 nodes ref: editing graph ref: sorting and compacting ids building xg index building GCSA2 index InputGraph::InputGraph(): 15573032 kmers in 1 file(s) InputGraph::read(): Read 15573032 11-mers from .vg-kmers-tmp-XX0gfqTC InputGraph::readKeys(): 3082757 unique keys InputGraph::read(): Read 15573032 11-mers from .vg-kmers-tmp-XX0gfqTC InputGraph::readFrom(): 10295018 unique from nodes InputGraph::read(): Read 15573032 11-mers from .vg-kmers-tmp-XX0gfqTC PathGraph::PathGraph(): 15573032 paths with 31146064 ranks PathGraph::PathGraph(): 0.464113 GB in 1 file(s) GCSA::GCSA(): Preprocessing: 10.0734 seconds, 4.05677 GB GCSA::GCSA(): Prefix-doubling from path length 11 GCSA::GCSA(): Step 1 (path length 11 -> 22) PathGraph::prune(): 15573032 -> 15518875 paths (3077248 ranges) PathGraph::prune(): 793893 unique, 0 redundant, 14724982 unsorted, 0 nondeterministic paths PathGraph::prune(): 0.462499 GB in 1 file(s) PathGraph::read(): File 0: Read 15518875 order-11 paths PathGraph::extend(): File 0: Created 28236257 order-22 paths PathGraph::read(): File 0: Read 28236257 order-22 paths PathGraphBuilder::sort(): File 0: Sorted 28236257 paths PathGraph::extend(): 15518875 -> 28236257 paths (83914878 ranks) PathGraph::extend(): 0.943737 GB in 1 file(s) GCSA::GCSA(): Step 2 (path length 22 -> 44) PathGraph::prune(): 28236257 -> 17773107 paths (15585140 ranges) PathGraph::prune(): 14761744 unique, 0 redundant, 3000924 unsorted, 10439 nondeterministic paths PathGraph::prune(): 0.592932 GB in 1 file(s) PathGraph::read(): File 0: Read 17773107 order-22 paths PathGraph::extend(): File 0: Created 22877942 order-44 paths PathGraph::read(): File 0: Read 22877942 order-44 paths PathGraphBuilder::sort(): File 0: Sorted 22877942 paths PathGraph::extend(): 17773107 -> 22877942 paths (83831103 ranks) PathGraph::extend(): 0.823657 GB in 1 file(s) GCSA::GCSA(): Prefix-doubling: 46.9834 seconds, 4.09481 GB GCSA::GCSA(): Merging the paths MergedGraph::MergedGraph(): 18772860 paths with 62904019 ranks and 843401 additional from nodes MergedGraph::MergedGraph(): 0.683993 GB GCSA::GCSA(): Merging: 9.54995 seconds, 4.09481 GB GCSA::GCSA(): Building the index GCSA::GCSA(): Construction: 21.8281 seconds, 4.09481 GB GCSA::GCSA(): 18772860 paths, 19816094 edges GCSA::GCSA(): 19616261 pointers (9321243 redundant) GCSA::GCSA(): 3594587 samples at 3165728 positions LCPArray::LCPArray(): Construction: 0.677416 seconds, 4.09481 GB LCPArray::LCPArray(): 19070845 values at 6 levels (branching factor 64) gi224183346|gb|GL000256.1|: adding to graph1 gi224183346|gb|GL000256.1|: aligning sequence of 4928568bp against 705332 nodes terminate called after throwing an instance of 'std::runtime_error' what(): Cannot simplify Mapping with no position; must update position when removing leading deletions ./msga-MHC-ref-nopath.sh: line 8: 27895 Aborted (core dumped)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-290171793, or mute the thread https://github.com/notifications/unsubscribe-auth/AE0_X6JAOcXKW4RCIQ1-abbBRo3DMQToks5rqpvygaJpZM4J3Wq2 .

hgibling commented 7 years ago

It did not. Tried again with v1.5.0-31 and -B 10000:

...
LCPArray::LCPArray(): Construction: 0.684936 seconds, 4.75183 GB
LCPArray::LCPArray(): 19059560 values at 6 levels (branching factor 64)
gi224183346|gb|GL000256.1|: adding to graph1
gi224183346|gb|GL000256.1|: aligning sequence of 4928568bp against 705195 nodes
gi224183346|gb|GL000256.1|: editing graph
graph path 'gi224183346|gb|GL000256.1|' invalid: edge from 723574 end to 723575 start does not exist
creating edge
graph path 'gi224183346|gb|GL000256.1|' invalid: edge from 723575 end to 723576 start does not exist
creating edge
...
graph path 'gi224183346|gb|GL000256.1|' invalid: edge from 744886 end to 744887 start does not exist
creating edge
gi224183346|gb|GL000256.1|: sorting and compacting ids
building xg index
building GCSA2 index
InputGraph::InputGraph(): 17784208 kmers in 1 file(s)
InputGraph::read(): Read 17784208 11-mers from .vg-kmers-tmp-XXQ6MpEC
InputGraph::readKeys(): 3109455 unique keys
InputGraph::read(): Read 17784208 11-mers from .vg-kmers-tmp-XXQ6MpEC
InputGraph::readFrom(): 12138204 unique from nodes
InputGraph::read(): Read 17784208 11-mers from .vg-kmers-tmp-XXQ6MpEC
PathGraph::PathGraph(): 17784208 paths with 35568416 ranks
PathGraph::PathGraph(): 0.530011 GB in 1 file(s)
GCSA::GCSA(): Preprocessing: 11.8333 seconds, 8.06233 GB
GCSA::GCSA(): Prefix-doubling from path length 11
GCSA::GCSA(): Step 1 (path length 11 -> 22)
PathGraph::prune(): 17784208 -> 17726985 paths (3103834 ranges)
PathGraph::prune(): 782911 unique, 0 redundant, 16944054 unsorted, 20 nondeterministic paths
...
GCSA::GCSA(): Prefix-doubling: 53.3279 seconds, 8.06233 GB
GCSA::GCSA(): Merging the paths
MergedGraph::MergedGraph(): 20687934 paths with 71009688 ranks and 2702625 additional from nodes
MergedGraph::MergedGraph(): 0.786482 GB
GCSA::GCSA(): Merging: 20.772 seconds, 8.06233 GB
GCSA::GCSA(): Building the index
GCSA::GCSA(): Construction: 40.9665 seconds, 8.06233 GB
GCSA::GCSA(): 20687934 paths, 22083495 edges
GCSA::GCSA(): 23390559 pointers (11252355 redundant)
GCSA::GCSA(): 6070783 samples at 3949188 positions
LCPArray::LCPArray(): Construction: 0.604448 seconds, 8.06233 GB
LCPArray::LCPArray(): 21016316 values at 6 levels (branching factor 64)
[vg msga] failed to include alignment, retrying 
expected 000256.1| Homo sapiens chromosome 6 genomic contig, GRCh37 reference assembly alternate locus group ALT_REF_LOCI_7TCGGATTAAG...
got 000256.1| Homo sapiens chromosome 6 genomic contig, GRCh37 reference assembly alternate locus group ALT_REF_LOCI_7TCGGATTAAGTGC...
{"mapping": [{"position": {"node_id": 22391, "offset": 11}, "edit": [{"to_length": 114, "sequence": "000256.1| Homo sapiens chromosome 6 genomic ...
{"name": "gi224183346|gb|GL000256.1|", "mapping": [{"position": {"node_id": 13502}, "edit": [{"from_length": 22, "to_length": 22}], "rank": ...
[vg msga] Error: failed to include path gi224183346|gb|GL000256.1|
ekg commented 7 years ago

Looks like msga needs some attention. Sorry if I missed but is the input you are testing on available? I'd like to reproduce this.

On Sun, Apr 2, 2017, 8:23 PM Heather Gibling notifications@github.com wrote:

It did not. Tried again with v1.5.0-31 and -B 10000:

... LCPArray::LCPArray(): Construction: 0.684936 seconds, 4.75183 GB LCPArray::LCPArray(): 19059560 values at 6 levels (branching factor 64) gi224183346|gb|GL000256.1|: adding to graph1 gi224183346|gb|GL000256.1|: aligning sequence of 4928568bp against 705195 nodes gi224183346|gb|GL000256.1|: editing graph graph path 'gi224183346|gb|GL000256.1|' invalid: edge from 723574 end to 723575 start does not exist creating edge graph path 'gi224183346|gb|GL000256.1|' invalid: edge from 723575 end to 723576 start does not exist creating edge ... graph path 'gi224183346|gb|GL000256.1|' invalid: edge from 744886 end to 744887 start does not exist creating edge gi224183346|gb|GL000256.1|: sorting and compacting ids building xg index building GCSA2 index InputGraph::InputGraph(): 17784208 kmers in 1 file(s) InputGraph::read(): Read 17784208 11-mers from .vg-kmers-tmp-XXQ6MpEC InputGraph::readKeys(): 3109455 unique keys InputGraph::read(): Read 17784208 11-mers from .vg-kmers-tmp-XXQ6MpEC InputGraph::readFrom(): 12138204 unique from nodes InputGraph::read(): Read 17784208 11-mers from .vg-kmers-tmp-XXQ6MpEC PathGraph::PathGraph(): 17784208 paths with 35568416 ranks PathGraph::PathGraph(): 0.530011 GB in 1 file(s) GCSA::GCSA(): Preprocessing: 11.8333 seconds, 8.06233 GB GCSA::GCSA(): Prefix-doubling from path length 11 GCSA::GCSA(): Step 1 (path length 11 -> 22) PathGraph::prune(): 17784208 -> 17726985 paths (3103834 ranges) PathGraph::prune(): 782911 unique, 0 redundant, 16944054 unsorted, 20 nondeterministic paths ... GCSA::GCSA(): Prefix-doubling: 53.3279 seconds, 8.06233 GB GCSA::GCSA(): Merging the paths MergedGraph::MergedGraph(): 20687934 paths with 71009688 ranks and 2702625 additional from nodes MergedGraph::MergedGraph(): 0.786482 GB GCSA::GCSA(): Merging: 20.772 seconds, 8.06233 GB GCSA::GCSA(): Building the index GCSA::GCSA(): Construction: 40.9665 seconds, 8.06233 GB GCSA::GCSA(): 20687934 paths, 22083495 edges GCSA::GCSA(): 23390559 pointers (11252355 redundant) GCSA::GCSA(): 6070783 samples at 3949188 positions LCPArray::LCPArray(): Construction: 0.604448 seconds, 8.06233 GB LCPArray::LCPArray(): 21016316 values at 6 levels (branching factor 64) [vg msga] failed to include alignment, retrying expected 000256.1| Homo sapiens chromosome 6 genomic contig, GRCh37 reference assembly alternate locus group ALT_REF_LOCI_7TCGGATTAAG... got 000256.1| Homo sapiens chromosome 6 genomic contig, GRCh37 reference assembly alternate locus group ALT_REF_LOCI_7TCGGATTAAGTGC... {"mapping": [{"position": {"node_id": 22391, "offset": 11}, "edit": [{"to_length": 114, "sequence": "000256.1| Homo sapiens chromosome 6 genomic ... {"name": "gi224183346|gb|GL000256.1|", "mapping": [{"position": {"node_id": 13502}, "edit": [{"from_length": 22, "to_length": 22}], "rank": ... [vg msga] Error: failed to include path gi224183346|gb|GL000256.1|

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/469#issuecomment-291004574, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EQadlrYqqjiBiLJrJ8Kr7V0Im4kZks5rr-emgaJpZM4J3Wq2 .

hgibling commented 7 years ago

A similar output can be reproduced with the following:

vcf: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr6.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz ref: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz haplotypes: http://hgwdev.sdsc.edu/~anovak/hgvm/MHC/

cat ref.fa GI* > ref-alt.fa

vg construct -R 6:28477797-33448354 -t 8 -m 32 -f -v ALL.chr6.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -r human_g1k_v37.fasta > MHC-ref.vg
vg mod -D MHC-ref.vg > MHC-ref-nopath.vg
vg msga -f ref-alt.fa -g MHC-ref-nopath.vg -B 5000 -K 11 -X 2 -z -D -t 8 > MHC-ref-haps.vg