vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.1k stars 195 forks source link

clarification of xg warning during msga #363

Closed hgibling closed 8 years ago

hgibling commented 8 years ago

Hello,

I'm using vg to build a graph of the MHC region on chromosome 6. I constructed a graph using the GRCh38 reference sequence sequence and the 1000 genomes project VCF SNPs file for chromosome 6:

vg construct -R 6:28510120-33480577 -t 16 -m 32 -v SNPs-vcf -r reference-sequence > GRCh38-MHC-ref.vg

And then aligned the alternate haplotypes, following the parameters from the wiki on msga, but with the -z flag for retaining parts of the original graph that aren't in the input paths:

vg msga -g GRCh38-MHC-ref.vg -f alt-MHC.fa -B 128 -K 11 -X 2 -E 3 -G -S 0.95 -H 5 -z -D -t 16 > GRCh38-MHC-haps.vg

I get a new graph with more nodes and edges, but also several warnings:

[xg] warning: graph does not have edge from 3- to 5+ for path 6
[xg] warning: graph does not have edge from 5- to 6+ for path 6
[xg] warning: graph does not have edge from 10- to 12+ for path 6
[xg] warning: graph does not have edge from 17- to 19+ for path 6
[xg] warning: graph does not have edge from 20- to 21+ for path 6
...
[xg] warning: graph does not have edge from 594871- to 594872+ for path 6
[xg] warning: graph does not have edge from 594872- to 594874+ for path 6
[xg] warning: graph does not have edge from 594874- to 594875+ for path 6
[xg] warning: graph does not have edge from 594875- to 594877+ for path 6
[xg] warning: graph does not have edge from 594877- to 594878+ for path 6

I want to clarify these warnings: is this just telling me that GRCh38-MHC-ref.vg did not have an edge from node 3 to node 5 (path 6 being chromosome 6), but with the multiple sequence/graph alignment of the alternate haplotypes, a new edge has been added in GRCh38-MHC-haps.vg? Or is it saying there should be an edge between nodes 3 and 5 to accommodate the new input sequences, but the edge was not able to be incorporated in GRCh38-MHC-haps.vg?

Thanks

ekg commented 8 years ago

It looks like it is failing during the msga to retain the paths. Remove the -z flag?

On Wed, May 25, 2016, 19:41 Heather Gibling notifications@github.com wrote:

Hello,

I'm using vg to build a graph of the MHC region on chromosome 6. I constructed a graph using the GRCh38 reference sequence http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa sequence and the 1000 genomes project VCF SNPs file http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions for chromosome 6:

vg construct -R 6:28510120-33480577 -t 16 -m 32 -v SNPs-vcf -r reference-sequence > GRCh38-MHC-ref.vg

And then aligned the alternate haplotypes http://hgwdev.sdsc.edu/%7Eanovak/hgvm/MHC/, following the parameters from the wiki on msga https://github.com/vgteam/vg/wiki/Long-read-assemblies-using-vg-msga#example, but with the -z flag for retaining parts of the original graph that aren't in the input paths:

vg msga -g GRCh38-MHC-ref.vg -f alt-MHC.fa -B 128 -K 11 -X 2 -E 3 -G -S 0.95 -H 5 -z -D -t 16 > GRCh38-MHC-haps.vg

I get a new graph with more nodes and edges, but also several warnings:

[xg] warning: graph does not have edge from 3- to 5+ for path 6 [xg] warning: graph does not have edge from 5- to 6+ for path 6 [xg] warning: graph does not have edge from 10- to 12+ for path 6 [xg] warning: graph does not have edge from 17- to 19+ for path 6 [xg] warning: graph does not have edge from 20- to 21+ for path 6 ... [xg] warning: graph does not have edge from 594871- to 594872+ for path 6 [xg] warning: graph does not have edge from 594872- to 594874+ for path 6 [xg] warning: graph does not have edge from 594874- to 594875+ for path 6 [xg] warning: graph does not have edge from 594875- to 594877+ for path 6 [xg] warning: graph does not have edge from 594877- to 594878+ for path 6

I want to clarify these warnings: is this just telling me that GRCh38-MHC-ref.vg did not have an edge from node 3 to node 5 (path 6 being chromosome 6), but with the multiple sequence/graph alignment of the alternate haplotypes, a new edge has been added in GRCh38-MHC-haps.vg? Or is it saying there should be an edge between nodes 3 and 5 to accommodate the new input sequences, but the edge was not able to be incorporated in GRCh38-MHC-haps.vg?

Thanks

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/363

hgibling commented 8 years ago

I get the same long list of warnings without -z

ekg commented 8 years ago

This warning indicates something is not working correctly. We have included a path in a previous iteration and it is now not fully embedded in the graph after editing.

Would it be possible to share the test case inputs in a zip file here?

On Wed, May 25, 2016, 19:58 Heather Gibling notifications@github.com wrote:

I get the same long list of warnings without -z

— You are receiving this because you commented.

Reply to this email directly or view it on GitHub https://github.com/vgteam/vg/issues/363#issuecomment-221672897

hgibling commented 8 years ago

The sequences and vcf are online:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr6.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP_no_SVs.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr6.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP_no_SVs.vcf.gz.tbi

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

# rename reference sequences to match vcf
sed 's/>chr/>/' GRCh38_full_analysis_set_plus_decoy_hla.fa > GRCh38_full_analysis_set_plus_decoy_hla-no-chr.fa

wget http://hgwdev.sdsc.edu/~anovak/hgvm/hgvm.tar.gz
tar zxvf hgvm.tar.gz
cat hgvm/MHC/ref.fa hgvm/MHC/G*fa > MHC.fa

vg construct -R 6:28510120-33480577 -t 16 -m 32 -v ALL.chr6.phase3_shapeit2_mvncall_integrated_v3plus_nounphased.rsID.genotypes.GRCh38_dbSNP_no_SVs.vcf.gz -r GRCh38_full_analysis_set_plus_decoy_hla-no-chr.fa > GRCh38-MHC-ref.vg

Which gives me: GRCh38-MHC-ref.vg.zip

vg msga -g GRCh38-MHC-ref.vg -f MHC.fa -B 128 -K 11 -X 2 -E 3 -G -S 0.95 -H 5 -z -D -t 16 > GRCh38-MHC-haps.vg

I can't compress GRCh38-MHC-haps.vg small enough to add on github, but can send it another way (Dropbox?) if you'd like.

hgibling commented 8 years ago

Hi,

Wondering if anyone's had a chance to look into what might be going on here.

adamnovak commented 8 years ago

When you get the final graph, does vg validate like it?

I really have no idea what's going on here, beyond the edges needed for the reference path having escaped from the graph somehow; @ekg is the msga person.

I can reproduce this with this tiny starting graph, though. Somehow it tries to make an xg index that has path 6 from the input graph but not the correct nodes/edges for it...

Perhaps we should turn that warning into an error?

part.vg.zip

hgibling commented 8 years ago

When you get the final graph, does vg validate like it?

I get nothing returned to terminal, so... yes I think?

ekg commented 8 years ago

Yes, no error message means the graph is valid. I'll try to reproduce this today so we can resolve the bug you ran into. Sorry for the delay.

On Sun, Jun 26, 2016, 22:36 Heather Gibling notifications@github.com wrote:

When you get the final graph, does vg validate like it?

I get nothing returned to terminal, so... yes I think?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/363#issuecomment-228621298, or mute the thread https://github.com/notifications/unsubscribe/AAI4ESD5S105whJpXOr3X3AXULRD5tbFks5qPuLLgaJpZM4Im2Ap .

ekg commented 8 years ago

We could make it output a positive result. Any suggestions? It doesn't because Unix convention is "no error, no message".

On Mon, Jun 27, 2016, 08:02 Erik Garrison erik.garrison@gmail.com wrote:

Yes, no error message means the graph is valid. I'll try to reproduce this today so we can resolve the bug you ran into. Sorry for the delay.

On Sun, Jun 26, 2016, 22:36 Heather Gibling notifications@github.com wrote:

When you get the final graph, does vg validate like it?

I get nothing returned to terminal, so... yes I think?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/363#issuecomment-228621298, or mute the thread https://github.com/notifications/unsubscribe/AAI4ESD5S105whJpXOr3X3AXULRD5tbFks5qPuLLgaJpZM4Im2Ap .

ekg commented 8 years ago

Digging back into this. I've been able to construct the graph following the commands and inputs. I think the warnings are somehow related to feeding in a graph generated by vg construct, but I'm not yet sure exactly why. I suspect that it's possible to work around this cleanly by stripping the path out of the graph produced from the VCF by vg mod -D GRCh38-MHC-ref.vg >GRCh38-MHC-ref.nopath.vg and then putting this into the same vg msga command. In the end we map the reference path into the graph as part of the progressive assembly process, so we can drop it up front without issue. I'm trying this now to verify that it checks out.

ekg commented 8 years ago

Also, I find the graph that is generated to be valid at least as far as vg validate knows, which makes the warnings all the more mysterious.

hgibling commented 8 years ago

We could make it output a positive result. Any suggestions? It doesn't because Unix convention is "no error, no message".

Any sort of "graph checks out fine" would be helpful :)

I think the warnings are somehow related to feeding in a graph generated by vg construct, but I'm not yet sure exactly why.

I've been able to do a vg construct and feed that into vg msga -g for a small toy example without issues, but yeah it doesn't seem to like this particular graph.

I suspect that it's possible to work around this cleanly by stripping the path out of the graph produced from the VCF by vg mod -D GRCh38-MHC-ref.vg >GRCh38-MHC-ref.nopath.vg and then putting this into the same vg msga command. In the end we map the reference path into the graph as part of the progressive assembly process, so we can drop it up front without issue. I'm trying this now to verify that it checks out.

Any luck?

ekg commented 8 years ago

It did work. I can post it for you or just provide the exact commands. It seems there is something broken or mismatched with how the path annotations come out of vg construct. The reference path is embedded in the end as it is one of the sequences in the MHC.fa, so practically this shouldn't be a problem.

On Wed, Jun 29, 2016, 15:38 Heather Gibling notifications@github.com wrote:

We could make it output a positive result. Any suggestions? It doesn't because Unix convention is "no error, no message".

Any sort of "graph checks out fine" would be helpful :)

I think the warnings are somehow related to feeding in a graph generated by vg construct, but I'm not yet sure exactly why.

I've been able to do a vg construct and feed that into vg msga -g for a small toy example without issues, but yeah it doesn't seem to like this particular graph.

I suspect that it's possible to work around this cleanly by stripping the path out of the graph produced from the VCF by vg mod -D GRCh38-MHC-ref.vg > GRCh38-MHC-ref.nopath.vg and then putting this into the same vg msga command. In the end we map the reference path into the graph as part of the progressive assembly process, so we can drop it up front without issue. I'm trying this now to verify that it checks out.

Any luck?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/363#issuecomment-229358105, or mute the thread https://github.com/notifications/unsubscribe/AAI4ETKH2iIA6x63d0me_lWH-azwn5njks5qQnVagaJpZM4Im2Ap .

hgibling commented 8 years ago

Ooh yay! Commands would be great :)

ekg commented 8 years ago

@hgibling this is all I've done:

time vg mod -D GRCh38-MHC-ref.vg >GRCh38-MHC-ref.nopath.vg
time vg msga -g GRCh38-MHC-ref.nopath.vg -f MHC.fa -B 128 -K 11 -X 2 -E 3 -G -S 0.95 -H 5 -z -D -t 16 > GRCh38-MHC-haps.1.vg

No errors are registered. The ref path gets embedded as well as it's in the MHC.fa.

ekg commented 8 years ago

@hgibling did you have a chance to try this out?

hgibling commented 8 years ago

Yep, works great. Thanks so much!

ekg commented 8 years ago

Great! We should still figure out what's different with the construct result, but this is promising!

Looking forward to what you cook up.

On Thu, Jun 30, 2016, 19:12 Heather Gibling notifications@github.com wrote:

Yep, works great. Thanks so much!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/363#issuecomment-229725178, or mute the thread https://github.com/notifications/unsubscribe/AAI4EYqtNbxtEKNZB5xRTSMwWNaHrVCKks5qQ_kDgaJpZM4Im2Ap .

adamnovak commented 8 years ago

I'm going to close this since the problem seems to be resolved. The construct issue may want its own issue thread.