vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

vg view reports that a subgraph by vg chunk is invalid #2504

Open subwaystation opened 5 years ago

subwaystation commented 5 years ago

Hi vgteam :) @superjox @trgibbons

  1. What you were trying to do: I tried to use https://github.com/vgteam/sequenceTubeMap to browse the region S288C_chrVII:95084-95584 of a 12-sample yeast pangenome. The graph was built with https://github.com/ekg/seqwish, and its .gfa output was converted to .vg. An .xg index was created, too. To confirm the issue, I also ran vg chunk and vg view manually, which produced the same problem. However, when I leave out -T, -b, and -E, the command runs without issues. But since SequenceTubeMap requires these options, I am stuck here.

  2. What you wanted to happen: Take a look at the specified positions.

  3. What actually happened: SequenceTubeMap output:

    
    http POST chr22_v4 received
    nodeID = 95084
    distance = 500
    no gam index provided.
    no gbwt file provided.
    dataPath = ./exampleData/
    [ 'chunk',
    '-x',
    './exampleData/joint_yeast_genomes-twelve.xg',
    '-c',
    '2',
    '-p',
    'S288C_chrVII:95084-95584',
    '-T',
    '-b',
    './tmp-152f2180-ec06-11e9-a890-7bfa71685467/chunk',
    '-E',
    './tmp-152f2180-ec06-11e9-a890-7bfa71685467/regions.tsv' ]
    vg chunk exited with code 0
    vg view err data: graph path 'DBVPG6765_chrVII' invalid: edge from 4218946 start to 1204907 start does not exist
    vg view err data: [vg view] warning: graph is invalid!
    vg view exited with code 0

And then SequenceTubeMap just crashes, of course.
Output when I run the same thing directly on the command line:

    graph path 'DBVPG6765_chrVII' invalid: edge from 4218946 start to 1204907 start does not exist
    [vg view] warning: graph is invalid!


4. What data and command line to use to make the problem recur, if applicable:
`vg: variation graph tool, version v1.19.0 "Tramutola"`

time vg view --gfa-in /ctx/projects/Q2380-Pantograph/03_data_processing/10_seqwish/10_yeast/21_PacBio_twelve/joint_yeast_genomes-twelve.gfa --vg > joint_yeast_genomes-twelve.vg

vg index -x joint_yeast_genomes-twelve.xg -t 5 joint_yeast_genomes-twelve.vg

vg chunk -x joint_yeast_genomes-twelve.xg -c 2 -p S288C_chrVII:95084-95584 -T -b chunk -E regions.tsv | vg view -j - > S288C_chrVII:95084-95584.json

vg chunk -x joint_yeast_genomes-twelve.xg -c 2 -p S288C_chrVII:95084-95584 | vg view -j - > S288C_chrVII:95084-95584.json
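The two vg chunk invocations above differ only in the -T, -b, and -E options. A minimal Python sketch that assembles both argument lists, in the same shape as the list SequenceTubeMap logs, might look like the following. The function name and its parameters are hypothetical; only flags that already appear in this report are used.

```python
from typing import List, Optional


def build_chunk_args(xg: str, region: str, context: int,
                     out_prefix: Optional[str] = None,
                     regions_tsv: Optional[str] = None) -> List[str]:
    """Assemble a `vg chunk` argument list (hypothetical helper).

    Uses only the flags shown in this report: -x, -c, -p, -T, -b, -E.
    """
    args = ["chunk", "-x", xg, "-c", str(context), "-p", region]
    if out_prefix is not None and regions_tsv is not None:
        # The extra options SequenceTubeMap passes; per the report, leaving
        # them out makes the command run without the warning.
        args += ["-T", "-b", out_prefix, "-E", regions_tsv]
    return args


full = build_chunk_args("joint_yeast_genomes-twelve.xg",
                        "S288C_chrVII:95084-95584", 2,
                        out_prefix="chunk", regions_tsv="regions.tsv")
minimal = build_chunk_args("joint_yeast_genomes-twelve.xg",
                           "S288C_chrVII:95084-95584", 2)
print(" ".join(["vg"] + full))
print(" ".join(["vg"] + minimal))
```

Keeping the optional flags in one place makes it easy to switch between the failing and the working invocation while debugging.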



5. Provide links to data, if possible:
[joint_yeast_genomes-twelvexg.gz](https://computomics.com/sharing/download.php?id=201&token=WD0o4thy1iNFBfi9gwSv2SrRFxZig7e3)
[joint_yeast_genomes-twelvevg.gz](https://computomics.com/sharing/download.php?id=200&token=m1QpG4bk2yVrW8SVigVIVPvKaLethqSt)
[S288C_chrVII:95084-95584.json.zip](https://github.com/vgteam/vg/files/3716672/S288C_chrVII.95084-95584.json.zip)

subwaystation commented 5 years ago

I also played around with the -c parameter [20, 100, 1000, 500, 50], but that did not solve the issue.

ekg commented 5 years ago

This might have to do with the invalidity of paths in subgraphs. We have been talking about how to resolve this for some time without much progress. There are some simple hacks, like creating new paths whose names relate them to the path range they are derived from.

subwaystation commented 5 years ago

Thanks for the feedback @ekg. I have to admit, this makes me kind of unhappy. Can you point me to these hacks? Are there any examples? Can I contribute somehow so that we can solve this issue in the foreseeable future?

I assume deleting the invalid paths would solve the issue. But then we wouldn't have a complete subgraph?

glennhickey commented 5 years ago

vg used to use the "rank" field to somewhat support disconnected paths. But we lost that when switching to the new API. The discussion on how to properly support subpaths with the new API is here: https://github.com/vgteam/libhandlegraph/issues/29

vg chunk takes care to ensure that the reference path (here S288C_chrVII, the one given with -p) is not disconnected. And this has been sufficient for our VCF-based graphs. But now you have other assemblies in the graph and that's tripping up DBVPG6765_chrVII.
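To illustrate how a non-reference path can end up disconnected, here is a minimal Python sketch. It is not vg's implementation. It assumes a simplified model in which nodes are only visited in forward orientation and an edge is an unordered pair of node IDs (vg additionally tracks node sides, as the "start to ... start" wording in the error suggests). All names and the toy graph are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

Edge = Tuple[int, int]


def first_missing_edge(path: List[int], edges: Iterable[Edge]) -> Optional[Edge]:
    """Return the first pair of consecutive path steps with no edge, or None.

    Simplification: orientation is ignored and edges are unordered pairs.
    """
    edge_set = {frozenset(e) for e in edges}
    for a, b in zip(path, path[1:]):
        if frozenset((a, b)) not in edge_set:
            return (a, b)
    return None


# Toy graph: the reference runs 1-2-3-4, another assembly detours 1-5-6-7-4.
edges = [(1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7), (7, 4)]
ref = [1, 2, 3, 4]
alt = [1, 5, 6, 7, 4]

# Suppose the extracted chunk misses node 6 (it lies outside the context).
chunk_nodes = {1, 2, 3, 4, 5, 7}
chunk_edges = [(a, b) for a, b in edges if a in chunk_nodes and b in chunk_nodes]
alt_in_chunk = [n for n in alt if n in chunk_nodes]

print(first_missing_edge(ref, chunk_edges))           # reference stays connected
print(first_missing_edge(alt_in_chunk, chunk_edges))  # the detour is broken
```

In this toy case the reference path stays valid inside the chunk, while the other assembly's path now steps from node 5 to node 7 without an edge between them, which is the same kind of situation the warning describes.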

I think we probably need Erik's simple hack of renaming path chunks to get around this. It might go here: https://github.com/vgteam/vg/blob/master/src/algorithms/subgraph.cpp#L292-L322

I'll try to take a shot at implementing it today. Sorry about this!

subwaystation commented 5 years ago

Thanks for the prompt answer @glennhickey !

Ah, I see. So I am asking for a subgraph that the reference path is not really part of any more, because of the assembly-style graph, and vg chunk cannot handle that.

Cool, looking forward to that implementation ;)

subwaystation commented 5 years ago

If I can assist you at some point, let me know.

subwaystation commented 5 years ago

One edge case I can think of, having SequenceTubeMap and the current vg chunk in mind, is the following: if I extract a subgraph by path_name:start_pos-end_pos, I will only get the paths running through the nodes of the subgraph. But there could be a path that does not contain any of the sequence represented by these nodes. Such a path is anchored in a node to the left and a node to the right of the subgraph. This might be a structural variation that I want to be able to show in, e.g., SequenceTubeMap. Would it still be a valid subgraph if it contained a path that visits none of its nodes?
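The edge case can be made concrete with a small Python sketch. It assumes paths are plain lists of node IDs and that the region is given as a range of step indices on the reference path; the function name and the toy data are hypothetical and not part of vg or SequenceTubeMap.

```python
from typing import Dict, List


def paths_skipping_region(paths: Dict[str, List[int]], ref_name: str,
                          start: int, end: int) -> List[str]:
    """Names of paths that jump straight over reference steps [start, end).

    Assumes 0 < start < end < len(ref), so both flanking steps exist.
    """
    ref = paths[ref_name]
    inside = set(ref[start:end])
    left, right = ref[start - 1], ref[end]
    skipping = []
    for name, steps in paths.items():
        if name == ref_name or inside & set(steps):
            continue  # the path visits the region, so a chunk would include it
        # Look for the two flanking nodes as consecutive steps of this path.
        if any(a == left and b == right for a, b in zip(steps, steps[1:])):
            skipping.append(name)
    return skipping


paths = {
    "ref": [1, 2, 3, 4, 5],
    "with_region": [1, 2, 3, 4, 5],
    "deletion": [1, 5],  # anchored left and right of nodes 2..4
}
print(paths_skipping_region(paths, "ref", 1, 4))
```

A chunk built only from nodes 2 to 4 would contain no step of the "deletion" path, even though that path is exactly the structural variant one might want to display.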

glennhickey commented 5 years ago

I just tried the data and can reproduce. But I'm curious why tubemaps is crashing. vg view reporting that the graph is invalid is just a warning. It still exits with code 0 (as per your output). Is it that tubemaps is looking for the edge that's missing in the path?
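Since the exit code is 0 either way, a caller has to inspect stderr to notice the problem. A minimal Python sketch of such a check follows. The message patterns are copied from the output quoted in this thread and may differ in other vg versions; the function names are hypothetical.

```python
import re
from typing import List, Tuple

INVALID_MARKER = "graph is invalid"
# Pattern taken from the warning text quoted in this issue.
MISSING_EDGE = re.compile(
    r"graph path '([^']+)' invalid: edge from (\d+) \w+ to (\d+) \w+ does not exist"
)


def invalid_paths(stderr: str) -> List[Tuple[str, int, int]]:
    """Extract (path name, from node, to node) for each reported missing edge."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)))
            for m in MISSING_EDGE.finditer(stderr)]


def chunk_is_usable(returncode: int, stderr: str) -> bool:
    """Treat the run as usable only if it exited 0 and printed no invalid-graph warning."""
    return returncode == 0 and INVALID_MARKER not in stderr


stderr = ("graph path 'DBVPG6765_chrVII' invalid: edge from 4218946 start "
          "to 1204907 start does not exist\n"
          "[vg view] warning: graph is invalid!\n")
print(invalid_paths(stderr))
print(chunk_is_usable(0, stderr))
```

A front end could use a check like this to report which path is broken instead of failing later without a helpful message.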

subwaystation commented 5 years ago

I suspect that it cannot deal with the fact that the *.annotate.txt file is empty. But I have to admit that I am not yet familiar enough with TubeMaps to test that. I fiddled around in the code so that TubeMaps' built-in command line leaves out -T -b -E, and then it just breaks again without giving any error that is helpful, at least to me.

ekg commented 5 years ago

@glennhickey that code snippet doesn't quite do what I'm suggesting.

My idea was to break the paths where they are discontinuous in the subgraph. For each broken path segment, we set a name that relates it to the path it was derived from.

The hack I wanted to implement was using naming convention to convey path ranges of the subgraph.

So if we had a path x that got split into pieces we might get paths like [x]:10-20, [x]:30-40. Then we could also make another subgraph of one, yielding [[x]:30-40]:3-6. Maybe we should be translating the positions from the original path, but that would be a bit more involved.
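A minimal Python sketch of this naming idea is shown below. It is an illustration under stated assumptions, not the code in the PR: paths are lists of node IDs, the subgraph is a set of kept node IDs, and the numbers in a name are 0-based, end-exclusive step indices relative to the parent path (the thread leaves the coordinate system open). The function name is hypothetical.

```python
from typing import Dict, List, Set


def split_path(name: str, steps: List[int], keep: Set[int]) -> Dict[str, List[int]]:
    """Break a path into maximal runs of consecutive steps whose nodes are in `keep`.

    Each run is named "[name]:start-end", with start/end being step indices
    into the parent path (an assumption made for this sketch).
    """
    pieces: Dict[str, List[int]] = {}
    run_start = None
    # Iterate one position past the end so a trailing run gets closed.
    for i in range(len(steps) + 1):
        inside = i < len(steps) and steps[i] in keep
        if inside and run_start is None:
            run_start = i
        elif not inside and run_start is not None:
            pieces[f"[{name}]:{run_start}-{i}"] = steps[run_start:i]
            run_start = None
    return pieces


x = [10, 11, 12, 13, 14, 15, 16]
first = split_path("x", x, {10, 11, 14, 15, 16})
print(first)

# Taking a subgraph of a subgraph nests the names, relative to the subpath.
second = split_path("[x]:4-7", first["[x]:4-7"], {15, 16})
print(second)
```

Because the nested coordinates are relative to the subpath rather than the original path, mapping back to the original requires adding up the offsets, which is the "more involved" translation mentioned above.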

glennhickey commented 5 years ago

@ekg which code snippet?

ekg commented 5 years ago

The subgraph one you have a PR against.


subwaystation commented 5 years ago

Any updates here? As far as I understand, #2506 did not pass Travis?

glennhickey commented 5 years ago

Are you able to try the branch from #2506 to see if it solves your problem? That PR's stuck on a unit test failure that occurs only on Mac that I'm having trouble reproducing.

subwaystation commented 5 years ago

I will try it out and report back here.

subwaystation commented 5 years ago

So I did:

git clone --recursive https://github.com/vgteam/vg.git
cd vg
git checkout glenn
. ./source_me.sh && make

And I ran into:

In file included from src/packed_path_position_overlays.cpp:1:
include/bdsg/packed_path_position_overlays.hpp:16:10: fatal error: BooPHF.h: No such file or directory
   16 | #include <BooPHF.h>
      |          ^~~~~~~~~~
compilation terminated.
make[1]: *** [Makefile:63: obj/packed_path_position_overlays.o] Error 1
make[1]: Leaving directory '/home/heumos/git/vg_2504/vg/deps/libbdsg'
make: *** [Makefile:618: lib/libbdsg.a] Error 2

subwaystation commented 5 years ago

Is there a dependency that is not installed on my machine? I am running Arch Linux.

subwaystation commented 5 years ago

The README of https://github.com/vgteam/libbdsg tells me that I need to have https://github.com/rizkg/BBHash/tree/alltypes installed in a place on the system where the compiler can find it. But BBHash seems to be there:

[heumos@wave deps]$ ls /home/heumos/git/vg_2504/vg/deps/BBHash/
BooPHF.h         example.cpp                      LICENSE
bootest.cpp      example_custom_hash.cpp          makefile
bootestFile.cpp  example_custom_hash_strings.cpp  README.md

I would expect that the Makefile takes care of the rest?
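One possible reading of the error is that the header exists on disk but not in any directory the compiler searches for `#include <BooPHF.h>`. The Python sketch below mimics that lookup in simplified form. The directory layout is created in a temporary folder purely for illustration, and the helper name is hypothetical; it says nothing about what the vg Makefile actually does.

```python
import os
import tempfile
from typing import List, Optional


def resolve_include(header: str, include_dirs: List[str]) -> Optional[str]:
    """Return the first include directory containing `header`, or None.

    Simplified model of `#include <header>`: only the listed directories
    are searched, regardless of where else the file may exist.
    """
    for d in include_dirs:
        if os.path.isfile(os.path.join(d, header)):
            return d
    return None


with tempfile.TemporaryDirectory() as root:
    bbhash = os.path.join(root, "deps", "BBHash")
    include = os.path.join(root, "include")
    os.makedirs(bbhash)
    os.makedirs(include)
    open(os.path.join(bbhash, "BooPHF.h"), "w").close()

    # Header is present under deps/BBHash but that directory is not searched.
    found_without = resolve_include("BooPHF.h", [include])
    # Adding the directory to the search list makes the lookup succeed.
    found_with = resolve_include("BooPHF.h", [include, bbhash])
    print(found_without is None, found_with == bbhash)
```

If this is what happens in the build, the fix would be on the build-system side (making the header visible on the include path) rather than installing anything new.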

glennhickey commented 5 years ago

You can try updating the submodules (git submodule sync --recursive ; git submodule update --init --recursive), or running

git clone --recursive https://github.com/vgteam/vg.git --branch glenn

at the outset.


subwaystation commented 5 years ago

Thanks! On it again.

subwaystation commented 5 years ago

I did:

git clone --recursive https://github.com/vgteam/vg.git --branch glenn
cd vg/
. ./source_me.sh && make

And I still get:

In file included from src/packed_path_position_overlays.cpp:1:
include/bdsg/packed_path_position_overlays.hpp:16:10: fatal error: BooPHF.h: No such file or directory
   16 | #include <BooPHF.h>
      |          ^~~~~~~~~~
compilation terminated.
make[1]: *** [Makefile:63: obj/packed_path_position_overlays.o] Error 1
make[1]: Leaving directory '/home/heumos/git/vg_2504/vg/deps/libbdsg'
make: *** [Makefile:618: lib/libbdsg.a] Error 2

Am I missing something?

The file exists in deps:

[heumos@wave vg]$ ls deps/BBHash/BooPHF.h 
deps/BBHash/BooPHF.h

glennhickey commented 5 years ago

That worked fine here. I just rebased this on master, which may contain some fixes that make building more robust. If you're able to build the master branch, this one should too (fresh checkout recommended).


subwaystation commented 5 years ago

I am not even able to build the master branch on my machine, see https://github.com/vgteam/vg/issues/2522. But it aborts with a different error. Puzzling.

I will try to build on a VM which hosts Ubuntu 18.04. But I still want to be able to compile vg on my machine.

subwaystation commented 5 years ago

So I was able to build both the current master and @glennhickey's branch on Ubuntu 18.04. Now I can test his implementation.

But it would make me really happy if I could compile vg on my machine.