How does vg handle phased and unphased Genotypes in VCF?

subwaystation commented 7 years ago

Hi vgteam!

I am trying to understand how vg handles different phenotypes of a diploid organism like 1|0and 1|2 (phased) or 1/0and 1/2 (unphased). Therefore I took a look at your tiniest example data set:

vg construct -v tiny.vcf.gz -r tiny.fa > tiny.vg

vg view -j tiny.vg > tiny.json

But only path x appears in the JSON. What am I missing?

Even the following

vg view -jpnw tiny.vg > tiny.json

Does not solve the issue. I would expect to see at least 3 paths there, right? tiny_SeqTubeMap.pdf

Viewing the graph in dot format:

vg view -dpnw tiny.vg | dot -Tsvg -o tiny.svg

This looks much better in the browser. But still only one real path is drawn. tiny_Dot.pdf

Thanks!

edawson commented 7 years ago

Hey @subwaystation ,

Sorry this has sat a few days. There short answer is that we don't do anything with haplotype info.

Paths in the graph come from the reference ('x' in this example) or the alleles (all those 'alt[a-z0-9]*[0-9]* paths). We currently construct graphs so that all possible haplotypes (not just those that are observed) are present in the graph's structure.

We don't differentiate between phased or unphased genotypes, and if you put two adjacent phased SNPs in you'd get the same structure as with unphased SNPs.

Does that make sense? It's a bit different in design from how a lot of pangenomes / cDBGs do things.

subwaystation commented 7 years ago

Hey @edawson,

thanks, that helped me a lot. Now I understand what is going on.

However, is there any planning to support a more differentiated haplotype model when constructing a graph? Or could you point out to me, why this may not be necessary any more?

edawson commented 7 years ago

Technically, we support labeling the haplotypes as Paths, though it's kind of a nontrivial workflow. You could for example label all phased/unphased haplotypes as such, or (one day) phase using the graph. You could then look at how many reads support a given haplotype and quantify how much of a given sample is present in the reads (e.g. for metagenomics). We do support gPBWT and gBWT queries for indexing paths but that's something either @ekg or @adamnovak would know more about than I.

There has been discussion at some point about whether to restrict the graph to only supported haplotypes, but then we'd lose the ability to easily map any variation that might break the haplotypes present in the VCF. We haven't yet seen a test case where restricting the graph to only the observed haplotypes or incorporating phasing info would be significantly beneficial.

adamnovak commented 7 years ago

If you want to look at haplotypes from a VCF on a tube map, you can use the --alt-paths option when building the graph with vg construct, and pass the --vcf-phasing option to vg index to load the phased chunks from the VCF into the XG index. Unphased sites will cut a haplotype. Then you need to use the tube map generator in the mode where it pulls from the gPBWT to extract the locally distinct haplotypes.

I think you can also use --write-haps if you only have a few haplotypes to get vg paths for the haplotypes, but then you'd have to get those back into a graph.

subwaystation commented 7 years ago

@edawson I can see now, why all possible haplotypes should be present in the graph, especially when one has the mapping in mind.

However, given the following case: I have a single reference FASTA and a VCF (phased, unphased, can be arbitraty). Now I only want to visualize (a subgraph with) the exact haplotype information that is given in the VCF with e.g. SequenceTubeMap. Here, it would make sense to keep only the paths in the graph that reflect the VCF, right? Or am I later able to select the required paths? As far as I tested your toolkit, the sample names from a VCF are not mapped into the graph structure? @adamnovak How would I proceed here? I tried your advice, but I got stuck at the TubeMapGenerator step because I don't know anything about it. And I was not able to find any information in this repository. The SequenceTubeMap's server names a trace module which seems to be part of vg but I can not find it, either. Thanks for any hints!

By the way, what do you mean by cut a haplotype? Is the path representing that haplotype split into several paths (vertically or horizontally)?

adamnovak commented 7 years ago

Cutting it means that it is split horizontally.

It looks like to use the tube map you need to have some commits from @yoheirosen's fork of vg, where the vg trace command has been implemented. I'm not sure why that hasn't been taken up into master yet.

https://github.com/yoheirosen/vg

The general idea there is you run a server program that runs vg commands to extract parts of your big indexed graph, and the different locally-distinct haplotypes. But it needs a vg with vg trace.

I'm going to open a PR for that right now.

subwaystation commented 7 years ago

Thanks for the detailed explanation guys! 👍

I will try out vg trace when it has landed in the master branch.

vgteam / vg

How does vg handle phased and unphased Genotypes in VCF? #781