Open subwaystation opened 7 years ago
Hey @subwaystation ,
Sorry this has sat a few days. There short answer is that we don't do anything with haplotype info.
Paths in the graph come from the reference ('x' in this example) or the alleles (all those 'alt[a-z0-9]*[0-9]* paths). We currently construct graphs so that all possible haplotypes (not just those that are observed) are present in the graph's structure.
We don't differentiate between phased or unphased genotypes, and if you put two adjacent phased SNPs in you'd get the same structure as with unphased SNPs.
Does that make sense? It's a bit different in design from how a lot of pangenomes / cDBGs do things.
Hey @edawson,
thanks, that helped me a lot. Now I understand what is going on.
However, is there any planning to support a more differentiated haplotype model when constructing a graph? Or could you point out to me, why this may not be necessary any more?
Technically, we support labeling the haplotypes as Paths, though it's kind of a nontrivial workflow. You could for example label all phased/unphased haplotypes as such, or (one day) phase using the graph. You could then look at how many reads support a given haplotype and quantify how much of a given sample is present in the reads (e.g. for metagenomics). We do support gPBWT and gBWT queries for indexing paths but that's something either @ekg or @adamnovak would know more about than I.
There has been discussion at some point about whether to restrict the graph to only supported haplotypes, but then we'd lose the ability to easily map any variation that might break the haplotypes present in the VCF. We haven't yet seen a test case where restricting the graph to only the observed haplotypes or incorporating phasing info would be significantly beneficial.
If you want to look at haplotypes from a VCF on a tube map, you can use the --alt-paths
option when building the graph with vg construct
, and pass the --vcf-phasing
option to vg index
to load the phased chunks from the VCF into the XG index. Unphased sites will cut a haplotype. Then you need to use the tube map generator in the mode where it pulls from the gPBWT to extract the locally distinct haplotypes.
I think you can also use --write-haps
if you only have a few haplotypes to get vg paths for the haplotypes, but then you'd have to get those back into a graph.
@edawson I can see now, why all possible haplotypes should be present in the graph, especially when one has the mapping in mind.
However, given the following case: I have a single reference FASTA and a VCF (phased, unphased, can be arbitraty). Now I only want to visualize (a subgraph with) the exact haplotype information that is given in the VCF with e.g. SequenceTubeMap. Here, it would make sense to keep only the paths in the graph that reflect the VCF, right? Or am I later able to select the required paths? As far as I tested your toolkit, the sample names from a VCF are not mapped into the graph structure?
@adamnovak How would I proceed here? I tried your advice, but I got stuck at the TubeMapGenerator step because I don't know anything about it. And I was not able to find any information in this repository. The SequenceTubeMap's server names a trace
module which seems to be part of vg
but I can not find it, either. Thanks for any hints!
By the way, what do you mean by cut a haplotype
? Is the path representing that haplotype split into several paths (vertically or horizontally)?
Cutting it means that it is split horizontally.
It looks like to use the tube map you need to have some commits from @yoheirosen's fork of vg, where the vg trace
command has been implemented. I'm not sure why that hasn't been taken up into master yet.
https://github.com/yoheirosen/vg
The general idea there is you run a server program that runs vg commands to extract parts of your big indexed graph, and the different locally-distinct haplotypes. But it needs a vg with vg trace
.
I'm going to open a PR for that right now.
Thanks for the detailed explanation guys! 👍
I will try out vg trace
when it has landed in the master branch.
Hi vgteam!
I am trying to understand how
vg
handles different phenotypes of a diploid organism like1|0
and1|2
(phased) or1/0
and1/2
(unphased). Therefore I took a look at your tiniest example data set:But only path
x
appears in the JSON. What am I missing?Even the following
Does not solve the issue. I would expect to see at least 3 paths there, right? tiny_SeqTubeMap.pdf
Viewing the graph in
dot
format:This looks much better in the browser. But still only one real path is drawn. tiny_Dot.pdf
Thanks!