vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

questions on path cover haplotypes and haplotype sampling #4189

Closed HongboDoll closed 6 months ago

HongboDoll commented 6 months ago

Hi vg team,

I am working with a reference genome (12 chromosomes) and an unphased VCF. I ran vg autoindex with the reference and the VCF, and it produced a gbz graph with indexes. vg paths showed that there are 16 "path cover" paths with a SENSE of HAPLOTYPE for each chromosome.

I checked the wiki for the gbwt command, but I still find it difficult to understand what a path cover is (perhaps it is the only option for building a gbwt from an unphased VCF for giraffe to map reads to?).

There is a haplotype sampling section in the wiki (https://github.com/vgteam/vg/wiki/Haplotype-Sampling), which requires a gbz graph as an input. I am wondering if I can use my gbz constructed from unphased VCF to perform haplotype sampling?

There is a parameter "--num-haplotypes" in vg haplotypes, with a default of 4. Is it reasonable to increase this to a higher value such as 64 or even 128?

Thank you very much

jltsiren commented 6 months ago

Giraffe is a haplotype-based aligner. It aligns the reads (usually) to the haplotypes you provide, but it uses the alignment (of the haplotypes) implied by the graph to avoid redundant work.

If you don't have true haplotypes in the input, the path cover option will generate artificial paths in the graph that can be used as haplotypes. 16 paths are usually enough to represent any combination of 4 successive variants, which is reasonable for mapping short reads. However, mapping speed and accuracy are unlikely to be as good as with true haplotypes.
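
The arithmetic behind the "16 paths for 4 successive variants" figure can be checked with a small sketch. It assumes every variant is biallelic (reference or alternate allele), which is one reading of why 16 is "usually" enough; it only counts allele combinations and is not the path cover algorithm vg uses. The function name `allele_combinations` is made up for illustration.

```python
from itertools import product


def allele_combinations(num_variants, alleles_per_variant=2):
    """Enumerate every allele choice over a window of successive variants.

    Assumption: each variant has the same number of alleles
    (2 = biallelic: 0 is the reference allele, 1 is the alternate).
    """
    return list(product(range(alleles_per_variant), repeat=num_variants))


# A window of 4 successive biallelic variants has 2**4 combinations,
# so 16 distinct paths are needed to cover all of them.
combos = allele_combinations(4)
print(len(combos))
```

With multiallelic sites the count grows beyond 16 (for example, `allele_combinations(4, 3)` has 81 entries), so under this reading the cover would no longer contain every combination.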

Haplotype sampling is intended to be used with true haplotypes. The idea is that instead of mapping reads to a single universal reference, you map them to a subgraph that is similar to the sequenced genome. By avoiding variants that are present in the reference haplotypes but not in the sequenced genome, both mapping speed and accuracy should improve further. But if you don't have true haplotypes to begin with, sampling is unlikely to do anything useful. And because the sampling process selects a subset of local haplotypes in each block, it cannot increase the number of haplotypes in the graph.
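
The last point, that sampling cannot add haplotypes, can be illustrated with a toy sketch. It is not how vg haplotypes ranks or selects haplotypes; the function `sample_local_haplotypes` and the string representation of haplotypes are hypothetical, and the selection rule (take the first distinct ones) is a placeholder for whatever scoring the real tool applies.

```python
def sample_local_haplotypes(block_haplotypes, num_haplotypes):
    """Toy model: choose up to num_haplotypes distinct haplotypes from one block.

    Assumption: a block is just a list of haplotype labels. Sampling only
    selects from what is already there, so the result can never be larger
    than the number of distinct haplotypes in the block.
    """
    distinct = list(dict.fromkeys(block_haplotypes))  # dedupe, keep order
    return distinct[:num_haplotypes]


# A block with 16 path cover haplotypes (hypothetical labels).
block = [f"cover_{i}" for i in range(16)]

small = sample_local_haplotypes(block, 4)
large = sample_local_haplotypes(block, 64)
print(len(small), len(large))
```

In this toy model, asking for 64 or 128 haplotypes from a block that holds 16 still yields at most 16.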

HongboDoll commented 6 months ago

Many thanks