vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

How to Generate a VCF File Showing Differences Between HG002 Haplotypes from the HPRC Pangenome #4419

Open xuxingyubio opened 1 month ago

xuxingyubio commented 1 month ago

Hello VG Team,

I am working with the HPRC pangenome and want to construct a VCF file that shows the differences between the two haplotypes (hap1 and hap2) of the HG002 sample. Specifically, I want a VCF file that represents hap2 relative to hap1 for HG002.

So far, I have downloaded the HPRC pangenome data from the HPRC project, which includes multiple haplotypes for various samples, including HG002. I tried vg convert to change the reference and vg deconstruct to obtain VCF files, but neither seems to support operating on an individual haplotype within a sample.

Could you please advise how to generate a VCF file that captures the differences between hap1 and hap2 of HG002 from the HPRC pangenome?

Thank you for your support!

glennhickey commented 1 month ago

HG002 was held out of the release HPRC graphs. If you want to make your own HG002-only graph, you can do so quite quickly with minigraph-cactus.

You'd feed it something like:

HG002_hap1   HG002.hap1.fa.gz
HG002_hap2   HG002.hap2.fa.gz

And run with --reference HG002_hap1 HG002_hap2 --vcf --vcfReference HG002_hap1 HG002_hap2 among the usual options to get a pair of haploid VCFs, each comparing one haplotype with the other.
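The seqfile and options above can be put together roughly as follows. This is a minimal sketch, not a tested pipeline: the job store `./js`, the seqfile name `hg002.seqfile`, the output directory and output name are placeholders, and the positional layout (`cactus-pangenome <jobstore> <seqfile>`) plus the `--outDir` and `--outName` options are assumed from the Cactus documentation. The option is spelled `--vcfReference` here on the assumption that this is how the Cactus CLI spells it. Check `cactus-pangenome --help` for your installed version. The command is only assembled and printed, never executed.

```python
import shlex


def seqfile_text(samples):
    """Return the two-column seqfile body: sample name, then FASTA path."""
    return "".join(f"{name}\t{fasta}\n" for name, fasta in samples)


def build_command(seqfile, references, jobstore="./js",
                  out_dir="hg002-pg", out_name="hg002"):
    """Assemble the argument list (layout is an assumption, see the text above)."""
    return [
        "cactus-pangenome", jobstore, seqfile,
        "--outDir", out_dir,        # placeholder output directory
        "--outName", out_name,      # placeholder output prefix
        "--reference", *references,
        "--vcf",
        "--vcfReference", *references,
    ]


samples = [
    ("HG002_hap1", "HG002.hap1.fa.gz"),
    ("HG002_hap2", "HG002.hap2.fa.gz"),
]

# Save this text as hg002.seqfile (hypothetical file name) before running.
print(seqfile_text(samples), end="")

cmd = build_command("hg002.seqfile", [name for name, _ in samples])
print(shlex.join(cmd))
```

Listing both haplotypes under `--reference` and `--vcfReference` is what yields one VCF per haplotype; the first name given is treated as the primary reference.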

Otherwise, if you already have a graph with HG002 in it, I think vg deconstruct -P will work. You may need to promote HG002 to a reference path as described here

xuxingyubio commented 1 month ago

Thank you for your response. I followed your method and tried it out, but I noticed that the contig lengths in the generated VCF file do not match the original lengths, resulting in positional misalignment. Could this be due to some trimming performed during pangenome construction (minigraph-cactus)? Is it possible to obtain the trimmed FASTA file used in pangenome construction?

id=NA12878hap1|ptg000002l 1895202

contig=
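A length discrepancy like this can be listed systematically by comparing the `##contig` lines of the VCF header with a FASTA index. The sketch below is illustrative only: it assumes the header lines have the standard `##contig=<ID=...,length=...>` form, that a `.fai` index (as written by `samtools faidx`) is available, and that contig names are identical in both files, which may not hold for graph-derived path names. The contig names and lengths in the demo are made up.

```python
import re

CONTIG_RE = re.compile(r"^##contig=<(.*)>$")


def header_contig_lengths(vcf_lines):
    """Return {contig_id: length} from ##contig header lines that carry a length."""
    lengths = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM line
        match = CONTIG_RE.match(line)
        if not match:
            continue
        fields = dict(kv.split("=", 1) for kv in match.group(1).split(",") if "=" in kv)
        if "ID" in fields and "length" in fields:
            lengths[fields["ID"]] = int(fields["length"])
    return lengths


def fai_lengths(fai_lines):
    """Return {name: length} from .fai lines (column 1 = name, column 2 = length)."""
    lengths = {}
    for line in fai_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            lengths[cols[0]] = int(cols[1])
    return lengths


def length_mismatches(vcf_lens, fasta_lens):
    """Return {contig: (vcf_length, fasta_length)} where the two disagree."""
    return {
        name: (vcf_lens[name], fasta_lens[name])
        for name in vcf_lens
        if name in fasta_lens and vcf_lens[name] != fasta_lens[name]
    }


# Made-up demo data: ctgB is reported 100 bp short in the VCF header.
demo_vcf = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=ctgA,length=500>",
    "##contig=<ID=ctgB,length=900>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
demo_fai = ["ctgA\t500\t6\t60\t61", "ctgB\t1000\t520\t60\t61"]

diff = length_mismatches(header_contig_lengths(demo_vcf), fai_lengths(demo_fai))
print(diff)
```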

glennhickey commented 1 month ago

Yeah, that's a known issue due to path fragmentation. The VCF itself is valid and the coordinates are correct; it's just that the contig lengths in the header can be too short. This only happens when multiple references are given, and only to references after the first (so hap2 in our example). Your options are:

xuxingyubio commented 1 month ago

I used hg38 as the reference, then switched the reference using vg convert and constructed the VCF file with vg deconstruct. However, I noticed that the reference bases in the VCF file at the corresponding coordinates do not match the original bases in the input FASTA file. Can using an unclipped graph solve this problem?
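To see how widespread such a mismatch is, the REF column of each record can be compared with the FASTA sequence at the same position. The following is a rough diagnostic sketch, not part of vg: it assumes an uncompressed VCF and FASTA small enough to hold in memory, and that the VCF CHROM names match the FASTA record names exactly (for paths taken from a graph this may need a name mapping or a coordinate offset). The sequence and records in the demo are invented.

```python
def read_fasta(lines):
    """Parse FASTA lines into {record_name: sequence}; name is the first word after '>'."""
    parts, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def ref_mismatches(vcf_lines, seqs):
    """Return (chrom, pos, vcf_ref, fasta_bases) for records whose REF disagrees."""
    bad = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref = line.rstrip("\n").split("\t")[:4]
        seq = seqs.get(chrom)
        if seq is None:
            continue  # contig not present in the FASTA; skipped, not counted
        start = int(pos) - 1  # VCF POS is 1-based
        found = seq[start:start + len(ref)]
        if found.upper() != ref.upper():
            bad.append((chrom, int(pos), ref, found))
    return bad


# Invented demo: the second record claims REF=G where the FASTA has A.
demo_fasta = [">ctgA demo", "ACGT", "ACGT"]
demo_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "ctgA\t2\t.\tC\tT\t.\t.\t.",
    "ctgA\t5\t.\tG\tA\t.\t.\t.",
]

problems = ref_mismatches(demo_vcf, read_fasta(demo_fasta))
print(problems)
```

If every record on a contig mismatches, a coordinate offset or naming difference is more likely than a problem with individual variants.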