vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.07k stars 191 forks source link

Issue with VCF generated after running vg haplotypes #4258

Open eblerjana opened 3 months ago

eblerjana commented 3 months ago

1. What were you trying to do?

I'm trying to run the haplotype sampling algorithm to create a VCF of the resulting subgraph with vg deconstruct. I used the commands shown below based on (CHM13-based) MC graph "hgsvc3_and_hprc-dec16-mc-chm13" available here: https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=pangenomes/scratch/2023_12_16_minigraph_cactus_hgsvc3_and_hprc/, and Illumina short reads for sample HG00733 from the 1000G.

GRAPH_HAPL="hgsvc3_and_hprc-dec16-mc-chm13.hapl"
GRAPH_GBZ="hgsvc3_and_hprc-dec16-mc-chm13.gbz"
READS="HG00733_ERR3988823.fasta.gz"

kmc -k31 -m64 -okff -fm -t24 -hp $READS HG00733 .
vg haplotypes -v 2 -t 24 --num-haplotypes 8 --include-reference -i $GRAPH_HAPL -k HG00733.kff -g HG00733.gbz $GRAPH_GBZ
vg deconstruct HG00733.gbz -P CHM13 -a -t 24 | bgzip > HG00733.vcf.gz

2. What did you want to happen?

I want to create a VCF from the subsampled graph to call bubbles relative to the reference genome (CHM13).

3. What actually happened?

The commands run successfully, but I noticed some issue with the resulting VCF file. According to the AT fields in the VCF, some reference alleles traverse nodes previously not traversed by the reference path in the GFA. In other words, nodes appear in these AT traversals that are not annotated as reference nodes in the GFA (missing SN:Z and SO:i tags) and additionally are not contained in the W line of the respective reference path. One such example is the node with ID "8521880". I attached a file containing an example of such a line in the VCF (for node: 8521880). I'm confused how this can happen. Did I do anything wrong?

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

not applicable

5. What data and command can the vg dev team use to make the problem happen?

see above (1.)

6. What does running vg version say?

vg version v1.54.0 "Parafada"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by xian@octo

vcf-line.txt

adamnovak commented 2 months ago

@glennhickey Do you know what deconstruct might be thinking to come up with the wrong reference traversals?

glennhickey commented 2 months ago

This looks like the usual confusion about the node ids being different between the GFA (unclipped) and GBZ (clipped to 1024bp). You can generally get the GFA node ids out of vg deconstruct by using its -O option

    -O, --gbz-translation    Use the ID translation from the input gbz to apply snarl names to snarl names and AT fields in output

A few more details about node clipping in general can be found here:

https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md#node-chopping