pangenome / pggb

the pangenome graph builder
https://doi.org/10.1038/s41592-024-02430-3
MIT License

empty VCF after running PanGenIe on the pggb assembly #228

Closed Overcraft90 closed 1 year ago

Overcraft90 commented 2 years ago

Hi there,

I generated a graph for five human individuals with the following command:

pggb -i /g100_scratch/userexternal/mungaro0/pangenomes/pangenome_ref_guided_pggb-v4.fa.gz -o /g100_scratch/userexternal/mungaro0/VGS -t 48 -p 98 -s 100000 -n 12 -k 311 -O 0.03 -T 24 -G 13117,13219 -S -m -V 'GRCh38_p14v2.0.fna.gz:#'

Then, I set GRCh38 as the reference for the VCF that is output once the pangenome has been built. My plan is to run PanGenIe and benchmark HG005 (an Asian sample not present in the pangenome) against the graph. Once I got the tool running, I first hit an error about "un-phased" haplotypes, which seemed odd to me, as all the genome assemblies were produced using both HiFi and Hi-C data.

On closer inspection, the problem was related to the structure of the VCF file: it splits each individual into two columns, one for each of that individual's two haplotypes (see the screenshot below).

[Image: Screenshot from 2022-09-13 11-23-40]

My first approach was to merge those columns for each of the five individuals, separating the two values with "|", using the following command:

awk -F '\t' '/^#/ {print;next;} {OFS="\t";for(i=j=10; i < NF; i+=2) {$j = $i"|"$(i+1); j++} NF=j-1}1' pangenome_ref_guided_GRCh38.vcf > pangenome_ref_guided_GRCh38-phased.vcf 

This, I guess, tricked PanGenIe into running just fine; the problem is that the tool returned an empty VCF for the HG005 sample... Is there any way I can fix this?
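One thing worth ruling out: the awk command above prints every line starting with "#" unchanged, including the #CHROM line, so the header still lists one name per haplotype while the data rows have one column per individual. The sketch below merges the columns and rewrites the #CHROM line to match. It is only a sketch under stated assumptions: the function name merge_haplotype_columns is made up for illustration, sample columns are assumed to start at column 10 and to come in adjacent pairs named like NAME#1 and NAME#2, and whether this is what makes PanGenIe return an empty VCF is not established here.

```shell
# Sketch: merge adjacent haplotype columns into one phased column per sample,
# and rewrite the #CHROM line to match. Reads a VCF on stdin, writes to stdout.
# Assumption: sample columns start at column 10 and come in pairs named
# like NAME#1, NAME#2 (the trailing "#<hap>" is stripped for the merged name).
merge_haplotype_columns() {
  awk -F '\t' -v OFS='\t' '
    /^##/ { print; next }                 # meta-information lines pass through
    {
      header = ($0 ~ /^#CHROM/)           # decide before any field is modified
      j = 10
      for (i = 10; i < NF; i += 2) {
        if (header) {
          name = $i
          sub(/#[^#]*$/, "", name)        # HG002#1 -> HG002
          $j = name
        } else {
          $j = $i "|" $(i + 1)            # 0 and 1 -> 0|1
        }
        j++
      }
      NF = j - 1                          # drop the now-unused trailing columns
      print
    }
  '
}
```

It could be run as: merge_haplotype_columns < pangenome_ref_guided_GRCh38.vcf > pangenome_ref_guided_GRCh38-phased.vcf. As in the original command, an unpaired last column (an odd number of sample columns) is silently dropped, so it is worth counting the sample columns first.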

After I discussed this problem with other people, Glenn kindly pointed me to the links below:

  1. https://github.com/pangenome/vcfbub and
  2. https://github.com/vcflib/vcflib

I had a look at both applications, and at how they are used in the context of the HPRCyear1 repository.

However, the approach I followed is probably more basic and straightforward, mainly because I was not aware of many of the details and considerations to take into account. For instance, I haven't merged GRCh38 and CHM13, nor removed the unplaced contigs from the former. Therefore, I was wondering whether there is still a chance of getting my VCF to work with PanGenIe by running a specific command from one (or both) of those applications that would make it "accessible" to the tool.

P.S. I can attach a screenshot of the file after the awk command if that would be useful. Also, I have already made sure that the headers/contigs in the reference genome I fed to PanGenIe and the names/contigs in the #CHROM column of the VCF are the same. One thing I'm not sure about is whether the length of the contig names in the #CHROM column somehow affects the process, or whether the "#" characters cause issues, even though I don't think that is the case.
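For reference, the contig-name comparison described above can be scripted rather than checked by eye. This is a minimal sketch, not part of pggb or PanGenIe: the function name contigs_missing_from_fasta is hypothetical, both files are assumed to be uncompressed, and a FASTA sequence name is assumed to be the header text up to the first whitespace.

```shell
# Sketch: print every CHROM value in a VCF that has no matching FASTA header.
# Usage: contigs_missing_from_fasta reference.fa calls.vcf   (both uncompressed)
# Assumption: a FASTA sequence name is the header text up to the first whitespace.
contigs_missing_from_fasta() {
  awk '
    FNR == NR {                           # first file: the FASTA
      if (/^>/) seen[substr($1, 2)] = 1   # strip the leading ">"
      next
    }
    /^#/ { next }                         # second file: skip VCF header lines
    !($1 in seen) && !reported[$1]++ { print $1 }
  ' "$1" "$2"
}
```

No output means every name in the #CHROM column has an identically named sequence in the FASTA; any name printed is one the reference does not contain, prefix and "#" characters included.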

Let me know (and sorry for the long message), thanks!

AndreaGuarracino commented 1 year ago

Handled here: https://github.com/eblerjana/pangenie/issues/20
