vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Reuse of HGSVC graph #3612

Open JD12138 opened 2 years ago

JD12138 commented 2 years ago

Hi, I want to use your HGSVC graph for my own analysis. On your website I saw several different graphs:

  1. https://cgl.gi.ucsc.edu/data/giraffe/mapping/graphs/for-NA19240/hgsvc/hs38d1/HGSVC_hs38d1.sampled.64
  2. https://cgl.gi.ucsc.edu/data/giraffe/calling/hgsvc/
  3. https://cgl.gi.ucsc.edu/data/giraffe/products/HGSVC_hs38d1.
  4. https://cgl.gi.ucsc.edu/data/giraffe/calling/combined-sv-graph/

I want to know:

  1. What is the difference between the first three?
  2. Is the fourth one the graph that you used for genotyping the 5202 samples?
  3. Which one do you recommend for reuse?

adamnovak commented 2 years ago

The main documentation page is here: https://cglgenomics.ucsc.edu/giraffe-data/

1) https://cgl.gi.ucsc.edu/data/giraffe/mapping/graphs/for-NA19240/hgsvc/hs38d1/HGSVC_hs38d1.sampled.64* This is a graph with haplotype data re-sampled so that it has ~64 local haplotypes at each point.

2) https://cgl.gi.ucsc.edu/data/giraffe/calling/hgsvc/ @jmonlong might have to explain the "calling" graph, because it looks like the readme doesn't quite explain how it was generated. I'm not sure what the "N32" means here.

3) https://cgl.gi.ucsc.edu/data/giraffe/products/HGSVC_hs38d1.* This is our best HGSVC-only pangenome. It contains the 6 HGSVC haplotypes, and only synthesizes fake haplotypes for regions with no variant call data.

4) https://cgl.gi.ucsc.edu/data/giraffe/calling/combined-sv-graph/ This is a graph that @jmonlong made from the three SV catalogs, and which was used for SV typing the 5k samples. The haplotype data here can't really be right, because we didn't feed in any call sets called across the whole combined catalog.

As for which we recommend overall, I can't really say; we don't directly benchmark the HGSVC-only graph against the combined catalog graph in the paper. But if you want to genotype SVs, you probably want the "combined-sv-graph" graph, since that's the one we put together for that purpose.

The very best graph would probably result from taking the genotypes we called, phasing them, and making a new graph that includes that phased haplotype information. But we haven't built that yet.

JD12138 commented 2 years ago

Thank you very much! And I still have two questions.

  1. You said "The very best graph would probably result from taking the genotypes we called". Which of the following should I use for phasing and building a new graph? a) https://cgl.gi.ucsc.edu/data/giraffe/products/vggiraffe-sv-2504kgp-raw.vcf.gz b) https://cgl.gi.ucsc.edu/data/giraffe/products/vggiraffe-sv-2504kgp-svsites.gt.vcf.gz

  2. Is this VCF (https://cgl.gi.ucsc.edu/data/giraffe/construction/HGSVC.haps.vcf.gz) the original VCF that you used to construct the HGSVC graph?

adamnovak commented 2 years ago

According to the products README, we have:

  • VCF at the SV site level. Alleles were combined if matching (>=80% reciprocal overlap or sequence similarity). The allele was counted across all alleles at each site for each sample.

    • vggiraffe-sv-2504kgp-svsites.gt.vcf.gz and vggiraffe-sv-2504kgp-svsites.gt.vcf.gz.tbi: VCF and index for the 2,504 unrelated individuals in the 1000 Genomes Project. Includes allele counts, genotypes and genotype qualities, in addition to INFO such as allele frequency in all samples or for each of the super populations EUR/AFR/EAS/SAS/AMR.
  • Raw VCFs: VCF containing all the information from vg call (inc. GL) but at the allele level, i.e. >1M alleles.

    • vggiraffe-sv-2504kgp-raw.vcf.gz and vggiraffe-sv-2504kgp-raw.vcf.gz.tbi: VCF and index for the 2,504 unrelated individuals of the 1000 Genomes Project.

So I think vggiraffe-sv-2504kgp-svsites.gt.vcf.gz has actually had VCF records merged up so that they talk about the same alleles, while vggiraffe-sv-2504kgp-raw.vcf.gz would just have the raw single-sample calls for each sample, with no attempt to integrate them to talk about the same variants. I think vggiraffe-sv-2504kgp-svsites.gt.vcf.gz would be more feasible to try and impute phasing for.

I think that https://cgl.gi.ucsc.edu/data/giraffe/construction/HGSVC.haps.vcf.gz would be the VCF used to make the HGSVC-only graphs. It has the phased haplotypes for the three HGSVC samples.

JD12138 commented 2 years ago

Thanks! I have counted the SVs in https://cgl.gi.ucsc.edu/data/giraffe/construction/HGSVC.haps.vcf.gz. There are 66863 SVs in the file. But in your Science paper "Pangenomics enables genotyping of known structural variants in 5202 diverse genomes", there are 78,106 SVs in your HGSVC graph. Why are the numbers different?

adamnovak commented 2 years ago

@jmonlong Can you explain why these numbers are different?

The supplement says "Reported variant counts were derived from the VCFs used to build the graphs, with bcftools stats.", and that does look like the right file to be counting.

How exactly did you count SVs @JD12138?
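
The counting method matters because a VCF can be tallied per record or per alternate allele, and a multi-allelic record contributes differently to each total. The sketch below is only an illustration of that distinction, not the method used in the paper (which reports using bcftools stats) and not a confirmed explanation of the discrepancy. The function names `count_svs` and `count_svs_in_file`, the 50 bp length threshold, and the choice to skip symbolic alleles such as `<DEL>` are all assumptions made for the example.

```python
import gzip


def count_svs(vcf_lines, min_len=50):
    """Count SVs two ways: by record and by ALT allele.

    Assumption: an ALT allele is an SV when its length differs from REF
    by at least min_len bases. Symbolic alleles (<DEL>, *) are skipped
    because their length cannot be read from the ALT column.
    """
    records = 0
    alleles = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # header or empty line
        fields = line.rstrip("\n").split("\t")
        ref = fields[3]
        n_sv = 0
        for alt in fields[4].split(","):
            if alt.startswith("<") or alt in ("*", "."):
                continue
            if abs(len(alt) - len(ref)) >= min_len:
                n_sv += 1
        if n_sv:
            records += 1      # one per VCF line that carries any SV allele
            alleles += n_sv   # one per SV allele on that line
    return records, alleles


def count_svs_in_file(path, min_len=50):
    """Same count, reading a gzip- or bgzip-compressed VCF from disk."""
    with gzip.open(path, "rt") as handle:
        return count_svs(handle, min_len)


# Tiny made-up example: one record with two insertion alleles,
# one deletion record, and one SNV.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\t" + "A" + "T" * 60 + "," + "A" + "G" * 80 + "\t.\t.\t.\n",
    "chr1\t500\t.\t" + "C" * 71 + "\tC\t.\t.\t.\n",
    "chr1\t900\t.\tA\tG\t.\t.\t.\n",
]
print(count_svs(example))  # (2, 3): 2 SV records, 3 SV alleles
```

Calling `count_svs_in_file("HGSVC.haps.vcf.gz")` on a local copy of the file would give both totals side by side, which makes it easier to say which of the two a reported number corresponds to.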