pangenome / pggb

the pangenome graph builder
https://doi.org/10.1038/s41592-024-02430-3
MIT License
369 stars 40 forks source link

Downstream suggestions for use pggb gfa in the vg pipeline #145

Closed baozg closed 2 years ago

baozg commented 2 years ago

Hi,

I have run the pggb for each chromosome fasta of pangenome. Does the HPRC graph have any downstream use in vg pipeline?
I have several questions.

  1. How to combine the each chromosome gfa for vg and prepare the index of vg?
vg combine chr*/*.smooth.gfa > test.gfa
vg autoindex -p ref -R XG -w giraffe -g test.gfa -T ./tmp -M 230G -t 64
  1. Compared the VCF from the mapping pipeline and vg snarl

vg genotyping and calling based the snarl result of vg, if I want to compare the VCF with the pggb pangenome graph. How can I use gfa for vg deconstruct? I try the follwing code but failed to produce a vcf.


vg autoindex -p ref -R XG -w giraffe -g chr1.920c6bb.2ff309f.adf7ed8.smooth.gfa  -T ./tmp -M 230G -t 64

#  vcf only have variant without genotype, no sample
vg deconstruct -e -p ref -t 64 chr1.920c6bb.2ff309f.adf7ed8.smooth.gfa > chr1.vcf

# empty vcf with header, but with the exactly n-1 sample I use
vg deconstruct -e -p ref -t 64 chr1.xg > chr1.xg.vcf

But with the cactus vg graph, I can use the deconstruct to produce a vcf with multi-sample genotype.

  1. General pipeline suggestions of pggb gfa Does the HPRC graph have any downstream analysis using pggb output? (pggb -> vg index -> vg giraffe -> vg pack -> vg call) Is there any public avaiable pipeline I can follow?

How to evalute the different parameter set for graph building?

Thanks, Zhigui Bao

subwaystation commented 2 years ago

Hi @baozg

I am not sure I understood all your questions properly, but let me try.

  1. As an alternative for vg combine there also is odgi squeeze. Link to the Docs is https://odgi.readthedocs.io/en/latest/rst/commands/odgi_squeeze.html.
  2. We also use vg deconstruct for our per chromosome PGGB HPRC graphs. Example command line vg deconstruct -P chm13 -H # -e -a -t 48 chr10.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa. The VCFs can be plugged into e.g. https://bitbucket.org/jana_ebler/pangenie/src/master/, but that is still an ongoing research process.
  3. For an overview of the features of the constructed graph(s) we recommend to use ODGI. It can visualize, manipulate, validate, and explore pangenome graphs. For further downstream analysis you can take a look at vg. They should provide tutorials for the type of analysis you have in mind. @ChriKub already built a pangenome graph from A.thaliana. Maybe he has some tips for you regarding the usage of vg giraffe.

All the best, Simon

subwaystation commented 2 years ago

For long read mapping, there is https://github.com/maickrau/GraphAligner available.

ekg commented 2 years ago

Hi @baozg,

We do generate phased VCFs from pggb. The approach uses a feature in vg deconstruct that allows us to assign contigs to sample phase sets based on name prefixes (-H # where # is a delimiter in the path name): https://github.com/pangenome/pggb/blob/master/pggb#L531-L532

We use the Pangenome Sequence Naming convention in our graphs. This can be applied to the input FASTA to pggb, and then you don't need to worry about linking contigs to particular sample haplotypes at any subsequent stage of analysis. It ensures compatibility with all existing tools based on FASTA references and our pangenome methods. You can also rewrite the path names (P-lines) to match this, if you've already made graphs. The naming for you (assuming you have a tetraploid for instance) would be like sample1#0#contig10 for a contig on the first haplotype or sample1#3#contig99 for a contig on the 4th haplotype. I believe that vg deconstruct -H '#' -e -a will determine the ploidy automatically, but you may want to check this and set ploidy specifically with -p if the detection doesn't work.

To merge chromosome graphs, I would suggest using vg ids -j to generate a combined ID space for all the GFAs. You can then combine these with cat.

I've been using this approach to combine the graphs:

ls ../*/*.smooth.gfa.gz  | while read i; do zcat $i >$(basename $i .gz); echo $i; done && TEMPDIR=/scratch vg ids -j $(ls *smooth*.gfa | sort -bV)
( head -1 ../views/chr1.gfa; ( seq 22 ; echo X; echo Y; echo M ) | while read i; do cat ../views/chr$i.gfa | tail -n+2 ; done ) | pigz >pggb.wgg.88.gfa.gz

The id-space joined graphs are then usable by vg deconstruct. Doing this ensures that the node IDs in the VCF files (the record id and some annotations include these) match the combined graph. I used this script:

#!/bin/bash

input=$1
vcf_spec=$2
threads=$3

vg=vg-375cad7

timer=/home/erikg/.guix-profile/bin/time
fmt="%C\n%Us user %Ss system %P cpu %es total %MKb max memory"

for s in $( echo "$vcf_spec" | tr ',' ' ' );
do
    ref=$s
    echo "[vg::deconstruct] making VCF with reference=$ref"
    vcf=$(basename input .gfa).$(echo $ref | tr '/|' '_').vcf
    ( TEMPDIR=$(pwd) $timer -f "$fmt" $vg deconstruct -P $ref \
             -H '#' -e -a -t $threads "$input" >"$vcf" )
    pigz -p $threads $vcf
done

Hope this helps! Please follow up with questions or confirmation that it works.

baozg commented 2 years ago

Hi @subwaystation @ekg

Thanks for the promptly and detailed reply. I will try it right now. I have noticed the PanSN-Spec in the repo. I have already modify the chromosome name to meet the specification. I am suprised that PanGenie have accepted the vg decontruct VCF.

I will update the result when I try the command. But I still have several questions.

  1. For haplotype id, it should be 0-based or 1-based ? Does it matter ? Can we mixed the different ploidy assembly (diploid and haploid both)?

  2. odgi is very powerful for exploring the graph pangenome in more way. It have a easy to use tutorial and documentation I already try it follow the tutorial. I will try it to explore the pggb graph. But for user, it's not easy to figure out the which graph structure is optimal. We can have different graph for a complex region, which is more prevalent in the plant genome. Can I compare the VCF of vg deconsturct with the VCF based Deepariant with binned HiFi reads (seperate different haplotype and do multi-sample calling)? Is the genotype comparable in two callsets?

  3. The percentage of mapped annotated gene can also reflect the graph completeness (Like we did BUSCO or asmgene in the linear assembly). I have tried the minigraph for mapping gene with -x lr, so can I use the GraphAligner to mapped the annoated gene (intron+exon) to the pggb graph ?

subwaystation commented 2 years ago

Hi @baozg

You are right, you can't plug the vg deconstruct VCF directly into PanGenie. That's why @glennhickey came up with https://github.com/pangenome/resolve-nested-genotypes. This should help you to get there. Or @eblerjana knows more.

  1. You could also map all the initial sequences back to a graph and evaluate the mapping rate using something like https://github.com/pangenome/pgge. This gives you an idea of how well the graph actually represents your single sequences and haplotypes. There also is vg rna but I think it is for short reads only. Do you just want to map the sequences of the genes to the PGGB graph or do you expect sophisticated splicing awareness?

I hope @ekg can answer the other open your questions, I could take wild guesses only.

ekg commented 2 years ago

vcfbub can also be used for the nested genotype filtering.

On Thu, Dec 2, 2021, 09:45 Simon Heumos @.***> wrote:

Hi @baozg https://github.com/baozg

You are right, you can't plug the vg deconstruct VCF directly into PanGenie. That's why @glennhickey https://github.com/glennhickey came up with https://github.com/pangenome/resolve-nested-genotypes. This should help you to get there. Or @eblerjana https://github.com/eblerjana knows more.

  1. You could also map all the initial sequences back to a graph and evaluate the mapping rate using something like https://github.com/pangenome/pgge. This gives you an idea of how well the graph actually represents your single sequences and haplotypes. There also is vg rna but I think it is for short reads only. Do you just want to map the sequences of the genes to the PGGB graph or do you expect sophisticated splicing awareness?

I hope @ekg https://github.com/ekg can answer the other open your questions, I could take wild guesses only.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pangenome/pggb/issues/145#issuecomment-984413788, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEI4YWXAO2R3M4OMKXTUO4W35ANCNFSM5JBD6ERA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

eblerjana commented 2 years ago

Hi @baozg

You are right, you can't plug the vg deconstruct VCF directly into PanGenie. That's why @glennhickey came up with https://github.com/pangenome/resolve-nested-genotypes. This should help you to get there. Or @eblerjana knows more.

3. You could also map all the initial sequences back to a graph and evaluate the mapping rate using something like https://github.com/pangenome/pgge. This gives you an idea of how well the graph actually represents your single sequences and haplotypes. There also is `vg rna` but I think it is for short reads only. Do you just want to map the sequences of the genes to the PGGB graph or do you expect sophisticated splicing awareness?

I hope @ekg can answer the other open your questions, I could take wild guesses only.

regarding PanGenie, if you remove all variants with LV>0 (i.e. keep only LV=0 records), PanGenie should work fine. This tool https://github.com/pangenome/resolve-nested-genotypes is for postprocessing PanGenie genotypes. It derives a genotype for the LV>0 variants as well.

baozg commented 2 years ago

Hi, @ekg @subwaystation

Thanks for the suggestions. The combine gfa command can work well in my side. But does the command *.gfa should change to *.smooth.gfa? (just for confirm)

ls ./*.smooth.gfa.gz  | while read i; do zcat $i >$(basename $i .gz); echo $i; done && TEMPDIR=/scratch vg ids -j $(ls *smooth*.gfa | sort -bV)
( head -1 ./chr1.smoothgfa; ( seq 22 ; echo X; echo Y; echo M ) | while read i; do cat ./chr$i.gfa | tail -n+2 ; done ) | pigz >pggb.wgg.88.gfa.gz

vg deconstruct did treat the true haplotype of my sample (in my case, tetraploid), the vcf is phased (0|0|1|1), but it have lots of >1Mb ref and alt allele. I am working the odgi depth analysis to filter out the after and complex region in the graph. vcfbub can filter these allele quickly, how do you decide the length cutoff? (Besides it only accept the bgzip indexed vcf)

@eblerjana Actually, we are working on a tetraploid graph, the PanGenie phasing cannot work on it. Can we use the PanGenie to get the ref and alt count (reduce the ref bias and allele unbalance), and regenotype with other tetraploid software?

brettChapman commented 2 years ago

Hi everyone

Sorry for jumping into this thread. This is the first I've heard about PanGenie. Is it something which should only be run after aligning short reads to the genome graph? Or can it be used on VCF derived from the topology of the entire pangenome graph (20 haplotypes in the graph)? I've been routinely using the entire VCF file derived from the topology of the graph vg deconstruct, and comparing variant calls between conventional BWA-MEM alignments to a single reference. I am intending to also align short genomic reads to the graph, so something like PanGenie would be useful to resolve the variant calls in that case.

My main concern is interpretation of the VCF derived from the topology of the graph, and if there are any "essential" post-processing steps I need to take before deploying the VCF into production. I'm loading the VCF for each haplotype into JBrowse for biologists/geneticists to refer to.

When new short read sequencing is complete on new haplotypes, I intend to align to the graph, and add further haplotypes to JBrowse based on the variant calls, and PanGenie in this case seems to be something I need to post-process those calls with, based on what I've read in this thread. Thanks.

baozg commented 2 years ago

Hi, @eblerjana

PanGenie output only include the GT:GQ:GL:KC. Can it output the ref/alt kmer count? Also, it has a diploid assumption for vg deconstruct vcf, how can I use the tetraploid phased genotype for PanGenie ?

eblerjana commented 2 years ago

PanGenie outputs the number of unique kmers it uses for each allele in the AK field in the INFO column. PanGenie only supports diploid genomes for the input/output VCF unfortunately as the underlying model is designed for diploids.

baozg commented 2 years ago

Hi, @eblerjana

I find the AK fileld have discordance with the genotype result. The AK field count are much larger than the KC, what's the difference between them ? All I need is the ref and alt count for each allele, in case can run the other software are based tetrapoildy genome. Do you think this is ok ? I have tried use the four haplotype to encode four pseduo-diploid (0->0|0,1->1|1)

AF=0.25;UK=149;AK=72,77;MA=4    GT:GQ:GL:KC     1/1:10000:-294.038,-54.9,0:4
AF=0.25;UK=0;AK=0,0;MA=4        GT:GQ:GL:KC     1/1:54:-23.3693,-5.466,-1.485e-06:0
AF=0.25;UK=11;AK=9,2;MA=4       GT:GQ:GL:KC     1/1:117:-27.1002,-11.7,-8.576e-13:0
eblerjana commented 2 years ago

The AK field just tells you the amount of unique kmers considered per allele during genotyping but this is unrelated to the final genotype. It is also unrelated to the sample that is genotyped since it only depends on the variant itself (therefore it is listed in the INFO field). What contributes to the genotype of a sample is whether these kmers were seen in the sequencing reads of the sample and how often they were seen. This information is not encoded in the VCF.

The KC field tells you the average kmer coverage in the region. In order to compute it, the count of each unique kmer in the sequencing data of the sample is determined and then the average is computed across all these counts.

I'm not sure if PanGenie is useful in a tetraploid setting, but I agree that this definitely interesting.

baozg commented 2 years ago

So the AK fileld is just the input VCF allele count, it is stable in all the genotyped sample with the same graph. Can PanGenie output the ref and alt count for regenotyping or phasing ? (like the AD field in the GATK or DeepVariant)

these kmers were seen in the sequencing reads of the sample and how often they were seen.

I think this information not only useful for tetraploidy population but also for other want to re-genotype or phasing by itself (other population but use graph to reduce the reference bias)

eblerjana commented 2 years ago

So the AK fileld is just the input VCF allele count, it is stable in all the genotyped sample with the same graph. Can PanGenie output the ref and alt count for regenotyping or phasing ? (like the AD field in the GATK or DeepVariant)

The AK field specifies the counts of unique kmers carried by the given allele (not the number of alleles). No AD field is output currently since this is difficult to define in this particular setting. The kmers are not the only aspect used by PanGenie to decide on a genotype, the panel haplotypes also play an important role here. This is a bit difficult to encode in a single AD field.

baozg commented 2 years ago

Hi @eblerjana

Thanks for explanation. PanGenie cannot use in this tetraploidy population. I have to try another pipeline for genotyping. Do you have any suggestions ?

Hi @ekg

After filtering the allele length with vcfbub, what pipeline do you choose to use ? Using the filtered vcf to construct a graph and map the short reads with griaffe ?

ekg commented 2 years ago

The freebayes genotyping model should work for polyploids. But, we don't have it hooked into anything that would aggregate the read evidence over the graph or variants from it. But, it won't use the haplotype information like PanGenie.

I think it's going to be a project to get this working. It's worth pursuing. We have a similar problem we are looking at that will require polyploid genotyping.

On Wed, Dec 8, 2021, 17:49 Zhigui Bao @.***> wrote:

Hi @eblerjana https://github.com/eblerjana

Thanks for explanation. PanGenie cannot use in this tetrapolidy population. I should choose to

Hi @ekg https://github.com/ekg

After filtering the allele length with vcfbub, what pipeline do you choose to use ? Using the filtered vcf to construct a graph and map the short reads with griaffe ?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pangenome/pggb/issues/145#issuecomment-988987330, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEOVRG7RHGEMUZEBBSLUP6ECPANCNFSM5JBD6ERA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

baozg commented 2 years ago

Yes! Extension the genotyping model to polyploid is very interesting. For unrelated individual, it may need more sophisticated model to genotyping. But for us, we want to get the AD field for releated individual, then we can genotype and phase in a mapping population (F1). Do you think using the pggb graph and then use vg pipeline to genotyping variants is suitable?

subwaystation commented 2 years ago

I thought it's feasible, but maybe @joehagmann can comment here? He already tried some downstream genotyping.