Closed baozg closed 2 years ago
Hi @baozg
I am not sure I understood all your questions properly, but let me try.
vg combine
there also is odgi squeeze
. Link to the Docs is https://odgi.readthedocs.io/en/latest/rst/commands/odgi_squeeze.html. vg deconstruct
for our per chromosome PGGB HPRC graphs. Example command line vg deconstruct -P chm13 -H # -e -a -t 48 chr10.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa
. The VCFs can be plugged into e.g. https://bitbucket.org/jana_ebler/pangenie/src/master/, but that is still an ongoing research process.vg
. They should provide tutorials for the type of analysis you have in mind.
@ChriKub already built a pangenome graph from A.thaliana. Maybe he has some tips for you regarding the usage of vg giraffe
.All the best, Simon
For long read mapping, there is https://github.com/maickrau/GraphAligner available.
Hi @baozg,
We do generate phased VCFs from pggb. The approach uses a feature in vg deconstruct
that allows us to assign contigs to sample phase sets based on name prefixes (-H #
where #
is a delimiter in the path name): https://github.com/pangenome/pggb/blob/master/pggb#L531-L532
We use the Pangenome Sequence Naming convention in our graphs. This can be applied to the input FASTA to pggb, and then you don't need to worry about linking contigs to particular sample haplotypes at any subsequent stage of analysis. It ensures compatibility with all existing tools based on FASTA references and our pangenome methods. You can also rewrite the path names (P-lines) to match this, if you've already made graphs. The naming for you (assuming you have a tetraploid for instance) would be like sample1#0#contig10
for a contig on the first haplotype or sample1#3#contig99
for a contig on the 4th haplotype. I believe that vg deconstruct -H '#' -e -a
will determine the ploidy automatically, but you may want to check this and set ploidy specifically with -p
if the detection doesn't work.
To merge chromosome graphs, I would suggest using vg ids -j
to generate a combined ID space for all the GFAs. You can then combine these with cat.
I've been using this approach to combine the graphs:
ls ../*/*.smooth.gfa.gz | while read i; do zcat $i >$(basename $i .gz); echo $i; done && TEMPDIR=/scratch vg ids -j $(ls *smooth*.gfa | sort -bV)
( head -1 ../views/chr1.gfa; ( seq 22 ; echo X; echo Y; echo M ) | while read i; do cat ../views/chr$i.gfa | tail -n+2 ; done ) | pigz >pggb.wgg.88.gfa.gz
The id-space joined graphs are then usable by vg deconstruct
. Doing this ensures that the node IDs in the VCF files (the record id and some annotations include these) match the combined graph. I used this script:
#!/bin/bash
input=$1
vcf_spec=$2
threads=$3
vg=vg-375cad7
timer=/home/erikg/.guix-profile/bin/time
fmt="%C\n%Us user %Ss system %P cpu %es total %MKb max memory"
for s in $( echo "$vcf_spec" | tr ',' ' ' );
do
ref=$s
echo "[vg::deconstruct] making VCF with reference=$ref"
vcf=$(basename input .gfa).$(echo $ref | tr '/|' '_').vcf
( TEMPDIR=$(pwd) $timer -f "$fmt" $vg deconstruct -P $ref \
-H '#' -e -a -t $threads "$input" >"$vcf" )
pigz -p $threads $vcf
done
Hope this helps! Please follow up with questions or confirmation that it works.
Hi @subwaystation @ekg
Thanks for the promptly and detailed reply. I will try it right now. I have noticed the PanSN-Spec in the repo. I have already modify the chromosome name to meet the specification. I am suprised that PanGenie
have accepted the vg decontruct
VCF.
I will update the result when I try the command. But I still have several questions.
For haplotype id, it should be 0-based or 1-based ? Does it matter ? Can we mixed the different ploidy assembly (diploid and haploid both)?
odgi
is very powerful for exploring the graph pangenome in more way. It have a easy to use tutorial and documentation I already try it follow the tutorial. I will try it to explore the pggb graph. But for user, it's not easy to figure out the which graph structure is optimal. We can have different graph for a complex region, which is more prevalent in the plant genome. Can I compare the VCF of vg deconsturct
with the VCF based Deepariant
with binned HiFi reads (seperate different haplotype and do multi-sample calling)? Is the genotype comparable in two callsets?
The percentage of mapped annotated gene can also reflect the graph completeness (Like we did BUSCO or asmgene
in the linear assembly). I have tried the minigraph
for mapping gene with -x lr
, so can I use the GraphAligner
to mapped the annoated gene (intron+exon) to the pggb graph ?
Hi @baozg
You are right, you can't plug the vg deconstruct
VCF directly into PanGenie
. That's why @glennhickey came up with https://github.com/pangenome/resolve-nested-genotypes. This should help you to get there. Or @eblerjana knows more.
vg rna
but I think it is for short reads only. Do you just want to map the sequences of the genes to the PGGB graph or do you expect sophisticated splicing awareness? I hope @ekg can answer the other open your questions, I could take wild guesses only.
vcfbub can also be used for the nested genotype filtering.
On Thu, Dec 2, 2021, 09:45 Simon Heumos @.***> wrote:
Hi @baozg https://github.com/baozg
You are right, you can't plug the vg deconstruct VCF directly into PanGenie. That's why @glennhickey https://github.com/glennhickey came up with https://github.com/pangenome/resolve-nested-genotypes. This should help you to get there. Or @eblerjana https://github.com/eblerjana knows more.
- You could also map all the initial sequences back to a graph and evaluate the mapping rate using something like https://github.com/pangenome/pgge. This gives you an idea of how well the graph actually represents your single sequences and haplotypes. There also is vg rna but I think it is for short reads only. Do you just want to map the sequences of the genes to the PGGB graph or do you expect sophisticated splicing awareness?
I hope @ekg https://github.com/ekg can answer the other open your questions, I could take wild guesses only.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pangenome/pggb/issues/145#issuecomment-984413788, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEI4YWXAO2R3M4OMKXTUO4W35ANCNFSM5JBD6ERA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
Hi @baozg
You are right, you can't plug the
vg deconstruct
VCF directly intoPanGenie
. That's why @glennhickey came up with https://github.com/pangenome/resolve-nested-genotypes. This should help you to get there. Or @eblerjana knows more.3. You could also map all the initial sequences back to a graph and evaluate the mapping rate using something like https://github.com/pangenome/pgge. This gives you an idea of how well the graph actually represents your single sequences and haplotypes. There also is `vg rna` but I think it is for short reads only. Do you just want to map the sequences of the genes to the PGGB graph or do you expect sophisticated splicing awareness?
I hope @ekg can answer the other open your questions, I could take wild guesses only.
regarding PanGenie, if you remove all variants with LV>0 (i.e. keep only LV=0 records), PanGenie should work fine. This tool https://github.com/pangenome/resolve-nested-genotypes is for postprocessing PanGenie genotypes. It derives a genotype for the LV>0 variants as well.
Hi, @ekg @subwaystation
Thanks for the suggestions. The combine gfa command can work well in my side. But does the command *.gfa
should change to *.smooth.gfa
? (just for confirm)
ls ./*.smooth.gfa.gz | while read i; do zcat $i >$(basename $i .gz); echo $i; done && TEMPDIR=/scratch vg ids -j $(ls *smooth*.gfa | sort -bV)
( head -1 ./chr1.smoothgfa; ( seq 22 ; echo X; echo Y; echo M ) | while read i; do cat ./chr$i.gfa | tail -n+2 ; done ) | pigz >pggb.wgg.88.gfa.gz
vg deconstruct
did treat the true haplotype of my sample (in my case, tetraploid), the vcf is phased (0|0|1|1), but it have lots of >1Mb ref and alt allele. I am working the odgi depth analysis to filter out the after and complex region in the graph.
vcfbub
can filter these allele quickly, how do you decide the length cutoff? (Besides it only accept the bgzip indexed vcf)
@eblerjana
Actually, we are working on a tetraploid graph, the PanGenie
phasing cannot work on it. Can we use the PanGenie
to get the ref and alt count (reduce the ref bias and allele unbalance), and regenotype with other tetraploid software?
Hi everyone
Sorry for jumping into this thread. This is the first I've heard about PanGenie. Is it something which should only be run after aligning short reads to the genome graph? Or can it be used on VCF derived from the topology of the entire pangenome graph (20 haplotypes in the graph)? I've been routinely using the entire VCF file derived from the topology of the graph vg deconstruct
, and comparing variant calls between conventional BWA-MEM alignments to a single reference. I am intending to also align short genomic reads to the graph, so something like PanGenie would be useful to resolve the variant calls in that case.
My main concern is interpretation of the VCF derived from the topology of the graph, and if there are any "essential" post-processing steps I need to take before deploying the VCF into production. I'm loading the VCF for each haplotype into JBrowse for biologists/geneticists to refer to.
When new short read sequencing is complete on new haplotypes, I intend to align to the graph, and add further haplotypes to JBrowse based on the variant calls, and PanGenie in this case seems to be something I need to post-process those calls with, based on what I've read in this thread. Thanks.
Hi, @eblerjana
PanGenie
output only include the GT:GQ:GL:KC
. Can it output the ref/alt kmer count? Also, it has a diploid assumption for vg deconstruct
vcf, how can I use the tetraploid phased genotype for PanGenie
?
PanGenie outputs the number of unique kmers it uses for each allele in the AK field in the INFO column. PanGenie only supports diploid genomes for the input/output VCF unfortunately as the underlying model is designed for diploids.
Hi, @eblerjana
I find the AK
fileld have discordance with the genotype result. The AK
field count are much larger than the KC, what's the difference between them ? All I need is the ref and alt count for each allele, in case can run the other software are based tetrapoildy genome. Do you think this is ok ? I have tried use the four haplotype to encode four pseduo-diploid (0->0|0,1->1|1)
AF=0.25;UK=149;AK=72,77;MA=4 GT:GQ:GL:KC 1/1:10000:-294.038,-54.9,0:4
AF=0.25;UK=0;AK=0,0;MA=4 GT:GQ:GL:KC 1/1:54:-23.3693,-5.466,-1.485e-06:0
AF=0.25;UK=11;AK=9,2;MA=4 GT:GQ:GL:KC 1/1:117:-27.1002,-11.7,-8.576e-13:0
The AK field just tells you the amount of unique kmers considered per allele during genotyping but this is unrelated to the final genotype. It is also unrelated to the sample that is genotyped since it only depends on the variant itself (therefore it is listed in the INFO field). What contributes to the genotype of a sample is whether these kmers were seen in the sequencing reads of the sample and how often they were seen. This information is not encoded in the VCF.
The KC field tells you the average kmer coverage in the region. In order to compute it, the count of each unique kmer in the sequencing data of the sample is determined and then the average is computed across all these counts.
I'm not sure if PanGenie is useful in a tetraploid setting, but I agree that this definitely interesting.
So the AK
fileld is just the input VCF allele count, it is stable in all the genotyped sample with the same graph. Can PanGenie
output the ref and alt count for regenotyping or phasing ? (like the AD
field in the GATK or DeepVariant)
these kmers were seen in the sequencing reads of the sample and how often they were seen.
I think this information not only useful for tetraploidy population but also for other want to re-genotype or phasing by itself (other population but use graph to reduce the reference bias)
So the
AK
fileld is just the input VCF allele count, it is stable in all the genotyped sample with the same graph. CanPanGenie
output the ref and alt count for regenotyping or phasing ? (like theAD
field in the GATK or DeepVariant)
The AK field specifies the counts of unique kmers carried by the given allele (not the number of alleles). No AD field is output currently since this is difficult to define in this particular setting. The kmers are not the only aspect used by PanGenie to decide on a genotype, the panel haplotypes also play an important role here. This is a bit difficult to encode in a single AD field.
Hi @eblerjana
Thanks for explanation. PanGenie
cannot use in this tetraploidy population. I have to try another pipeline for genotyping. Do you have any suggestions ?
Hi @ekg
After filtering the allele length with vcfbub
, what pipeline do you choose to use ? Using the filtered vcf to construct a graph and map the short reads with griaffe
?
The freebayes genotyping model should work for polyploids. But, we don't have it hooked into anything that would aggregate the read evidence over the graph or variants from it. But, it won't use the haplotype information like PanGenie.
I think it's going to be a project to get this working. It's worth pursuing. We have a similar problem we are looking at that will require polyploid genotyping.
On Wed, Dec 8, 2021, 17:49 Zhigui Bao @.***> wrote:
Hi @eblerjana https://github.com/eblerjana
Thanks for explanation. PanGenie cannot use in this tetrapolidy population. I should choose to
Hi @ekg https://github.com/ekg
After filtering the allele length with vcfbub, what pipeline do you choose to use ? Using the filtered vcf to construct a graph and map the short reads with griaffe ?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pangenome/pggb/issues/145#issuecomment-988987330, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEOVRG7RHGEMUZEBBSLUP6ECPANCNFSM5JBD6ERA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
Yes! Extension the genotyping model to polyploid is very interesting. For unrelated individual, it may need more sophisticated model to genotyping. But for us, we want to get the AD field for releated individual, then we can genotype and phase in a mapping population (F1). Do you think using the pggb
graph and then use vg
pipeline to genotyping variants is suitable?
I thought it's feasible, but maybe @joehagmann can comment here? He already tried some downstream genotyping.
Hi,
I have run the pggb for each chromosome fasta of pangenome. Does the HPRC graph have any downstream use in vg pipeline?
I have several questions.
vg genotyping and calling based the
snarl
result of vg, if I want to compare the VCF with the pggb pangenome graph. How can I use gfa forvg deconstruct
? I try the follwing code but failed to produce a vcf.But with the
cactus
vg graph, I can use thedeconstruct
to produce a vcf with multi-sample genotype.How to evalute the different parameter set for graph building?
Thanks, Zhigui Bao