brettChapman opened this issue 1 year ago
vg has a concept of "path sense", where paths can be "REFERENCE" paths or "HAPLOTYPE" paths. This comes from the GBZ file format, where HAPLOTYPE paths can be stored extremely efficiently (you can have thousands), but at the cost of not supporting all types of queries on them. Minigraph-cactus follows this convention in that only samples you specify with --reference end up as REFERENCE paths. Consequently, vg call can only use these REFERENCE paths to call variants on.
Now if you only have 7 genomes, you can just set them all as REFERENCE and it'll probably be okay. The easiest way to do this is probably to take the output GFA from minigraph-cactus, and change the "RS" tag in the first line to include all your samples (separated by spaces):
H VN:Z:1.1 RS:Z:SAMPLE1 SAMPLE2 SAMPLE3 ...
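One minimal way to make that edit is a sed sketch like the following (sample names are placeholders; GNU sed is assumed, and in a real GFA the header tags are tab-separated, so the RS value runs to the next tab or end of line):
# rewrite the RS tag so the listed samples all become REFERENCE-sense paths
sed '1s/RS:Z:[^\t]*/RS:Z:SAMPLE1 SAMPLE2 SAMPLE3/' graph.gfa > gfa-with-new-header.gfa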
From there you can make a new .gbz with:
vg gbwt -G gfa-with-new-header.gfa --gbz-format -g new.gbz
And then use that with vg pack / call (you can re-use your GAM).
The new .gbz will be the same as the old one, except the path sense will be changed to REFERENCE for everything.
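A minimal sketch of that reuse (file and sample names are placeholders):
# recompute read support on the new graph, then call; any REFERENCE path can now be used
vg pack -x new.gbz -g sample.gam -o sample.pack
vg call new.gbz -k sample.pack -s SAMPLE > sample.vcf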
Hi @glennhickey
Thanks, that helped a lot. I decided just to run again and provide --reference L1_v1 L2_v1 L3_v1 L4_v1 YL_28698_v1 YL_28815_v1 YL_29181_v1 with all the haplotypes I'm looking at. After filtering out any variants which didn't pass the filter, I found the final variant calls are heavily biased towards the first haplotype, which is my original reference. Does this mean any variant calls on other haplotypes are unique to that haplotype only? How can I interpret variant calls from a pangenome graph? I used the default mapping approach, which uses mash distance to determine alignment order.
I did a uniq -c to count the number of calls per chromosome/haplotype.
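A sketch of that count (the CHROM column of the VCF holds the path name):
grep -v '^#' ${sample}.vcf | cut -f1 | sort | uniq -c
This gave, e.g.: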
633 Llute_Paris_Chr17
7 Llute_Paris_Chr18
7 Llute_Paris_Chr19
69 Llute_Paris_Chr20
10 Llute_Paris_Chr21
24 Llute_Paris_Chr24
4 Llute_Paris_Chr25
3 Llute_Paris_Chr26
269812 Llute_Wodjil_Chr01
347052 Llute_Wodjil_Chr02
342285 Llute_Wodjil_Chr03
310478 Llute_Wodjil_Chr04
351581 Llute_Wodjil_Chr05
293075 Llute_Wodjil_Chr06
L1_v1 is Llute_Wodjil and L2_v1 is Llute_Paris
You can see Llute_Wodjil is heavily biased. Is this expected? This is for only one sample; I'll be aligning hundreds more samples and then creating a multi-sample VCF. Thanks.
The first reference does get special treatment, and there will indeed be a bit of bias towards it.
But I'm not quite sure what you're doing and what you're expecting. To get full calls on all samples, I think you will need to run vg call once for each sample. If you are just running vg call once, I think it will just take one sample per bubble.
Hi @glennhickey, sorry it's taken a while to get back to this.
I've been aligning each sample to the graph, then calling variants to produce a VCF for each sample. I then merge the VCF files to produce a single VCF with multiple samples.
For each of the reference-sense paths in the graph, I've split the VCF into separate per-reference files. You can see that there is a huge difference between the references, with Llute_Wodjil (the first reference) heavily biased:
-rw-rw-r-- 1 ubuntu ubuntu 88M Dec 7 03:42 Llute_27849_YL_pangenome.vcf
-rw-rw-r-- 1 ubuntu ubuntu 63M Dec 7 03:59 Llute_28213_YL_pangenome.vcf
-rw-rw-r-- 1 ubuntu ubuntu 6.9M Dec 7 04:17 Llute_28698_YL_pangenome.vcf
-rw-rw-r-- 1 ubuntu ubuntu 29M Dec 7 04:34 Llute_28815_YL_pangenome.vcf
-rw-rw-r-- 1 ubuntu ubuntu 2.2M Dec 7 04:52 Llute_29181_YL_pangenome.vcf
-rw-rw-r-- 1 ubuntu ubuntu 128M Dec 7 05:09 Llute_Paris_YL_pangenome.vcf
-rw-rw-r-- 1 ubuntu ubuntu 8.9G Dec 7 05:38 Llute_Wodjil_YL_pangenome.vcf
Does this mean that variants identified on one reference-sense path (with Llute_Wodjil set as the first reference) are completely unique to that path, and that variants identified on Llute_Wodjil are a kind of consensus (found on all paths, or only on Llute_Wodjil)? I'm trying to understand how these VCFs would differ if I were to align to each reference using conventional approaches instead, such as BWA-MEM followed by GATK. In that case, could I consider variants identified against Llute_Wodjil as core pangenome sequence regions, and anything aligning to the other reference-sense paths as non-core shell regions found in only a subset? Thanks.
Hi @glennhickey
Ideally I would like a multi-sample VCF file. I've realised that to really look at population structure, we'll need a gVCF so that reference alleles appear in the VCF after I've merged all the VCFs together; otherwise there will be missing calls where there should be reference alleles. Is there a way to get vg call to perform a variant call on multiple samples at once, or is that for a different tool such as vg genotype?
We don't presently have a way to call or genotype multiple samples at once in vg. You'd have to project to BAM then use another tool.
For your previous comment about the different VCF sizes, are you running vg call once for each, i.e. vg call -S Llute_Wodjil, vg call -S Llute_Paris, etc.? That would be the only way to get a fair comparison across samples.
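A minimal sketch of that loop (graph and pack file names are placeholders):
# one vg call per reference-sense sample, selecting its paths with -S
for ref in Llute_Wodjil Llute_Paris; do
    vg call graph.gbz -k ${sample}.pack -S ${ref} > ${sample}.${ref}.vcf
done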
If you're doing that and still seeing a big difference, it could be your other samples did not align very well into the graph. You can use vg paths -E (and I think minigraph-cactus outputs some stats files to this effect too) to see how much of each sample actually made it into your final graph.
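For example, a rough sketch that sums path lengths per sample, assuming -E prints one name and length per line and the paths use PanSN names of the form SAMPLE#HAPLOTYPE#CONTIG:
# total path length per sample
vg paths -x graph.gbz -E | awk -F'\t' '{split($1, a, "#"); tot[a[1]] += $2} END {for (s in tot) print s, tot[s]}'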
For the gVCF, if you run call with -a then combine with bcftools merge -0, you should get reference calls in the final output.
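A sketch of that combination (file names are placeholders; bcftools merge needs bgzipped, indexed input):
# -a makes vg call genotype every snarl, including reference calls
vg call graph.gbz -k ${sample}.pack -a -s ${sample} | bgzip > ${sample}.vcf.gz
tabix ${sample}.vcf.gz
# -0 treats missing genotypes as reference in the merged, multi-sample VCF
bcftools merge -0 sample1.vcf.gz sample2.vcf.gz > merged.vcf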
Hi @glennhickey
Thanks. Adding -a and then merging with -0 is working to bring all the samples together properly.
I was wondering, is it possible to iteratively align WGS reads to the graph until the final graph contains all samples as their own paths? I was considering whether this could be used in conjunction with ODGI PAV to get presence/absence across a huge number of haplotypes and aligned WGS samples. Usually I use PGGB to generate a graph and ODGI PAV to create a PAV matrix across a relatively small subset, and I wondered: if I aligned hundreds of samples to the graph, could I then get a PAV matrix based on the aligned samples as well? That would be much more comprehensive across a population.
Also, is it possible to infer a phylogenetic tree from the final pangenome graph with hundreds of samples aligned? I usually run mashtree across all my assembled genomes of pseudomolecule quality to get a Newick tree, but I also wondered, if I aligned WGS data to the graph, is there a tool which can infer phylogeny from the graph alone? Thanks.
The only effective way I'm aware of right now to make pangenome graphs from WGS is in one shot from VCF with vg construct.
For what you're doing, I think you'd either need to get GAF support added to odgi pav, or work with VCF (i.e. from vg call/deconstruct) and build your matrix from that.
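For reference, a minimal vg construct sketch with placeholder file names:
# build a graph from a linear reference FASTA plus a tabix-indexed VCF of variants
vg construct -r reference.fa -v variants.vcf.gz > graph.vg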
Thanks. I remember trying to generate a graph from VCF once before but found that my chromosome lengths were too long. I have to use CSI indexing for my genomes. Has this been updated in the latest vg toolkit build?
Unfortunately, it doesn't look like it, as the issues remain open.
https://github.com/vcflib/vcflib/issues/367 https://github.com/vgteam/vg/issues/3864
Ok thanks. Hopefully it's implemented at some point.
I'm currently looking at comparing results between PGGB and MC, specifically how they compare on identified variants and on aligning samples with giraffe. My pangenome in this case is relatively small (7 lupin accessions), and I believe vg autoindex will likely complete, unlike my previous attempts with much larger pangenomes (~20 barley accessions).
The d2.gbz outputs from MC, how are they generated? I assume clipping is done so that nodes have a minimum depth of 2, to avoid edge cases. Which tools and commands are used to clip out these edge cases?
I'd like to generate similar d2.gbz output and other necessary files using my PGGB graph and vg autoindex.
A few things:
- The recommended way to use giraffe is no longer to use the .dX filtered graphs. It's to use the --haplo clip --giraffe -clip options, then use the resulting .hapl index to make a personalized pangenome. There's a preprint about that here: https://www.biorxiv.org/content/10.1101/2023.12.13.571553v2. I really need to update the MC documentation accordingly -- but it all works fine in the latest Cactus release.
- One caveat (which also applies to the .hapl methodology above) is that the tools used in MC are dependent on the graph being linearized on a reference genome. PGGB graphs don't have this property, and therefore don't work as well with these tools. I think a major issue was that even after filtering, the PGGB graphs were too complex for giraffe's distance index to be effective.
- vg clip -d is used for the depth filtering. The rest you could probably figure out by looking at the MC logs or code. Or even better, just try running cactus-graphmap-join on your PGGB graph.
Hi @glennhickey
Thanks. I've tried generating a .hapl file using those parameters, but I get argument errors. When I use --haplo clip --giraffe --clip, it complains that --clip needs 1 argument:
[--badWorker BADWORKER]
[--badWorkerFailInterval BADWORKERFAILINTERVAL]
--vg VG [VG ...] [--hal HAL [HAL ...]] --outDir
OUTDIR --outName OUTNAME --reference REFERENCE
[REFERENCE ...] [--clip CLIP] [--filter FILTER]
[--gfa [GFA ...]] [--gbz [GBZ ...]]
[--odgi [ODGI ...]] [--viz [VIZ ...]]
[--draw [DRAW ...]] [--chrom-vg [CHROM_VG ...]]
[--chrom-og [CHROM_OG ...]] [--vcf [VCF ...]]
[--vcfReference VCFREFERENCE [VCFREFERENCE ...]]
[--vcfbub VCFBUB] [--giraffe [GIRAFFE ...]]
[--haplo [HAPLO ...]] [--indexCores INDEXCORES]
[--indexMemory INDEXMEMORY] [--chop [CHOP]]
[--configFile CONFIGFILE] [--latest]
[--containerImage CONTAINERIMAGE]
[--binariesMode {docker,local,singularity}]
jobStore
cactus-graphmap-join: error: argument --clip: expected one argument
I simply reran cactus-graphmap-join from my previous run and output to a different directory than before.
cactus-graphmap-join /cactus/jobStore --indexCores 2 --vg ${original_folder}/lupin-pg/*.vg --hal ${original_folder}/lupin-pg/*.hal --outDir ${original_folder}/lupin-pg_clip --outName lupin-pg_clipped --reference ${REFERENCE} --haplo clip --giraffe --clip clip --chrom-vg --chrom-og --gbz --gfa --viz --draw --disableCaching --workDir=/cactus/workDir --clean always --cleanWorkDir always --defaultDisk 1000G --maxDisk 1000G --maxCores 32 --maxMemory 126G --defaultMemory 126G
Once I get a .hapl file, how do I then go on to create a personalised pangenome? Do I simply convert to a GFA and use it with giraffe and other mapping tools like before? Thanks.
Sorry, that's a typo on my part. You make the .hapl index by adding --haplo clip --giraffe clip. You use the .hapl index as described here -- see the "Giraffe integration (requires vg version 1.51.0)" instructions.
There is also an example of doing all this in this section of the tutorial.
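For reference, a sketch of your earlier command with just the flags corrected (everything else unchanged):
cactus-graphmap-join /cactus/jobStore --indexCores 2 \
    --vg ${original_folder}/lupin-pg/*.vg --hal ${original_folder}/lupin-pg/*.hal \
    --outDir ${original_folder}/lupin-pg_clip --outName lupin-pg_clipped \
    --reference ${REFERENCE} --haplo clip --giraffe clip \
    --chrom-vg --chrom-og --gbz --gfa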
Hi @glennhickey
Thanks. From this workflow it looks like I need to create a personalised pangenome for each sample.
Could I create a personalised pangenome from all the reads at once using kmc, or would I have to do k-mer counting for each read set prior to mapping the reads? I have around 400 samples, R1 and R2. Thanks.
Yeah, the idea is to make a personalized pangenome for each sample. As shown in that preprint, it doesn't affect the overall running time much: the cost of making the personal pangenome is recovered by faster mapping time.
Not that there's anything preventing you from making a personal pangenome for a set of samples together -- it's just that we haven't really benchmarked this or looked at the best parameters, and I wouldn't expect it to work as well as the per-sample approach.
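A rough per-sample loop following the commands elsewhere in this thread (file and sample names are placeholders; each ${s}.reads.txt lists that sample's R1/R2 FASTQ paths, one per line):
# one k-mer count and one haplotype-sampled mapping per sample
for s in sample1 sample2; do
    kmc -k29 -m128 -okff -t16 @${s}.reads.txt ${s} kmc_tmp/
    vg giraffe -p -t 16 -Z graph.gbz --haplotype-name graph.hapl \
        --kff-name ${s}.kff -N ${s} -f ${s}_1.fq.gz -f ${s}_2.fq.gz > ${s}.gam
done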
Thanks. I'll stick to a personalised pangenome per sample.
Is there a way to generate a .hapl file from the GBZ file? I've been using a more up-to-date vg toolkit than the one that comes with MC.
I've been getting the errors below with giraffe, and read in another issue that it would be best to regenerate the min and dist indexes from the GBZ prior to running giraffe. I'd imagine it would be best to do the same for the .hapl files as well.
warning[vg::giraffe]: Refusing to perform too-large rescue alignment of 150 bp against 12776 bp ordered subgraph for read V350045396L1C001R0070424179/1 which would use more than 1572864 cells and might exhaust Dozeu's allocator; suppressing further warnings.
Mapped 28606922 reads across 32 threads in 438.642 seconds with 0.983682 additional single-threaded seconds.
You can generally ignore most giraffe warnings, including that one. Cactus includes the latest vg release. .hapl generation is covered in the wiki linked to above.
Hi @glennhickey
Just looking over the workflow for the personalised pangenome: is the aim to use the .hapl file and the k-mer counts with giraffe on the unfiltered pangenome graph (GBZ)? Or do I use giraffe (-Z parameter) with the personalised pangenome graph output by a previous run of giraffe?
For example, there's one example which uses manual sampling and another which integrates with giraffe, and the two seem a bit different. Which do I use? I assume for both I need to run vg haplotypes -v 2 -t 32 -d graph.dist -H graph.hapl graph.gbz beforehand.
Manual sampling, temporary graph:
export TMPDIR=/scratch/tmp
kmc -k29 -m128 -okff -t16 -hp sample.fq.gz ${TMPDIR}/sample $TMPDIR
vg haplotypes -v 2 -t 16 --include-reference \
-i graph.hapl -k ${TMPDIR}/sample.kff -g ${TMPDIR}/sampled.gbz graph.gbz
vg giraffe -p -t 16 -Z ${TMPDIR}/sampled.gbz -i -f sample.fq.gz > sample.gam
Giraffe integration (requires vg version 1.51.0):
export TMPDIR=/scratch/tmp
kmc -k29 -m128 -okff -t16 -hp sample.fq.gz ${TMPDIR}/sample $TMPDIR
vg giraffe -p -t 16 -Z graph.gbz --haplotype-name graph.hapl --kmer-name ${TMPDIR}/sample.kff \
-N sample -i -f sample.fq.gz > sample.gam
What does the -i parameter do? I can't see it in the usage list.
Once I have the resulting GAM file from the giraffe integration approach above, I want to augment the genome graph with the GAM file and call variants on it.
I usually do this:
# convert the GBZ to a mutable graph format for augmentation
srun -n 1 singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg convert graph.gbz -p > graph.pg
# augment the graph with the read alignments, filtering low-quality edits
srun -n 1 singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg augment graph.pg ${sample}.gam -m 4 -q 5 -Q 5 -A ${sample}.aug.gam > ${sample}.aug.pg
# compute snarls and read support on the augmented graph
srun -n 1 singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg snarls ${sample}.aug.pg > ${sample}.aug.snarls
srun -n 1 singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg pack -x ${sample}.aug.pg -g ${sample}.aug.gam -o ${sample}.aug.pack
# collect every path name as a repeated -p argument for vg call
reference_paths=$(srun -n 1 singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg paths -x ${sample}.aug.pg -L | cut -d '[' -f 1 | sort | uniq | tr '\n' ' ' | sed 's/ / -p /g' | sed 's/ -p $/\n/g')
# call variants against all paths, then compress, index, and keep PASS calls
srun -n 1 singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg call ${sample}.aug.pg -r ${sample}.aug.snarls -k ${sample}.aug.pack -a -p ${reference_paths} -s ${sample} > ${sample}.aug.vcf
srun -n 1 singularity exec --bind ${PWD}:${PWD} ${TABIX_IMAGE} bgzip -@ 16 ${sample}.aug.vcf
srun -n 1 singularity exec --bind ${PWD}:${PWD} ${TABIX_IMAGE} tabix ${sample}.aug.vcf.gz
srun -n 1 singularity exec --bind ${PWD}:${PWD} ${BCFTOOLS_IMAGE} bcftools view -f PASS ${sample}.aug.vcf.gz > ${sample}.aug.final.vcf
srun -n 1 singularity exec --bind ${PWD}:${PWD} ${TABIX_IMAGE} bgzip -@ 16 ${sample}.aug.final.vcf
srun -n 1 singularity exec --bind ${PWD}:${PWD} ${TABIX_IMAGE} tabix ${sample}.aug.final.vcf.gz
When augmenting with the GAM file, should the GBZ be the personalised GBZ or the unfiltered GBZ? If the personalised GBZ, should I take the manual sampling approach to create a temporary graph ${sample}.gbz, instead of using the giraffe integration example above?
Hi @glennhickey, I've been getting an error about the r-index when running giraffe:
Loading GBZ from /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.gbz
Generating haplotype information
Loading distance index from /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.dist
Building minimizer index
Built the minimizer index in 109.653 seconds
Guessing that r-index is /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.ri
Loading r-index from /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.ri
error[VPKG::load_one]: Could not open /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.ri while loading gbwt::FastLocate
srun: error: node-5: task 0: Exited with exit code 1
This is the command that gives me the error, when trying to generate the HAPL file:
vg haplotypes -v 2 -t 32 -d ${DIST_file} -H ${HAPL_file} ${GBZ_file}
I need the HAPL file so I can run:
vg giraffe -p -t 32 -Z ${GBZ_file} --haplotype-name ${HAPL_file} --kff-name ${sample}.kff -N ${sample} -f ${sample}_1.trimmed.fq -f ${sample}_2.trimmed.fq > ${sample}.gam
It looks like you don't have the r-index file. You can build it like this:
vg gbwt -p --num-threads 16 -r graph.ri -Z graph.gbz
Thanks @jltsiren
Hi @glennhickey and @jltsiren
I've managed to get further along, but now get this error:
Loading GBZ from /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.gbz
Generating haplotype information
Loading distance index from /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.dist
Building minimizer index
Built the minimizer index in 87.2338 seconds
Guessing that r-index is /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.ri
Loading r-index from /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.ri
Determining construction jobs
error: [vg haplotypes] HaplotypePartitioner::partition_haplotypes(): there are 49 top-level chains and 26 weakly connected components; haplotype sampling cannot be used with this graph
This is my workflow up until giraffe:
# build the distance index
vg index ${GBZ_file} -j ${DIST_file}
# build the r-index
vg gbwt -p --num-threads 32 -r ${RINDEX_file} -Z ${GBZ_file}
# generate the haplotype information
vg haplotypes -v 2 -t 32 -d ${DIST_file} -H ${HAPL_file} ${GBZ_file}
# count k-mers in the reads
kmc -k29 -m128 -okff -t32 @${sample}.reads.txt ${sample} ${PWD}/kmc_tmp/
# map with haplotype sampling
vg giraffe -p -t 32 -Z ${GBZ_file} --haplotype-name ${HAPL_file} --kff-name ${sample}.kff -N ${sample} -f ${sample}_1.trimmed.fq -f ${sample}_2.trimmed.fq > ${sample}.gam
Haplotype sampling expects a graph with a linear high-level structure. The error message indicates that your graph lacks that structure. There are 26 weakly connected components, which should correspond to chromosomes. On the other hand, there are 49 top-level chains, which are the linear structures haplotype sampling uses for generating haplotypes. Because the numbers don't match, many of the internal assumptions don't hold and the algorithm doesn't work.
If you want to use haplotype sampling, you need to rebuild the graph so that the chromosomes / other contigs are clearly separated and it's possible to find end-to-end paths over each contig.
Hi @jltsiren
Thanks. I have 26 chromosomes for each of the 7 genomes; there are no unplaced contigs, only the 26 chromosomes for each of the 7. I ran all of these together with minigraph-cactus, and then ran the join method at the end, which merges all chromosomes together. Perhaps I should run the mapping stage on each chromosome manually first and then join afterwards; I've done this for much larger pangenomes, but because this genome is much smaller, I ran it all together.
If I rebuild the graph, what exactly should I be changing? The only thing I can see that could change is the join stage:
srun -n 1 singularity exec --cleanenv \
--no-home \
--overlay ${JOBSTORE_IMAGE} \
--bind ${CACTUS_SCRATCH}/tmp:/tmp \
${CACTUS_IMAGE} cactus-graphmap-join /cactus/jobStore --indexCores 2 --vg ${original_folder}/lupin-pg/*_Chr*.vg --hal ${original_folder}/lupin-pg/*_Chr*.hal --outDir ${original_folder}/lupin-pg --outName lupin-pg --reference ${REFERENCE} --giraffe --chrom-vg --chrom-og --gbz --gfa --viz --draw --disableCaching --workDir=/cactus/workDir --clean always --cleanWorkDir always --defaultDisk 1000G --maxDisk 1000G --maxCores 32 --maxMemory 126G --defaultMemory 126G
Also worth noting: with --reference I set each path in the graph as a reference, so when calling variants I get variant calls against each path, and not just one of the genomes.
The output of minigraph-cactus should always be compatible with vg haplotypes. So I think this is either a bug or you are using an old version of Cactus. Which version are you using? Note that there is also the --haplo option to have MC make the haplotypes index for you.
Hi @glennhickey
The version I have is from November last year.
I'll try with the --haplo parameter. If all else fails, I'll update to the latest version and rerun from the beginning.
Hi @glennhickey
Should I use --haplo clip --giraffe clip if I'm planning on doing haplotype sampling, or --haplo full --giraffe full? Or doesn't it matter?
You want to use clip. I just added a bit of documentation here:
https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md#haplotype-sampling-instead-of-filtering-new
Hi
I'm using vg tools v1.50.1 "Monopoli"
I've produced a graph from 7 different genomes of the same species using minigraph-cactus.
I've used the indexes produced from minigraph-cactus with the --giraffe parameter and obtained the .d2.gbz index. I then map WGS reads from a different species entirely using giraffe, following the tutorial here: https://github.com/vgteam/vg/wiki/SV-Genotyping-and-variant-calling
I've found the resulting VCF, whether I use an augmented graph or not, still gives the calls against the reference I used when I ran minigraph-cactus, and none of the other 6 haplotypes my reads would have also mapped to.
What I want is for the VCF to contain variant calls against all the different paths in the graph, not just the reference I chose -- variant calls against all 7 haplotypes. I want to use the graph as a single point of call for variation across the pangenome, yet I still get variant calls against only 1 of the 7 genomes (the reference used in minigraph-cactus). Am I missing something, a particular step, or is this expected when using minigraph-cactus? Should I use PGGB if I want calls against all haplotypes, since PGGB maps all haplotypes against all haplotypes and has no single reference?
Thanks.