vgteam / vg

tools for working with genome variation graphs

Issue with variant calls from minigraph-cactus graph #4062

Open brettChapman opened 1 year ago

brettChapman commented 1 year ago

Hi

I'm using vg tools v1.50.1 "Monopoli"

I've produced a graph from 7 different genomes of the same species using minigraph-cactus.

I've used the indexes produced from minigraph-cactus with the `--giraffe` parameter and obtained the `.d2.gbz` index.

I then map WGS reads from a different species entirely using giraffe, and follow the tutorial here: https://github.com/vgteam/vg/wiki/SV-Genotyping-and-variant-calling

I've found that the resulting VCF, whether I use an augmented graph or not, still gives calls against the reference I used when I ran minigraph-cactus, and none against the other 6 haplotypes my reads would also have mapped to.

What I want is for the VCF to contain variant calls against all the different paths in the graph, not just the reference I chose, i.e. variant calls against all 7 haplotypes in the graph. I want to use the graph as a single point of call for variation across the pangenome, yet I still get variant calls against only 1 of the 7 genomes (the reference used in minigraph-cactus). Am I missing something, a particular step, or is this expected when using minigraph-cactus? Should I use PGGB instead if I want calls against all haplotypes, since PGGB aligns all haplotypes against all haplotypes and has no single reference?

Thanks.

glennhickey commented 1 year ago

vg has a concept of "path sense", where paths can be "REFERENCE" paths or "HAPLOTYPE" paths.

This comes from the GBZ file format, where HAPLOTYPE paths can be stored extremely efficiently (you can have thousands) but this comes at the cost of not supporting all types of queries on them.

Minigraph-cactus follows this convention in that only samples you specify with --reference end up as REFERENCE paths. Consequently, vg call can only call variants against these REFERENCE paths.

Now if you only have 7 genomes, you can just set them all as REFERENCE and it'll probably be okay. The easiest way to do this is probably to take the output GFA from minigraph-cactus, and change the "RS" tag in the first line to include all your samples (separated by spaces):

H   VN:Z:1.1    RS:Z:SAMPLE1 SAMPLE2 SAMPLE3 ...
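
For example, with GNU sed (a minimal sketch that assumes the H line is the first line of an uncompressed GFA; the sample names are placeholders for your own):

# Overwrite the RS tag on line 1 with a space-separated list of all samples
sed -i '1s/RS:Z:[^\t]*/RS:Z:SAMPLE1 SAMPLE2 SAMPLE3/' graph.gfa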

From there you can make a new .gbz with

vg gbwt -G gfa-with-new-header.gfa --gbz-format -g new.gbz

And then use that with vg pack / call (you can re-use your GAM).

The new .gbz will be the same as the old one, except the path sense will be changed to REFERENCE for everything.

brettChapman commented 1 year ago

Hi @glennhickey

Thanks, that helped a lot. I decided just to run again and provide --reference L1_v1 L2_v1 L3_v1 L4_v1 YL_28698_v1 YL_28815_v1 YL_29181_v1 with all the haplotypes I'm looking at. After filtering out any variants which didn't pass the filter, I found the final variant calls are heavily biased towards the first haplotype, which is the original reference I used. Does this mean any variant calls on the other haplotypes are unique to those haplotypes only? How should I interpret variant calls from a pangenome graph? I used the default mapping approach, which uses mash distance to determine alignment order.

I did a uniq -c to count the number of calls per chromosome/haplotype.
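
Roughly this command (VCF name illustrative), with its output shown below:

# Count VCF records per chromosome (column 1), skipping header lines
zcat sample.vcf.gz | grep -v '^#' | cut -f 1 | sort | uniq -c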

   633 Llute_Paris_Chr17
      7 Llute_Paris_Chr18
      7 Llute_Paris_Chr19
     69 Llute_Paris_Chr20
     10 Llute_Paris_Chr21
     24 Llute_Paris_Chr24
      4 Llute_Paris_Chr25
      3 Llute_Paris_Chr26
 269812 Llute_Wodjil_Chr01
 347052 Llute_Wodjil_Chr02
 342285 Llute_Wodjil_Chr03
 310478 Llute_Wodjil_Chr04
 351581 Llute_Wodjil_Chr05
 293075 Llute_Wodjil_Chr06

L1_v1 is Llute_Wodjil and L2_v1 is Llute_Paris

You can see Llute_Wodjil is heavily biased. Is this expected? This is for only one sample; I'll be aligning hundreds more samples and then creating a multi-sample VCF. Thanks.

glennhickey commented 1 year ago

The first reference does get special treatment, and there will indeed be a bit of bias towards it.

But I'm not quite sure what you're doing and what you're expecting. To get full calls on all samples, I think you will need to run vg call once for each sample. If you are just running vg call once, I think it will just take one sample per bubble.

brettChapman commented 12 months ago

Hi @glennhickey, sorry it's taken a while to get back to this.

I've been aligning each sample to the graph, then calling variants to produce a VCF for each sample. I then merge the VCF files to produce a single VCF with multiple samples.

For each of the reference/reference-sense paths in the graph, I've split the VCF into different references. You can see that there is a huge difference between different references, with Llute_Wodjil (the first reference) heavily biased:

-rw-rw-r--  1 ubuntu ubuntu   88M Dec  7 03:42 Llute_27849_YL_pangenome.vcf
-rw-rw-r--  1 ubuntu ubuntu   63M Dec  7 03:59 Llute_28213_YL_pangenome.vcf
-rw-rw-r--  1 ubuntu ubuntu  6.9M Dec  7 04:17 Llute_28698_YL_pangenome.vcf
-rw-rw-r--  1 ubuntu ubuntu   29M Dec  7 04:34 Llute_28815_YL_pangenome.vcf
-rw-rw-r--  1 ubuntu ubuntu  2.2M Dec  7 04:52 Llute_29181_YL_pangenome.vcf
-rw-rw-r--  1 ubuntu ubuntu  128M Dec  7 05:09 Llute_Paris_YL_pangenome.vcf
-rw-rw-r--  1 ubuntu ubuntu  8.9G Dec  7 05:38 Llute_Wodjil_YL_pangenome.vcf

Does this mean that, with Llute_Wodjil set as the first reference, variants identified in one reference-sense path are completely unique to it, and any variants identified on, say, Llute_Wodjil would be a kind of consensus variant (found in all paths, or only on Llute_Wodjil)? I'm trying to understand how these VCFs would differ if I were to align to each reference using conventional approaches instead, such as BWA-MEM followed by GATK. In this way, could I consider variants identified against Llute_Wodjil as core pangenome sequence regions, and anything aligning to the other reference-sense paths as non-core shell regions found in only a subset? Thanks.

brettChapman commented 11 months ago

Hi @glennhickey

Ideally I would like a multi-sample VCF file. I've realised that to really look at population structure we'll need a gVCF, so that the merged VCF contains reference alleles; otherwise there will be missing calls where there should be reference alleles. Is there a way to get vg call to perform a variant call on multiple samples at once, or is that for a different tool such as vg genotype?

glennhickey commented 11 months ago

We don't presently have a way to call or genotype multiple samples at once in vg. You'd have to project to BAM then use another tool.

For your previous comment about the different VCF sizes, are you running vg call once for each, i.e. vg call -S Llute_Wodjil, then vg call -S Llute_Paris, etc.? That would be the only way to get a fair comparison across samples.
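
For example, a rough sketch of that per-sample loop (graph, pack, and output names are illustrative):

# One vg call per reference sample, reusing the same pack file
for s in Llute_Wodjil Llute_Paris; do
    vg call graph.gbz -k sample.pack -S "$s" > "${s}.vcf"
done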

If you're doing that and still seeing a big difference, it could be that your other samples did not align very well into the graph. You can use vg paths -E (and I think minigraph-cactus outputs some stats files to this effect too) to see how much of each sample actually made it into your final graph.
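
For example, something along these lines (a sketch, assuming tab-separated name/length output from vg paths -E and PanSN-style path names like SAMPLE#HAP#CONTIG):

# Sum path lengths per sample from "name<TAB>length" lines
vg paths -x graph.gbz -E | awk -F'\t' '{split($1, p, "#"); tot[p[1]] += $2} END {for (s in tot) print s, tot[s]}'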

For the gVCF, if you run call with -a then combine with bcftools merge -0, you should get reference calls in the final output.
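
A minimal sketch of that combination (file and sample names illustrative):

# -a genotypes every snarl, including reference calls, so sites line up across samples
vg call graph.gbz -k sample1.pack -a -s sample1 > sample1.vcf
bgzip sample1.vcf && tabix sample1.vcf.gz
# ...repeat per sample, then treat missing genotypes as 0/0 when merging:
bcftools merge -0 sample1.vcf.gz sample2.vcf.gz > merged.vcf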

brettChapman commented 11 months ago

Hi @glennhickey

Thanks. Adding -a and then merging with -0 is working to bring all the samples together properly.

I was wondering, is it possible to iteratively align WGS reads to the graph until the final graph contains all samples as their own paths? I was considering whether this could be used in conjunction with ODGI PAV to get presence/absence across a huge number of haplotypes and aligned WGS samples. Usually I have used PGGB to generate a graph and ODGI PAV to create a PAV matrix across a relatively small subset, and I wondered whether, by aligning hundreds of samples to the graph, I could then get a PAV matrix based on the aligned samples as well, which would be much more comprehensive across a population.

Also, is it possible to infer a phylogenetic tree from the final pangenome graph with hundreds of samples aligned? I usually run mashtree across all my assembled genomes of pseudomolecule quality to get a Newick tree, but I also wondered: if I aligned WGS data to the graph, is there a tool which can infer phylogeny from the graph alone? Thanks.

glennhickey commented 11 months ago

The only effective way I'm aware of right now to make pangenome graphs from WGS is in one shot from VCF with vg construct.

For what you're doing, I think you'd either need to try to get GAF support added to odgi pav, or work with VCF (i.e. from vg call/deconstruct) and build your matrix from that.
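
If you go the VCF route, one rough sketch of pulling out a genotype matrix (file names illustrative):

# One row per variant: CHROM, POS, then one GT column per sample
bcftools query -f '%CHROM\t%POS[\t%GT]\n' merged.vcf.gz > genotype_matrix.tsv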

brettChapman commented 11 months ago

Thanks. I remember trying to generate a graph from VCF once before, but found that my chromosome lengths were too long; I have to use CSI indexing for my genomes. Has this been updated in the latest vg toolkit build?

glennhickey commented 11 months ago

Unfortunately, it doesn't look like it, as the issues remain open.

https://github.com/vcflib/vcflib/issues/367 https://github.com/vgteam/vg/issues/3864

brettChapman commented 11 months ago

Ok thanks. Hopefully it's implemented at some point.

I'm currently looking at comparing results between PGGB and MC, specifically how they compare on identified variants and on aligning samples with giraffe. My pangenome in this case is relatively small (7 lupin accessions), and I believe vg autoindex will likely complete, unlike previous attempts with much larger pangenomes (~20 barley accessions).

The d2.gbz outputs from MC, how are they generated? I assume clipping is done so that nodes have a minimum depth of 2, to avoid edge cases. Which tools and commands are used to clip out these edge cases?

I'd like to generate similar d2.gbz output and other necessary files using my PGGB graph and vg autoindex.

glennhickey commented 11 months ago

A few things:

brettChapman commented 11 months ago

Hi @glennhickey

Thanks. I've tried generating a .hapl file using those parameters, but I get argument errors.

When I use --haplo clip --giraffe --clip, it complains that --clip expects one argument:

                         [--badWorker BADWORKER]
                            [--badWorkerFailInterval BADWORKERFAILINTERVAL]
                            --vg VG [VG ...] [--hal HAL [HAL ...]] --outDir
                            OUTDIR --outName OUTNAME --reference REFERENCE
                            [REFERENCE ...] [--clip CLIP] [--filter FILTER]
                            [--gfa [GFA ...]] [--gbz [GBZ ...]]
                            [--odgi [ODGI ...]] [--viz [VIZ ...]]
                            [--draw [DRAW ...]] [--chrom-vg [CHROM_VG ...]]
                            [--chrom-og [CHROM_OG ...]] [--vcf [VCF ...]]
                            [--vcfReference VCFREFERENCE [VCFREFERENCE ...]]
                            [--vcfbub VCFBUB] [--giraffe [GIRAFFE ...]]
                            [--haplo [HAPLO ...]] [--indexCores INDEXCORES]
                            [--indexMemory INDEXMEMORY] [--chop [CHOP]]
                            [--configFile CONFIGFILE] [--latest]
                            [--containerImage CONTAINERIMAGE]
                            [--binariesMode {docker,local,singularity}]
                            jobStore
cactus-graphmap-join: error: argument --clip: expected one argument

I simply reran cactus-graphmap-join from my previous run and output to a different directory than before.

cactus-graphmap-join /cactus/jobStore --indexCores 2 --vg ${original_folder}/lupin-pg/*.vg --hal ${original_folder}/lupin-pg/*.hal --outDir ${original_folder}/lupin-pg_clip --outName lupin-pg_clipped --reference ${REFERENCE} --haplo clip --giraffe --clip clip --chrom-vg --chrom-og --gbz --gfa --viz --draw --disableCaching --workDir=/cactus/workDir --clean always --cleanWorkDir always --defaultDisk 1000G --maxDisk 1000G --maxCores 32 --maxMemory 126G --defaultMemory 126G

Once I get a .hapl file, how do I then go on to create a personalised pangenome? Do I simply convert to a GFA and use it with giraffe and other mapping tools like before? Thanks.

glennhickey commented 11 months ago

Sorry that's a typo on my part. You make the .hapl index by adding --haplo clip --giraffe clip. You use the .hapl index as described here -- see "Giraffe integration (requires vg version 1.51.0)" instructions.

There is also an example of doing all this in this tutorial in this section.

brettChapman commented 11 months ago

Hi @glennhickey

Thanks. From this workflow it looks like for each sample I need to create a personalised pangenome.

Could I create a personalised pangenome from all the reads together using kmc, or would I have to do k-mer counting for each read set prior to mapping the reads? I have around 400 samples, each with R1 and R2. Thanks.

glennhickey commented 11 months ago

Yeah, the idea is to make a personalized genome for each sample. As shown in that preprint, it doesn't affect the overall running time much: the cost of making the personal pangenome is recovered by faster mapping time.

Not that there's anything preventing you from making a personal genome for a set of samples together -- it's just we haven't really benchmarked this nor looked at the best parameters, and I wouldn't expect it to work as well as the per-sample approach.

brettChapman commented 11 months ago

Thanks. I'll stick to a personalised pangenome per sample.

Is there a way to generate a .hapl file from the GBZ file? I've been using a more up-to-date vg toolkit than the one that comes with MC.

I've been getting these errors with giraffe (below), and read on another issue that it would be best to generate the min and dist indexes again from the GBZ prior to running giraffe. I'd imagine it would be best to do the same for the .hapl file as well.

warning[vg::giraffe]: Refusing to perform too-large rescue alignment of 150 bp against 12776 bp ordered subgraph for read V350045396L1C001R0070424179/1 which would use more than 1572864 cells and might exhaust Dozeu's allocator; suppressing further warnings.
Mapped 28606922 reads across 32 threads in 438.642 seconds with 0.983682 additional single-threaded seconds.
glennhickey commented 11 months ago

You can generally ignore most giraffe warnings, including that one. Cactus includes the latest vg release. .hapl generation is covered in the wiki linked to above.

brettChapman commented 11 months ago

Hi @glennhickey

Just looking over the workflow for the personalised pangenome: is the aim to use the .hapl file and the kmer counts with giraffe on the unfiltered pangenome graph (GBZ)? Or do I use giraffe (-Z parameter) with the personalised pangenome graph which is output by a previous run of giraffe?

For example, there's one example from manual sampling and another which integrates with giraffe, and the two seem a bit different. Which do I use? I assume for both I need to run vg haplotypes -v 2 -t 32 -d graph.dist -H graph.hapl graph.gbz beforehand.

Manual sampling, temporary graph:

export TMPDIR=/scratch/tmp
kmc -k29 -m128 -okff -t16 -hp sample.fq.gz ${TMPDIR}/sample $TMPDIR
vg haplotypes -v 2 -t 16 --include-reference \
    -i graph.hapl -k ${TMPDIR}/sample.kff -g ${TMPDIR}/sampled.gbz graph.gbz
vg giraffe -p -t 16 -Z ${TMPDIR}/sampled.gbz -i -f sample.fq.gz > sample.gam
Giraffe integration (requires vg version 1.51.0):

export TMPDIR=/scratch/tmp
kmc -k29 -m128 -okff -t16 -hp sample.fq.gz ${TMPDIR}/sample $TMPDIR
vg giraffe -p -t 16 -Z graph.gbz --haplotype-name graph.hapl --kmer-name ${TMPDIR}/sample.kff \
    -N sample -i -f sample.fq.gz > sample.gam

What does the -i parameter do? I can't see it in the usage list.

brettChapman commented 11 months ago

Once I have the resulting GAM file from the .hapl file via the Giraffe integration approach above, I want to augment the genome graph with the GAM file to call variants on.

I usually do this:

srun -n 1 singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg convert graph.gbz -p > graph.pg

srun -n 1 singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg augment graph.pg ${sample}.gam -m 4 -q 5 -Q 5 -A ${sample}.aug.gam > ${sample}.aug.pg

srun -n 1 singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg snarls ${sample}.aug.pg > ${sample}.aug.snarls

srun -n 1 singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg pack -x ${sample}.aug.pg -g ${sample}.aug.gam -o ${sample}.aug.pack

reference_paths=$(srun -n 1 singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg paths -x ${sample}.aug.pg -L | cut -d '[' -f 1 | sort | uniq | tr '\n' ' ' | sed 's/ / -p /g' | sed 's/ -p $/\n/g')

srun -n 1 singularity exec --bind ${PWD}:${PWD} ${VG_IMAGE} vg call ${sample}.aug.pg -r ${sample}.aug.snarls -k ${sample}.aug.pack -a -p ${reference_paths} -s ${sample} > ${sample}.aug.vcf

srun -n 1 singularity exec --bind ${PWD}:${PWD} ${TABIX_IMAGE} bgzip -@ 16 ${sample}.aug.vcf
srun -n 1 singularity exec --bind ${PWD}:${PWD} ${TABIX_IMAGE} tabix ${sample}.aug.vcf.gz

srun -n 1 singularity exec --bind ${PWD}:${PWD} ${BCFTOOLS_IMAGE} bcftools view -f PASS  ${sample}.aug.vcf.gz > ${sample}.aug.final.vcf

srun -n 1 singularity exec --bind ${PWD}:${PWD} ${TABIX_IMAGE} bgzip -@ 16 ${sample}.aug.final.vcf

srun -n 1 singularity exec --bind ${PWD}:${PWD} ${TABIX_IMAGE} tabix ${sample}.aug.final.vcf.gz

Should the GBZ file, when augmenting with the GAM file, be the personalised GBZ file or the unfiltered GBZ file? If the personalised GBZ file, should I take the manual sampling approach to create a temporary graph ${sample}.gbz, instead of using the Giraffe integration example above?

brettChapman commented 9 months ago

Hi @glennhickey, I've been getting an error about the r-index when running giraffe:

Loading GBZ from /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.gbz
Generating haplotype information
Loading distance index from /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.dist
Building minimizer index
Built the minimizer index in 109.653 seconds
Guessing that r-index is /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.ri
Loading r-index from /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.ri
error[VPKG::load_one]: Could not open /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.ri while loading gbwt::FastLocate
srun: error: node-5: task 0: Exited with exit code 1

I run this command, which gives me the error, when trying to generate the HAPL file:

vg haplotypes -v 2 -t 32 -d ${DIST_file} -H ${HAPL_file} ${GBZ_file}

I need the HAPL file so I can run:

vg giraffe -p -t 32 -Z ${GBZ_file} --haplotype-name ${HAPL_file} --kff-name ${sample}.kff -N ${sample} -f ${sample}_1.trimmed.fq -f ${sample}_2.trimmed.fq > ${sample}.gam
jltsiren commented 9 months ago

It looks like you don't have the r-index file. You can build it like this:

vg gbwt -p --num-threads 16 -r graph.ri -Z graph.gbz
brettChapman commented 9 months ago

Thanks @jltsiren

brettChapman commented 9 months ago

Hi @glennhickey and @jltsiren

I've managed to get further along, but now get this error:

Loading GBZ from /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.gbz
Generating haplotype information
Loading distance index from /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.dist
Building minimizer index
Built the minimizer index in 87.2338 seconds
Guessing that r-index is /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.ri
Loading r-index from /data/lupin_pangenome/minigraph-cactus_run/lupin-pg/lupin-pg.ri
Determining construction jobs
error: [vg haplotypes] HaplotypePartitioner::partition_haplotypes(): there are 49 top-level chains and 26 weakly connected components; haplotype sampling cannot be used with this graph

This is my workflow up until giraffe:

vg index ${GBZ_file} -j ${DIST_file}
vg gbwt -p --num-threads 32 -r ${RINDEX_file} -Z ${GBZ_file}
vg haplotypes -v 2 -t 32 -d ${DIST_file} -H ${HAPL_file} ${GBZ_file}

kmc -k29 -m128 -okff -t32 @${sample}.reads.txt ${sample} ${PWD}/kmc_tmp/
vg giraffe -p -t 32 -Z ${GBZ_file} --haplotype-name ${HAPL_file} --kff-name ${sample}.kff -N ${sample} -f ${sample}_1.trimmed.fq -f ${sample}_2.trimmed.fq > ${sample}.gam
jltsiren commented 9 months ago

Haplotype sampling expects a graph with a linear high-level structure. The error message indicates that your graph lacks that structure. There are 26 weakly connected components, which should correspond to chromosomes. On the other hand, there are 49 top-level chains, which are the linear structures haplotype sampling uses for generating haplotypes. Because the numbers don't match, many of the internal assumptions don't hold and the algorithm doesn't work.

If you want to use haplotype sampling, you need to rebuild the graph so that the chromosomes / other contigs are clearly separated and it's possible to find end-to-end paths over each contig.

brettChapman commented 9 months ago

Hi @jltsiren

Thanks. I have 26 chromosomes for each of the 7 genomes; there are no contigs, only the 26 chromosomes in each. I ran all of these together with minigraph-cactus, and then ran the join method at the end, which merges all chromosomes together. Perhaps I should run the mapping stage on each chromosome manually first and then join afterwards. I've done this for much larger pangenomes, but because this genome is much smaller, I ran it all together.

If I rebuild the graph, what exactly should I be changing? The only thing I can see which could change is the join stage:

srun -n 1 singularity exec --cleanenv \
                       --no-home \
                       --overlay ${JOBSTORE_IMAGE} \
                       --bind ${CACTUS_SCRATCH}/tmp:/tmp \
                       ${CACTUS_IMAGE} cactus-graphmap-join /cactus/jobStore --indexCores 2 --vg ${original_folder}/lupin-pg/*_Chr*.vg --hal ${original_folder}/lupin-pg/*_Chr*.hal --outDir ${original_folder}/lupin-pg --outName lupin-pg --reference ${REFERENCE} --giraffe --chrom-vg --chrom-og --gbz --gfa --viz --draw --disableCaching --workDir=/cactus/workDir --clean always --cleanWorkDir always --defaultDisk 1000G --maxDisk 1000G --maxCores 32 --maxMemory 126G --defaultMemory 126G
brettChapman commented 9 months ago

Also worth noting: with --reference I have set each path in the graph as a reference, so when calling variants I get variant calls against each path, and not just one of the genomes.

glennhickey commented 9 months ago

The output of minigraph-cactus should always be compatible with vg haplotypes. So I think this is either a bug, or you are using an old version of Cactus. Which version are you using? Note that there is also the --haplo option to have MC make the haplotypes index for you.

brettChapman commented 9 months ago

Hi @glennhickey

The version I have is from November last year.

I'll try with the --haplo parameter. If all else fails, I'll update to the latest version and rerun from the beginning.

brettChapman commented 9 months ago

Hi @glennhickey

Should I use --haplo clip --giraffe clip if I'm planning on doing haplotype sampling? Or --haplo full --giraffe full? Or doesn't it matter?

glennhickey commented 9 months ago

You want to use clip. I just added a bit of documentation here: https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md#haplotype-sampling-instead-of-filtering-new