vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Can vg combine a graph pangenome with a VCF file obtained from GATK? #3877

Open abcyulongwang opened 1 year ago

abcyulongwang commented 1 year ago

Hello! I have built three graph pangenomes with three tools: Minigraph, Minigraph-Cactus, and PGGB, covering 12 mammalian species. I also have nearly 500 next-generation sequencing datasets for this species. First, can I directly use the output of one of these three tools as input to vg, such as the “primates-pg.vcf.gz” or “primates.gfa.gz” file? If so, which tool's output do you recommend? Second, I used GATK to call variants from the next-generation sequencing data. Can vg combine those VCF files with the graph pangenome? Best, yulong

jeizenga commented 1 year ago

For short read mapping, you'll probably have the most luck with the Minigraph-Cactus graph. Minigraph alone doesn't provide detailed base-resolution alignments, and PGGB graphs have more complicated topologies that we haven't found robust methods to index yet.

I'm not aware of any mature methods that merge variants from a VCF into an existing pangenome graph.
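As background for the recommendation above (not from the thread itself), indexing a Minigraph-Cactus GFA for short-read mapping is typically done with `vg autoindex`; here is a minimal sketch, assuming hypothetical filenames (`primates.gfa.gz`, `reads_1.fq.gz`) and the flag names from `vg autoindex` / `vg giraffe` help text:

```shell
# Hypothetical sketch: index a Minigraph-Cactus GFA for short-read mapping.
# Filenames are placeholders; --workflow giraffe builds vg giraffe indexes.
vg autoindex \
    --workflow giraffe \
    --gfa primates.gfa.gz \
    --prefix primates \
    --threads 16

# Then map paired-end short reads against the resulting indexes.
vg giraffe -Z primates.giraffe.gbz -m primates.min -d primates.dist \
    -f reads_1.fq.gz -f reads_2.fq.gz > sample.gam
```

The `--prefix` value determines the names of the emitted index files that `vg giraffe` consumes.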

abcyulongwang commented 1 year ago

Thank you very much for your reply; your suggestion coincides with our approach! But I ran into a new problem. Our "primates-pg.vcf.gz" is about 930 MB, and the reference genome is 2.5 Gb. We ran "vg autoindex --workflow mpmap" to build the index for pan-transcriptome construction.


But it has now been running for two days. Would you suggest splitting the genome by chromosome and building the indexes separately? If so, how can I merge the per-chromosome pangenome index files at the end?

Best yulong
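For context (an illustration, not taken from the thread), the `mpmap` workflow described above typically takes a linear reference FASTA, a VCF, and a transcript annotation; a hedged sketch with placeholder filenames (`primates.fa`, `primates.gtf`):

```shell
# Hypothetical sketch of the mpmap indexing run described above.
# primates.fa, primates-pg.vcf.gz, and primates.gtf are placeholder names.
vg autoindex \
    --workflow mpmap \
    --ref-fasta primates.fa \
    --vcf primates-pg.vcf.gz \
    --tx-gff primates.gtf \
    --prefix primates \
    --threads 32
```

The transcript annotation (`--tx-gff`) is what makes this a pan-transcriptome index usable by `vg mpmap` rather than a plain genome index.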

jeizenga commented 1 year ago

This is commonly the longest step of the vg map and vg mpmap indexing pipelines. I don't think we have tooling to combine separate GCSA indexes, though. Regardless, the existing code is already heavily multithreaded, so splitting this step across multiple processes wouldn't help; you could only conceivably speed it up by running it over multiple nodes.

In my experience, GCSA indexing is often bottlenecked by disk IO. The indexing algorithm performs several repeated disk-backed steps that can each generate tens to hundreds of GB of IO. If your compute environment has slow IO, it might be worthwhile to try a different one (e.g. one with solid-state drives rather than spinning disks).

At 2 days, that's getting long, but I would probably recommend being patient a little longer. If you want, you can try running with --verbosity 2, which will trigger more logging to stderr from the GCSA indexing algorithm. That might give you a better sense of whether it's making progress or not.
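A sketch combining both suggestions: point temporary files at fast local storage and raise verbosity so the GCSA progress messages land in a log file. The `--tmp-dir` flag and all filenames here are assumptions/placeholders, not something stated in the thread:

```shell
# Hypothetical: rerun with temp files on fast local storage and the extra
# GCSA logging captured to a file for inspection.
vg autoindex \
    --workflow mpmap \
    --ref-fasta primates.fa \
    --vcf primates-pg.vcf.gz \
    --tx-gff primates.gtf \
    --prefix primates \
    --tmp-dir /local/ssd/tmp \
    --verbosity 2 \
    2> autoindex.log

# Watch progress as it runs:
tail -f autoindex.log
```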

abcyulongwang commented 1 year ago

Thanks for your reply, I will take your suggestion into consideration!