Vcf output - Githubissues

khodor14 commented 2 years ago

If I'm building a graph of n genomes of E. coli, is there an option in pandora to get the VCF of these genomes?

leoisl commented 2 years ago

Hello Khodor,

Yes. I will use Fig 2 of the pandora paper as a reference for this comment:

First, you need to split your genome into several loci (Fig 2A), do a multiple sequence alignment for each locus, and build a PRG (population reference graph) for each locus (Fig 2B). Once you have built the PRGs for all loci, you will have a Pangenome Reference Graph (PanRG, Fig 2C). Then you can use pandora index to index this PanRG, and pandora compare to call variants from sample reads against the indexed PanRG (Fig 2D).
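As a rough sketch of the order of those steps, the Python below only assembles the three command lines and writes the tab-separated sample file that pandora compare reads; it does not run anything. The subcommands pandora index and pandora compare and the --genotype option are mentioned in this thread. Everything else is an assumption to check against each tool's --help for your installed version: the make_prg from_msa subcommand with --input and --output-prefix, the --outdir flag, the `<prefix>.prg.fa` output name, and the helper names write_read_index and build_commands, which are hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def write_read_index(samples: dict[str, str], tsv_path: Path) -> None:
    """Write one 'sample<TAB>path/to/reads' line per sample.

    Assumption: pandora compare takes a tab-delimited file of sample
    names and read files as its query index.
    """
    lines = [f"{name}\t{reads}" for name, reads in samples.items()]
    tsv_path.write_text("\n".join(lines) + "\n")


def build_commands(
    msa_dir: str, prefix: str, read_index: str, outdir: str
) -> list[list[str]]:
    """Return the command lines in the order described above.

    All flag names except --genotype are assumptions; verify them with
    `make_prg from_msa --help` and `pandora compare --help`.
    """
    prg = f"{prefix}.prg.fa"  # assumed name of the make_prg output file
    return [
        # Fig 2B/2C: build one PRG per locus MSA and collect them into a PanRG
        ["make_prg", "from_msa", "--input", msa_dir, "--output-prefix", prefix],
        # Index the PanRG
        ["pandora", "index", prg],
        # Fig 2D: call variants across samples and genotype the VCF
        ["pandora", "compare", "--genotype", "--outdir", outdir, prg, read_index],
    ]
```

Each returned list can be printed for inspection or passed to subprocess.run once the flags have been confirmed. Newer pandora releases may expect the index file produced by pandora index, rather than the PRG itself, as the input to pandora compare.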

To split your genome into several loci (the first step), I would recommend panaroo to find genes and piggy to find intergenic regions. In a branch of a not-yet-merged fork, you can find a containerized snakemake pipeline that will do all this for you: https://github.com/leoisl/make_prg/tree/assemblies_to_PanRG/scripts/assemblies_to_PanRG . You just need to specify a directory with your assemblies, and it will run everything you need and create MSAs split by genes and intergenic regions in the output directory (see config.yaml and the Snakefile in that directory). Then you need to run make_prg on these loci, followed by pandora. We will soon have this script merged into this repo.

cheers

mbhall88 commented 2 years ago

You also need to provide the --genotype option if you want a genotyped VCF (see here)

khodor14 commented 2 years ago

This is for when I want to map reads, or compare reads, against an already indexed PRG. But what I want is different: I want to build a VCF for E. coli against a reference genome. I might also build different VCFs against different references.

I think this cannot be done with pandora. Anyway, thank you all for the detailed description.

leoisl commented 1 year ago

Closing due to inactivity, feel free to reopen.

iqbal-lab-org / make_prg

Vcf output #31