iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
107 stars 14 forks source link

getting position of variants with respect to the genome #279

Open smb20200615 opened 3 years ago

smb20200615 commented 3 years ago

Hello,

The vcf that is outputted by pandora gives variant positions with respect to each individual gene. Is it possible to map the position of variants with respect to the pangenome reference?

Many thanks!

smb20200615 commented 3 years ago

Sorry to build on that question. Is there anyway we can map positions of variants with respect to a publicly available reference genome?

mbhall88 commented 3 years ago

You can only "normalise" the variant positions in the VCF to a reference genome if the --vcf-ref argument was used.

Here is an example of a script I use to do this https://github.com/mbhall88/head_to_head_pipeline/blob/master/analysis/pandora_variants/scripts/normalise_pos.py

smb20200615 commented 2 years ago

It seems like a --vcf-ref is not an ordinary fasta file. So I cant use one of my genomes as a reference?

A fasta file with an entry for each LocalPRG giving reference sequence for VCF. Must have a perfect match in the graph and the same name as the graph --illumina. Data is from illumina rather than nanopore, so is shorter with low error rate

I am interested in removing recombinant SNPs so being able to map these variants against one of my illumina genomes and plot the SNP density would be useful. Is there any other way to filter recombinant SNPs?

mbhall88 commented 2 years ago

It seems like a --vcf-ref is not an ordinary fasta file. So I cant use one of my genomes as a reference?

What are your PRGs? Genes? If so, take the reference sequence for each of those genes and put them into a file and this is your vcf-ref. Then, with the resulting pandora VCF you can easily adjust the VCF POS to the overall genome position. i.e., take the start position of the gene in the reference genome and add that to the VCF POS (which is relative to the gene).

Is there any other way to filter recombinant SNPs?

Excuse my naivety, I'm not super clear on what a recombinant SNP is. I'm going to assume it's analogous to a heterozygous SNP? If so, you can try filter positions without a certain percentage of median (or mean) coverage support for the called allele. So, if we require a fraction of read support (FRS) of 90%, for a position that has called the reference allele has coverage 8 on the ref and the alt has 2, the FRS for that position is 0.8 and thus filtered out.

Sorry if this is not what you're asking about and I have just rambled....

iqbal-lab commented 2 years ago

This is a limitation of Pandora currently, it doesn't tell you gene order. @mbhall88 what they are after is this

Normally if you want to build a tree, you want a vcf of snps that appeared in a clock-like manner over evolutionary time . This allows you to get correct branch lengths. If there has been a recombination event , switching in 100kb say, that new chunk has a different history, and typically you get too many snps on that region. So, people scan the genome, look for sudden peaks of snp density, and mask those out. (I'm being a bit cavalier, check out gubbins or clonalFrameML.)

So, currently you can't do this with Pandora output as you don't know the layout of the genes on the genome.

One simple hack would be to take the vcfref, which has one read per gene, and map it to the assemblies of all your genomes. That would tell you where the genes are on each assembly, and you could then scan 5kb chunks of each assembly and look for peaks. It's pretty hacky though. Hope this made sense.

smb20200615 commented 2 years ago

Yes, that is exactly my goal @iqbal-lab! Thank you for clarifying the issue about gene order. Do you foresee any issues with using tools such as Mauve to get the gene order for each isolate after normalizing the positions against the pangenome reference for that isolate?