iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
110 stars 14 forks source link

Understanding Genotype output in the VCF file generated using Pandora map and compare #317

Open AmayAgrawal opened 1 year ago

AmayAgrawal commented 1 year ago

I am using pandora map to align the reads against the pangenome reference and call out the variants using the --genotype option. I ran the pandora map command using the default (ML path-oriented (global) approach) as well as --local option (coverage-oriented (local) genotyping) to call out the variants and then compared the results (VCF files) generated using both the approaches. Here I am facing two issues:

1) In the default global approach, I don't have any '1' in the genotype (GT) column (there is always either '.' or '0' infront of each SNP) even though the genotype confidence was around 4. While for the some of the same SNP's which has '.' in genotype column in global approach, we have '1' in genotype column using the local genotyping. So I went back and check the alignment for that gene and I can see that infact there is a actaully a SNP present at the location where the local approach says it is present, but the global approach says it is '.' So I am not sure which approach to use while calling out the variants, either global (which is default) or the local approach?

2) Another question is regarding the '.' which is present in the genotype column in the VCF file. Here, I think that genotype confidence (GT_CONF) required to make a call is low and this is why software is not able to make a call whether SNP is present or absent in the sample, because by default the minimum GT_CONF required to make a call is 1. So I changed the minimum GT_CONF required to make a call to 0 by setting the parameter ----gt-conf to 0, but still there are lot of SNP's where we still have '.' in the genotype column. So, I was wondering that if there is any way somehow, so that we can get only '0' or '1' in the genotype column, which suggests that the SNP is either absent or present in that sample respectively. My ultimate aim is to use pandora compare to call out the variants in the multiple samples, but when I ran that using the default global approach, I observed that there were lot of '.' in the genotype column (GT_CONF was 0). For some of these '.' that I checked randomly, I can see that the SNP is actually either present or absent in the sample but the software still outputs '.' instead of '0' or '1'.

It would be great if someone can clear this doubts for me.

P.S: I am still using pandora version 0.9.1 (although 0.9.2 is available now) because this is still the latest version available with conda install.

iqbal-lab commented 1 year ago

Hi there .

  1. If you run Pandora with just one sample and do not give it a reference genome to use in its vcf, it will map the reads and infer a personalised reference to use in the vcf which is as close as possible to the sample. So most genotype entries will be 0. If you use 'Pandora compare' to compare multiple samples, it will choose a single fixed reference per gene to use in the vcf. At that point you'll have many non 0 genotypes
  2. Suppose a gene has two very divergent alleles/haplotypes, each with snps. If a sample has basically the first haplotype with a few snp differences, then all the snps on the second haplotype will be dot . . Does this make sense? Basically the vcf contains all the snps in the graph, but not all snps in the graph lie in the reference, so those which do not get dot genotype
AmayAgrawal commented 1 year ago

Thanks for the explanation regarding the presence of '.' in the GT. Now, it's a bit more clear to me. If I understand correctly, then I can just count those cases as SNP's not present in the sample, because for my downstream analysis, I need that information whether SNP is present or absent in a sample.

Further, I tried to use pandora compare to call out the variants in multiple samples using the default (ML path-oriented (global) approach) as well as --local option (coverage-oriented (local) genotyping) and then compared the results (VCF files) generated using both the approaches. Here, I observed that for some SNP's that are actuallty present in the sample (checked it by aligning the gene against the reference sequence from pandora_multisample.vcf_ref file that was used by pandora for that gene for genotyping), the --local option contained the non 0 genotype for such SNP's, while in the default (global) approach, there was still '.' for such SNP's even though the genotype confidence was around 4 and the SNP's were present in the sample. There were some cases other way around also, where the global approach called out the already present SNP's, but in local approach there was '.' in the genotype column, but these cases were very few. So, I am still a bit confused as to which approach should I use for genotyping, the default (global) approach or the local approach?

iqbal-lab commented 1 year ago

I would recommend the default, it is default for a good reason. Local was our first attempt, but it is hard to ensure your local genotypibg results in globally consistent genotypes

iqbal-lab commented 1 year ago

If you're happy to share data, @leoisl and I would be interested to see examples where one approach worked and the other did not!

leoisl commented 1 year ago

Hello, sorry, I was on holidays and couldn't answer before. For point 2, I have nothing else to add besides what @iqbal-lab already said. For point 1, I can develop a bit further:

  1. In the default global approach, I don't have any '1' in the genotype (GT) column (there is always either '.' or '0' infront of each SNP) even though the genotype confidence was around 4. While for the some of the same SNP's which has '.' in genotype column in global approach, we have '1' in genotype column using the local genotyping.

This is indeed the intended behaviour. Complementing what @iqbal-lab said in previous answers, by using the default global approach, pandora chooses the maximum likelihood (ML) path in the graph (this is called the global choice) and then genotypes each variant in the ML path (this is called the local choice). Genotyping each variant is a local choice because we just look at a specific variant and the coverage data we have and choose an allele to genotype. In most cases, the local genotyping will genotype towards allele 0, i.e. the ref, the allele in the ML path. This means that both the global and local genotyping agrees on their genotyping, and we genotype towards allele 0. However, in some cases, the local genotyping chooses another allele. In this case, the global and local genotyping disagrees, and then we prefer to nullify the call (call .). If you instead use the local approach, then pandora will favour the local genotyping, even if that means changing the global genotyping choice, i.e. with the local approach, pandora is free to change the ML path to make it compatible with the local genotyping. I think this is a very general explanation of both methods, and might be too abstract and not straightforward to understand, I can develop further on this if needed.

We set global genotyping as default because that is what worked better in our data for the pandora paper, which is an E. coli dataset. However, we are also seeing better results with local genotyping in a TB dataset. I think that a good rule of thumb, although I don't have much data to back this up, is that if you expect a considerable amount of long variants and if your core genome is small (like in E. coli), the global genotyping approach will work better. Otherwise, if you just expect small variants, and your core genome is large (like in TB), the local genotyping will work better.

AmayAgrawal commented 1 year ago

Hi @leoisl,

Thanks for a nice and detailed explanation. The difference between the two approaches is a bit clear to me now. It would be great if there is any documentation about both the approaches where I can read about them in more detail

Anyways, as in your last comment you suggested that for E.coli dataset that was used in the paper, the global approach worked better over the local approach. I am also using the E.coli data in my work and I have more than 1800 samples. So, I was trying to use pandora compare to call out the variants in these strains. As @iqbal-lab mentioned, I created a small example where I used pandora compare for two strains and compared the results using both the global and local approach. I am attaching the zip file containing all the necessary data if you want to look at the results (pandora.zip). Here, I observed that for SNP's, local approach performed better as compared to global approach (as local approach was able to detect them but global approach could not), but incase of Phased_SNPS and INDELs, global approach performs better as compared to local approach