Open AmayAgrawal opened 1 year ago
Hi there .
Thanks for the explanation regarding the presence of '.' in the GT. Now, it's a bit more clear to me. If I understand correctly, then I can just count those cases as SNP's not present in the sample, because for my downstream analysis, I need that information whether SNP is present or absent in a sample.
Further, I tried to use pandora compare
to call out the variants in multiple samples using the default (ML path-oriented (global) approach) as well as --local
option (coverage-oriented (local) genotyping) and then compared the results (VCF files) generated using both the approaches. Here, I observed that for some SNP's that are actuallty present in the sample (checked it by aligning the gene against the reference sequence from pandora_multisample.vcf_ref file that was used by pandora for that gene for genotyping), the --local
option contained the non 0 genotype for such SNP's, while in the default (global) approach, there was still '.' for such SNP's even though the genotype confidence was around 4 and the SNP's were present in the sample. There were some cases other way around also, where the global approach called out the already present SNP's, but in local approach there was '.' in the genotype column, but these cases were very few. So, I am still a bit confused as to which approach should I use for genotyping, the default (global) approach or the local approach?
I would recommend the default, it is default for a good reason. Local was our first attempt, but it is hard to ensure your local genotypibg results in globally consistent genotypes
If you're happy to share data, @leoisl and I would be interested to see examples where one approach worked and the other did not!
Hello, sorry, I was on holidays and couldn't answer before. For point 2, I have nothing else to add besides what @iqbal-lab already said. For point 1, I can develop a bit further:
- In the default global approach, I don't have any '1' in the genotype (GT) column (there is always either '.' or '0' infront of each SNP) even though the genotype confidence was around 4. While for the some of the same SNP's which has '.' in genotype column in global approach, we have '1' in genotype column using the local genotyping.
This is indeed the intended behaviour. Complementing what @iqbal-lab said in previous answers, by using the default global approach, pandora
chooses the maximum likelihood (ML) path in the graph (this is called the global choice) and then genotypes each variant in the ML path (this is called the local choice). Genotyping each variant is a local choice because we just look at a specific variant and the coverage data we have and choose an allele to genotype. In most cases, the local genotyping will genotype towards allele 0
, i.e. the ref, the allele in the ML path. This means that both the global and local genotyping agrees on their genotyping, and we genotype towards allele 0
. However, in some cases, the local genotyping chooses another allele. In this case, the global and local genotyping disagrees, and then we prefer to nullify the call (call .
). If you instead use the local approach, then pandora
will favour the local genotyping, even if that means changing the global genotyping choice, i.e. with the local approach, pandora
is free to change the ML path to make it compatible with the local genotyping. I think this is a very general explanation of both methods, and might be too abstract and not straightforward to understand, I can develop further on this if needed.
We set global genotyping as default because that is what worked better in our data for the pandora
paper, which is an E. coli dataset. However, we are also seeing better results with local genotyping in a TB dataset. I think that a good rule of thumb, although I don't have much data to back this up, is that if you expect a considerable amount of long variants and if your core genome is small (like in E. coli), the global genotyping approach will work better. Otherwise, if you just expect small variants, and your core genome is large (like in TB), the local genotyping will work better.
Hi @leoisl,
Thanks for a nice and detailed explanation. The difference between the two approaches is a bit clear to me now. It would be great if there is any documentation about both the approaches where I can read about them in more detail
Anyways, as in your last comment you suggested that for E.coli dataset that was used in the paper, the global approach worked better over the local approach. I am also using the E.coli data in my work and I have more than 1800 samples. So, I was trying to use pandora compare
to call out the variants in these strains. As @iqbal-lab mentioned, I created a small example where I used pandora compare for two strains and compared the results using both the global and local approach. I am attaching the zip file containing all the necessary data if you want to look at the results (pandora.zip). Here, I observed that for SNP's, local approach performed better as compared to global approach (as local approach was able to detect them but global approach could not), but incase of Phased_SNPS and INDELs, global approach performs better as compared to local approach
I am using
pandora map
to align the reads against the pangenome reference and call out the variants using the--genotype
option. I ran the pandora map command using the default (ML path-oriented (global) approach) as well as--local
option (coverage-oriented (local) genotyping) to call out the variants and then compared the results (VCF files) generated using both the approaches. Here I am facing two issues:1) In the default global approach, I don't have any '1' in the genotype (GT) column (there is always either '.' or '0' infront of each SNP) even though the genotype confidence was around 4. While for the some of the same SNP's which has '.' in genotype column in global approach, we have '1' in genotype column using the local genotyping. So I went back and check the alignment for that gene and I can see that infact there is a actaully a SNP present at the location where the local approach says it is present, but the global approach says it is '.' So I am not sure which approach to use while calling out the variants, either global (which is default) or the local approach?
2) Another question is regarding the '.' which is present in the genotype column in the VCF file. Here, I think that genotype confidence (GT_CONF) required to make a call is low and this is why software is not able to make a call whether SNP is present or absent in the sample, because by default the minimum GT_CONF required to make a call is 1. So I changed the minimum GT_CONF required to make a call to 0 by setting the parameter
----gt-conf
to 0, but still there are lot of SNP's where we still have '.' in the genotype column. So, I was wondering that if there is any way somehow, so that we can get only '0' or '1' in the genotype column, which suggests that the SNP is either absent or present in that sample respectively. My ultimate aim is to usepandora compare
to call out the variants in the multiple samples, but when I ran that using the default global approach, I observed that there were lot of '.' in the genotype column (GT_CONF was 0). For some of these '.' that I checked randomly, I can see that the SNP is actually either present or absent in the sample but the software still outputs '.' instead of '0' or '1'.It would be great if someone can clear this doubts for me.
P.S: I am still using pandora version 0.9.1 (although 0.9.2 is available now) because this is still the latest version available with conda install.