iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License

Issue while choosing the reference path for genotyping #329

Open AmayAgrawal opened 1 year ago

AmayAgrawal commented 1 year ago

Hi,

I am facing an issue with the reference path that pandora uses when genotyping variants. It appears to use a less frequently supported path as the reference instead of the most frequently supported one. I will try to explain this with a simple example below:

Suppose I am using 100 strains for my analysis. First, I did the pan-genome analysis and used the MSAs to build the pan-genome reference graphs (PRGs). Next, I used these PRGs to genotype the variants in these 100 strains with pandora. Now suppose that in the graph for a particular locus (let's say gene A), at a particular position (let's say 300), there are 3 possible paths. If I understand correctly, the path supported by the majority of the 100 strains should be chosen as the reference, but that was not the case. As a result, for the SNP I was looking for (let's say C 300 T, where 'C' is the ref allele and 'T' is the alt allele), pandora actually chooses 'T' as ref and 'C' as alt. I saw in one of the currently open issues that pandora heavily undermaps (#325). Could it be choosing the less frequent path because of this, or am I misunderstanding something?

iqbal-lab commented 1 year ago
  1. Yes, this is possible. Pandora needs to make a "global" choice of a path from one end of the gene to the other. Sometimes the data is such that lots of reads force the path one way across the graph, and this takes the path "a long way away vertically" from a bubble deep in the graph where there is a lot of coverage for one allele. If there is no way to make a single path consistent with all of that, it does what it can based on dynamic programming.

Suppose the MSA looks like:

    xxxxxAxxxxxx
    xxxxxCxxxxx
    xxyyyyyyyyxx

If there is very low coverage on the x's and lots on the y's, you get forced onto the bottom path, and the A/C choice becomes irrelevant/ignored.

  2. It's hard to comment more without concrete data. I expect it's not pandora undermapping, but I can't tell. Would you like to share more details?

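
The forced-path effect described in point 1 can be sketched with a toy graph. This is only an illustration under stated assumptions, not pandora's actual scoring or code: the node names, the coverage numbers, and the score (sequence length times per-base coverage, summed along the path) are all invented for the example. It shows how a dynamic-programming search for the best end-to-end path can skip a bubble whose A allele is well covered locally.

```python
from functools import lru_cache

# Toy graph loosely mirroring the three-row MSA above.
# node -> (sequence, per-base coverage). All values are invented.
NODES = {
    "start": ("xx", 30),
    "top_left": ("xxx", 2),      # low coverage on the x's
    "A": ("A", 40),              # locally well-supported allele
    "C": ("C", 1),
    "top_right": ("xxxx", 2),    # low coverage on the x's
    "bottom": ("yyyyyyyy", 30),  # lots of coverage on the y's
    "end": ("xx", 30),
}
EDGES = {
    "start": ["top_left", "bottom"],
    "top_left": ["A", "C"],
    "A": ["top_right"],
    "C": ["top_right"],
    "top_right": ["end"],
    "bottom": ["end"],
    "end": [],
}


def node_score(name):
    """Hypothetical score: number of bases times per-base coverage."""
    seq, cov = NODES[name]
    return len(seq) * cov


@lru_cache(maxsize=None)
def best_path_from(name):
    """Return (score, path) for the best-scoring path from `name` to the sink."""
    own = node_score(name)
    successors = EDGES[name]
    if not successors:
        return own, (name,)
    tail_score, tail_path = max(best_path_from(nxt) for nxt in successors)
    return own + tail_score, (name,) + tail_path


best_score, best_path = best_path_from("start")
print(best_score, best_path)
```

Within the top branch alone, the search prefers A over C, but the global best path goes through the bottom branch, so the A/C bubble never appears on the chosen path.
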
AmayAgrawal commented 1 year ago

Hi, I have uploaded a zip folder at this link (https://nubes.helmholtz-berlin.de/s/R8SHBsT8yDmeca4), which contains all the files required to reproduce the issue I am describing. The zip folder includes a 'README' file that explains all the steps and the files it contains.

Let me know if you have any more questions for me.

iqbal-lab commented 6 months ago

Omg, we have not replied to you! So sorry @AmayAgrawal, we will return to this after the Xmas vacation.

AmayAgrawal commented 6 months ago

No worries. It would be nice if you could look at this now.