Closed vaipathak closed 1 year ago
Hello @vaipathak,
Thanks for this very detailed post and also for sending the imported-variants.zip
file.
First of all, did you modify the file before sending it to me? It kept throwing an error and later it turned out that there was an offending line at the position 12-21196253
. It looked like the file has been modified in a text editor. I simply removed the offending line for my testing, but I just wanted to confirm.
Because the original file contained too many samples for my testing (N=10,389), I created a mini-VCF by taking the first 1,000 samples and ran PyPGx on it. Here's the breakdown of predicted phenotypes:
Predicted phenotype | Count |
---|---|
Normal Function | 659 |
Decreased Function | 242 |
Indeterminate | 25 |
Increased Function | 41 |
Poor Function | 28 |
Possible Decreased Function | 5 |
As you can see, at least in my testing, Normal Function
is the dominant phenotype as expected.
Next, just to be sure, I looked at the distribution of individual diplotypes, like the plot you attached: diplotypes.csv. Again, I saw way more *1/*1
than other diplotypes such as *1/*15
. Then I realized that in your plot, *1/*15
is displayed as *1A/*15
, which was indicative of the fact that you are probably using an older version of PyPGx (before 0.18.0 to be more specific; see the CHANGELOG for details and also this PR). This is because, starting with the PharmVar version v5.2.1, SLCO1B1 was formally added to PharmVar in 2021, which resulted in many changes including star allele definitions such as *1A
→ *1
. Please also refer to the publication PharmVar GeneFocus: SLCO1B1 by Ramsey et al., 2022 (BTW, I'm a co-author of this paper).
In short, you mentioned that you are using the 0.19.0
version of PyPGx, but the fact that you are getting alleles such as *1A
tells me that you may not. Please double check that you are using the latest version (it's 0.21.0
). Also, if you are indeed using the latest version, please also make sure weather you haven't (accidentally) modified the allele table (i.e. allele-table.csv
) and that's why causing the issue.
Hi Dr. Lee,
This is quite helpful, thank you for your own detailed explanation. Yes a few of the samples had some sensitive information so I just changed them all, I may have added an extra line, my apologies. According to pypgx --version
I do get 0.19.0. However it sounds like the best recourse in this case is to update to the latest version of PyPGx and then re-run SLCO1B1. Once I have that installed and run, I can let you know how it goes.
Thanks again for all your help!
Good news, it looks like just running with the most updated version fixed the issue - it lines up with the same values that you got as well. Thanks again for the help and support! 😊
A previous issue of mine had BEAGLE overwriting existing genotypes if samples had missing genotype calls, which caused the wrong genotypes to be called. So for example - I had with a genotype showing up more than the 1 / 1. In the gene UGT1A1, PyPgx calls the diplotype 1 / 80 in close to 45% of the samples we have. Looking at the UGT1A1 frequency table from CPIC, the population with largest 80 allele has an allele frequency of 0.4503 , though the diplotype of 1 / *80 has a frequency of 0.0278 in that same population (African American). The next highest was European with an diplotype frequency of 0.22. So this was the main cause of concern as to why we were experiencing such a high call rate. Now this issue was fixed for most of the genes with phenotypes by imputing (./.) missing calls with (0/0).
While all other genes show allele frequency phenotypes that make sense, gene SLCO1B1 still seems to be misbehaving where we still see a non-normal phenotype being called within a population more often than a normal one as seen in the image below:
Here is the pipeline I have running so far.