sbslee / pypgx

A Python package for pharmacogenomics (PGx) research
https://pypgx.readthedocs.io
MIT License
66 stars 13 forks source link

SLCO1B1 #103

Closed vaipathak closed 1 year ago

vaipathak commented 1 year ago

A previous issue of mine had BEAGLE overwriting existing genotypes if samples had missing genotype calls, which caused the wrong genotypes to be called. So for example - I had with a genotype showing up more than the 1 / 1. In the gene UGT1A1, PyPgx calls the diplotype 1 / 80 in close to 45% of the samples we have. Looking at the UGT1A1 frequency table from CPIC, the population with largest 80 allele has an allele frequency of 0.4503 , though the diplotype of 1 / *80 has a frequency of 0.0278 in that same population (African American). The next highest was European with an diplotype frequency of 0.22. So this was the main cause of concern as to why we were experiencing such a high call rate. Now this issue was fixed for most of the genes with phenotypes by imputing (./.) missing calls with (0/0).

While all other genes show allele frequency phenotypes that make sense, gene SLCO1B1 still seems to be misbehaving where we still see a non-normal phenotype being called within a population more often than a normal one as seen in the image below: SLCO1B1 Barplot

Here is the pipeline I have running so far.


##Change` (./.) to (0/0)
bcftools +missing2ref GermlineOriginal_Annot.vcf.gz -Oz -o GermlineOriginal_Annot_missing2ref.vcf.gz

##Grab the genotype only to avoid errors and crashes that occurred previously
bcftools annotate -x '^FORMAT/GT' GermlineOriginal_Annot_missing2ref.vcf.gz -Oz -o GermlineOriginal_Annot_missing2refGT.vcf.gz

##Index vcf
bcftools index -t GermlineOriginal_Annot_missing2refGT.vcf.gz

##Run looped pypgx -> Currently running PyPGx 0.19.0 and fuc 0.36.0

module load python/3.9.5 
#CPIC genes only
declare -a StringArray=("ABCG2" "CYP2C9" "CACNA1S" "CFTR" "VKORC1" "CYP2B6" "CYP2C19" "CYP2D6" "CYP3A5" "NUDT15" "TPMT" "DPYD" "UGT1A1" "SLCO1B1")

# Iterate the string array using for loop
for val in ${StringArray[@]}; do

#Running Filtered 
pypgx run-ngs-pipeline $val $val-pipeline --variants GermlineOriginal_Annot_missing2refGT.vcf.gz --force --do-not-plot-allele-fraction --do-not-plot-copy-number &
#Dr. Lee mentioned to use the option --force --do-not-plot-allele-fraction to avoid a crash - should be fixed in version 0.20.0
done
sbslee commented 1 year ago

Hello @vaipathak,

Thanks for this very detailed post and also for sending the imported-variants.zip file.

First of all, did you modify the file before sending it to me? It kept throwing an error and later it turned out that there was an offending line at the position 12-21196253. It looked like the file has been modified in a text editor. I simply removed the offending line for my testing, but I just wanted to confirm.

Because the original file contained too many samples for my testing (N=10,389), I created a mini-VCF by taking the first 1,000 samples and ran PyPGx on it. Here's the breakdown of predicted phenotypes:

Predicted phenotype Count
Normal Function 659
Decreased Function 242
Indeterminate 25
Increased Function 41
Poor Function 28
Possible Decreased Function 5

As you can see, at least in my testing, Normal Function is the dominant phenotype as expected.

Next, just to be sure, I looked at the distribution of individual diplotypes, like the plot you attached: diplotypes.csv. Again, I saw way more *1/*1 than other diplotypes such as *1/*15. Then I realized that in your plot, *1/*15 is displayed as *1A/*15, which was indicative of the fact that you are probably using an older version of PyPGx (before 0.18.0 to be more specific; see the CHANGELOG for details and also this PR). This is because, starting with the PharmVar version v5.2.1, SLCO1B1 was formally added to PharmVar in 2021, which resulted in many changes including star allele definitions such as *1A*1. Please also refer to the publication PharmVar GeneFocus: SLCO1B1 by Ramsey et al., 2022 (BTW, I'm a co-author of this paper).

In short, you mentioned that you are using the 0.19.0 version of PyPGx, but the fact that you are getting alleles such as *1A tells me that you may not. Please double check that you are using the latest version (it's 0.21.0). Also, if you are indeed using the latest version, please also make sure weather you haven't (accidentally) modified the allele table (i.e. allele-table.csv) and that's why causing the issue.

vaipathak commented 1 year ago

Hi Dr. Lee,

This is quite helpful, thank you for your own detailed explanation. Yes a few of the samples had some sensitive information so I just changed them all, I may have added an extra line, my apologies. According to pypgx --version I do get 0.19.0. However it sounds like the best recourse in this case is to update to the latest version of PyPGx and then re-run SLCO1B1. Once I have that installed and run, I can let you know how it goes.

Thanks again for all your help!

vaipathak commented 1 year ago

Good news, it looks like just running with the most updated version fixed the issue - it lines up with the same values that you got as well. Thanks again for the help and support! 😊