sbslee / pypgx

A Python package for pharmacogenomics (PGx) research
https://pypgx.readthedocs.io
MIT License

Regarding RefAllele and Default Allele in gene-table.csv #51

Closed NTNguyen13 closed 2 years ago

NTNguyen13 commented 2 years ago

Hi, this question is mostly to better understand the pypgx data structure.

I tried to figure out the meaning of RefAllele, but my understanding doesn't seem quite right. I thought RefAllele was the allele represented in the human reference genome (the FASTA file), but GRCh37Default and GRCh38Default already represent that. I also saw cases where GRCh37Default and GRCh38Default differ from each other (I think because of changes between GRCh37 and GRCh38), but I found 5 cases where GRCh37Default and GRCh38Default are the same as each other but different from RefAllele:

Gene      RefAllele   GRCh37Default   GRCh38Default
ABCB1     *1          *2              *2
NAT2      *4          *12             *12
SLC22A2   *1          *3              *3
UGT2B7    *1          *2              *2
UGT2B15   *1          *2              *2

I found this logic check, which assigns an allele when no candidate is found, but I still don't fully understand the role of RefAllele:

if ref_allele != default_allele and ref_allele not in candidates and default_allele not in candidates:
    candidates.append(default_allele)
if not candidates:
    candidates.append(default_allele)

Could you please explain what RefAllele is, and how to assign it in the gene table? Thank you very much.
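To see what the quoted check does in practice, it can be wrapped in a small function and run on a few inputs. This is only a sketch: the function name `add_default_candidate` and the sample star alleles are hypothetical, and only the two `if` statements come from the snippet above.

```python
def add_default_candidate(candidates, ref_allele, default_allele):
    """Hypothetical wrapper around the quoted check; returns a new list."""
    candidates = list(candidates)  # copy so the caller's list is not mutated

    # The assembly's default differs from the reference allele and neither
    # is already a candidate: add the default allele.
    if (ref_allele != default_allele
            and ref_allele not in candidates
            and default_allele not in candidates):
        candidates.append(default_allele)

    # Still no candidate at all: fall back to the default allele.
    if not candidates:
        candidates.append(default_allele)

    return candidates


print(add_default_candidate([], "*1", "*2"))      # ['*2']
print(add_default_candidate([], "*1", "*1"))      # ['*1']
print(add_default_candidate(["*3"], "*1", "*2"))  # ['*3', '*2']
print(add_default_candidate(["*1"], "*1", "*2"))  # ['*1']
```

In this sketch an empty candidate list always ends up holding the default allele, never RefAllele, which is where the difference between the two columns starts to matter.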

sbslee commented 2 years ago

@NTNguyen13,

Good question! The RefAllele column in the gene-table.csv file gives the reference star allele for the given gene (some people refer to it as the "wild-type" allele, but "reference allele" is the preferred term).

For example, the CYP2D6 gene has *1 as its reference allele, and therefore RefAllele is *1. Now, if you look at the CYP2D6 sequence in GRCh37, you will find that it actually matches that of *2; therefore, the GRCh37Default column is *2. Finally, when you do the same for GRCh38, its CYP2D6 sequence matches that of *1, and so GRCh38Default is *1.

Let me know if you have more questions.

P.S. You will see that the NAT2 gene has *4 as its reference allele instead of *1. That's for historical reasons. See the official NAT2 alleles page for more details (http://nat.mbg.duth.gr/Human%20NAT2%20alleles_2013.htm).
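As a rough illustration of how the three columns relate, the sketch below parses a few rows and reports, per gene, which assemblies carry a default allele that differs from RefAllele. The rows are copied from this thread and embedded inline. The real gene-table.csv has more genes and columns, and the helper name `assemblies_differing_from_ref` is made up for this example.

```python
import csv
import io

# Rows taken from this thread; not the full gene-table.csv.
GENE_TABLE = """Gene,RefAllele,GRCh37Default,GRCh38Default
CYP2D6,*1,*2,*1
ABCB1,*1,*2,*2
NAT2,*4,*12,*12
"""


def assemblies_differing_from_ref(row):
    """Return the assemblies whose default allele is not the reference allele."""
    return [
        assembly
        for assembly in ("GRCh37", "GRCh38")
        if row[f"{assembly}Default"] != row["RefAllele"]
    ]


rows = list(csv.DictReader(io.StringIO(GENE_TABLE)))
result = {row["Gene"]: assemblies_differing_from_ref(row) for row in rows}

for gene, assemblies in result.items():
    print(gene, assemblies)
# CYP2D6 ['GRCh37']
# ABCB1 ['GRCh37', 'GRCh38']
# NAT2 ['GRCh37', 'GRCh38']
```

For CYP2D6 only GRCh37 differs from the reference allele, while for ABCB1 and NAT2 both assemblies carry a non-reference default.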

NTNguyen13 commented 2 years ago

Thank you for the quick response! So if I find a new gene to add to pypgx, I can assign RefAllele based on a literature review, and GRCh37Default and GRCh38Default based on the human genome sequence of the corresponding assembly version. Am I right?

sbslee commented 2 years ago

That's correct! Though I would strongly advise that if you have a PGx gene you'd like to add to PyPGx, please first open a new issue in the repository for discussion before making a PR 😄

NTNguyen13 commented 2 years ago

Yes, I'm definitely going to follow that!