limix / pandas-plink

PLINK reader for Python.
MIT License
79 stars 18 forks source link

Genotype code wrong! #28

Closed liqingbioinfo closed 4 months ago

liqingbioinfo commented 2 years ago

Hi there,

 I was surprised to find a major error in your pandas-plink package! I used it to read /bed/bim/fam files into python. But I found that my minor alleles homozygous were coded as 0 and major alleles homozygous were coded as 2!!! That invests all my results! Can you please check your source code to correct such an error! Otherwise, it's really dangerous for others to continue using this pandas-plink package!!!

image

KennethEnevoldsen commented 2 years ago

Hi @liqingbioinfo also found this. You can however fix it using ref="a0":

from os.path import join
from pandas_plink import read_plink1_bin
from pandas_plink import get_data_folder

G = read_plink1_bin(join(get_data_folder(), "chr*.bed"), verbose=False, ref="a0")
G[0:5, 0:5].compute() # the first 5x5

Which should get you:

array([[2., 2., 0., 2., 2.],
       [2., 1., 0., 2., 2.],
       [2., 2., 0., 1., 2.],
       [2., 2., 0., 2., 2.],
       [2., 2., 0., 2., 2.]], dtype=float32)

As opposed to using ref="a1" (which is the default):

array([[0., 0., 2., 0., 0.],
       [0., 1., 2., 0., 0.],
       [0., 0., 2., 1., 0.],
       [0., 0., 2., 0., 0.],
       [0., 0., 2., 0., 0.]], dtype=float32)

This is also stated in the docs:

ref (str) – Reference allele. Specify which allele the dosage matrix will count. It can be either "a1" (default) or "a0".

dbolser commented 4 months ago

Is a1 the sensible default? If so, I guess you can close this.

horta commented 4 months ago

I will have a look at it later today.

horta commented 4 months ago

If I’m not mistaken, I chose not to select a reference allele based on its frequency to (i) avoid having to read the entire dataset to decide which allele to use, and (ii) avoid arbitrarily determining a reference allele when the frequencies are identical. Instead, I use a1 by default and allow the user to select a different one if they wish.

horta commented 4 months ago

It would be great if there is a way to avoid the above confusion. Let me know if there is!

dbolser commented 4 months ago

I just wondered if by convention, one or the other is the ref.

On Wed, 19 Jun 2024 at 21:33, Danilo Horta @.***> wrote:

It would be great if there is a way to avoid the above confusion. Let me know if there is!

— Reply to this email directly, view it on GitHub https://github.com/limix/pandas-plink/issues/28#issuecomment-2179424632, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA7NEZ6FSTSEMORWGYELOTZIHTJRAVCNFSM6AAAAABJRRPJL6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNZZGQZDINRTGI . You are receiving this because you commented.Message ID: @.***>