Closed liqingbioinfo closed 4 months ago
Hi @liqingbioinfo also found this. You can however fix it using ref="a0"
:
from os.path import join
from pandas_plink import read_plink1_bin
from pandas_plink import get_data_folder
G = read_plink1_bin(join(get_data_folder(), "chr*.bed"), verbose=False, ref="a0")
G[0:5, 0:5].compute() # the first 5x5
Which should get you:
array([[2., 2., 0., 2., 2.],
[2., 1., 0., 2., 2.],
[2., 2., 0., 1., 2.],
[2., 2., 0., 2., 2.],
[2., 2., 0., 2., 2.]], dtype=float32)
As opposed to using ref="a1"
(which is the default):
array([[0., 0., 2., 0., 0.],
[0., 1., 2., 0., 0.],
[0., 0., 2., 1., 0.],
[0., 0., 2., 0., 0.],
[0., 0., 2., 0., 0.]], dtype=float32)
This is also stated in the docs:
ref (str) – Reference allele. Specify which allele the dosage matrix will count. It can be either "a1" (default) or "a0".
Is a1
the sensible default? If so, I guess you can close this.
I will have a look at it later today.
If I’m not mistaken, I chose not to select a reference allele based on its frequency to (i) avoid having to read the entire dataset to decide which allele to use, and (ii) avoid arbitrarily determining a reference allele when the frequencies are identical. Instead, I use a1 by default and allow the user to select a different one if they wish.
It would be great if there is a way to avoid the above confusion. Let me know if there is!
I just wondered if by convention, one or the other is the ref.
On Wed, 19 Jun 2024 at 21:33, Danilo Horta @.***> wrote:
It would be great if there is a way to avoid the above confusion. Let me know if there is!
— Reply to this email directly, view it on GitHub https://github.com/limix/pandas-plink/issues/28#issuecomment-2179424632, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA7NEZ6FSTSEMORWGYELOTZIHTJRAVCNFSM6AAAAABJRRPJL6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNZZGQZDINRTGI . You are receiving this because you commented.Message ID: @.***>
Hi there,