limix / pandas-plink

PLINK reader for Python.
MIT License
78 stars 18 forks source link

genotype values issue #15

Closed liusy-jz closed 3 years ago

liusy-jz commented 4 years ago

I got genotypes a0 with 2 and a1 with 0 when i use pandas-plink, and my pandas-plink version is 2.0.4.

horta commented 4 years ago

I need more information to work on. Why is that an issue?

liusy-jz commented 4 years ago

In GWAS analysis, minor allele code as 2, and a0 in plink bed format is minor allele. In my own analysis, minor allele code as 2 got right result and minor allele code as 0 not.

AnneHartebrodt commented 3 years ago

The pandas-plink documentation references the plink2.0 documentation for the bed file format. There is a difference in the encoding for the genotype values for Plink1 and plink2 type files. See section 2.3.2 of this manual https://github.com/chrchang/plink-ng/blob/master/pgen_spec/pgen_spec.pdf.

In plink1 0 is homozygous ALT; 1 missing; 2 heterozygous ALT-REF; 3 homozygous ALT In plink2 0 is homozygous REF; 1 heterozygous ALT-REF; 2 homozygous ALT; 3 missing

I wonder whether the problem is linked to this change in the encoding. It might be the reason for the unexpected result. I think I encountered the same issue.

horta commented 3 years ago

Thanks for the info. I have time this afternoon to read the references and see if it is easy so support plink2 easily. I will try to get back here later today.

horta commented 3 years ago

Do you have an example?

I have worked on the develop branch towards plink2 but it will take a while. However, I double checked, and reading plink1 bed files should be working fine accordingly to the plink1 definitions.

There are two official definitions (unfortunately). First one says (http://zzz.bwh.harvard.edu/plink/binary.shtml)

00  -- homozygote (first)
11  -- other homozygote (second)

and the second has been pointed out by @AnneHartebrodt . Those references don't mention anything regarding allele frequency. Therefore, as far as I understand, both homozygote (first) or homozygous ALT refer to the a0 allele, as defined in the bim file. What pandas_link does is to return the count of a1 allele.

I have an example here:

>>> from pandas_plink import get_data_folder, read_plink1_bin
>>>
>>> folder = get_data_folder()
>>> data = read_plink1_bin(f"{folder}/chr11.bed", f"{folder}/chr11.bim", f"{folder}/chr11.fam")
>>>
>>> snp_index = 1
>>> a0 = data.a0[snp_index].item()
>>> a0
'G'
>>> a1 = data.a1[snp_index].item()
>>> a1
'C'
>>> snp = data[:, snp_index].values.astype(int)
>>> snp
array([0, 1, 0, 0, 0, 2, 2, 0, 2, 2, 2, 2, 1, 0])
>>> ["".join((2 - count) * [a0] + count * [a1]) for count in snp]
['GG',
 'GC',
 'GG',
 'GG',
 'GG',
 'CC',
 'CC',
 'GG',
 'CC',
 'CC',
 'CC',
 'CC',
 'GC',
 'GG']

Let me know if you agree. I might need to make it more clear in the documentation.

liusy-jz commented 3 years ago

I agree, and I have converted genotype array in my code.

AnneHartebrodt commented 3 years ago

Hi all,

I agree you should be able to transform the data, such that the allele count is correct, but 'inverting' the data frame, meaning setting 0 to 2 and 2 to 0.

If you want to reproduce the behavior, I attached a small toy example, which should help you see where the confusion comes from. If you use plink1 and plink2 the table is inverted as described above.

toy.zip

plink --bfile hapmap1_100_1 --recode A --out hapmap1_100_1.1 plink2 --bfile hapmap1_100_1 --recode A --out hapmap1_100_1.2

If I read the example file using pandas-plink, the data looks like the one produced by plink2. The reference allele is probably implicitly specified differently, hence the different behavior. Maybe the user should be forced to specify the reference explicitly.

Anyways, thanks for the awesome package,

Cheers, A

horta commented 3 years ago

Thanks a lot for the input. It is very helpful as I don't have any plink2 example.

I will work on this toy.zip and check it.

horta commented 3 years ago

Thanks for the data, @AnneHartebrodt . Please, have a look at the visualisation of toy.bed: toy.bed using plink1 definition

In a plink1 bed there are only three fields: magic number (2 bytes), storage type (1 byte), and a sequence of 2-bits defining (homozygote (first), other homozygote (second), heterozygote, and missing).

I liked your suggestion. But instead of forcing the user, I would have a default parameter (count_allele="a1") to match plink1's original definition which the user can change to count_allele="a0" if desired. I think it would make less simplicity and give the opportunity to change its behavior.

I will also thrown an error if the user tries to parse a plink2 file (not sure if Im doing it) to avoid any confusion.

horta commented 3 years ago

I have just released a new version (2.1.0) with performance improvement, allowing the user to fine tune the chunking and to select the reference allele. Doc: https://pandas-plink.readthedocs.io/en/latest/api/pandas_plink.read_plink1_bin.html

Closing it for now. If you still find any problem, please, feel free to reopen it.