limix / bgen-reader-py

A BGEN file format reader.
MIT License
10 stars 3 forks source link

Genotypes in probabilities? #16

Open Hoeze opened 5 years ago

Hoeze commented 5 years ago

Hi, how do I retrieve the genotype column ID for genotype[x].compute()["probs"]?

horta commented 5 years ago

Hi @Hoeze , you can use x to index bgen["variants"] and retrieve the information you want about the x-th variant: bgen["variants"].iloc[x].

Hoeze commented 5 years ago

Thanks for your fast response @horta. I tried your suggestion like in the following example:

In [22]: from bgen_reader import *                                                                                                                                                                                                          

In [23]: bgen = read_bgen("complex.bgen")                                                                                                                                                                                                   
Reading samples|===========================================================================================================================================================================================================================|
Mapping variants: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 51590.46it/s]

In [24]: geno = bgen["genotype"][8].compute()                                                                                                                                                                                               

In [25]: v = bgen["variants"].compute().iloc[8]                                                                                                                                                                                             

In [26]: print(v)                                                                                                                                                                                                                           
id                                                
rsid                                            M9
chrom                                           01
pos                                              9
nalleles                                         8
allele_ids    A,G,GT,GTT,GTTT,GTTTT,GTTTTT,GTTTTTT
vaddr                                          783
Name: 8, dtype: object

In [27]: print(geno["probs"])                                                                                                                                                                                                               
[[ 1.  0.  0.  0.  0.  0.  0.  0. nan nan nan nan nan nan nan nan nan nan
  nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
 [ 0.  0.  0.  0.  0.  0.  1.  0. nan nan nan nan nan nan nan nan nan nan
  nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
 [ 0.  0.  0.  0.  1.  0.  0.  0. nan nan nan nan nan nan nan nan nan nan
  nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]

In [28]: print(geno["probs"].shape)                                                                                                                                                                                                         
(4, 36)

I already found the allele_ids, but how do I know which column in geno["probs"] contains which genotype? For example, column 1 could contain A/A, A/G, etc.

In the end, my target is to extract in a certain genomic range the most likely sequence for each individual.

horta commented 5 years ago

That is the tricky part because BGEN allows for very general genotype. It depends wether you have Unphased genotype, Phased genotypes, number of alleles, ploiyd. The quick start gives a quick idea on how to perform the association between probability and genotype (in particular, read the comments in the code). But the ultimate answer is in the section Per-sample order of stored probabilities of bgen specification.

Hoeze commented 5 years ago

Hm, I see. How do you solve this problem in your projects? Do you have a method which calculates it? Otherwise, I could try to write one...

horta commented 5 years ago

Have a look at allele_expectation: https://bgen-reader.readthedocs.io/en/latest/expectation.html

horta commented 5 years ago

But make sure you have unphased genotype: "This function supports unphased genotypes only."

Hoeze commented 5 years ago

Ah, thank you for the hint, I found what I searched for: https://github.com/limix/bgen-reader-py/blob/2651d8736c36e797b029dedc0f1ccaf65fa735f8/bgen_reader/_helper.py#L1

Would it be possible to have this function publicly exported in genotype?

horta commented 5 years ago

Sure. Will do it for the 3.1.0 release