Open Hoeze opened 5 years ago
Hi @Hoeze , you can use x
to index bgen["variants"]
and retrieve the information you want about the x-th variant: bgen["variants"].iloc[x]
.
Thanks for your fast response @horta. I tried your suggestion like in the following example:
In [22]: from bgen_reader import *
In [23]: bgen = read_bgen("complex.bgen")
Reading samples|===========================================================================================================================================================================================================================|
Mapping variants: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 51590.46it/s]
In [24]: geno = bgen["genotype"][8].compute()
In [25]: v = bgen["variants"].compute().iloc[8]
In [26]: print(v)
id
rsid M9
chrom 01
pos 9
nalleles 8
allele_ids A,G,GT,GTT,GTTT,GTTTT,GTTTTT,GTTTTTT
vaddr 783
Name: 8, dtype: object
In [27]: print(geno["probs"])
[[ 1. 0. 0. 0. 0. 0. 0. 0. nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
[ 0. 0. 0. 0. 0. 0. 1. 0. nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
[ 0. 0. 0. 0. 1. 0. 0. 0. nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
In [28]: print(geno["probs"].shape)
(4, 36)
I already found the allele_ids
, but how do I know which column in geno["probs"]
contains which genotype?
For example, column 1 could contain A/A
, A/G
, etc.
In the end, my target is to extract in a certain genomic range the most likely sequence for each individual.
That is the tricky part because BGEN allows for very general genotype. It depends wether you have Unphased genotype, Phased genotypes, number of alleles, ploiyd. The quick start gives a quick idea on how to perform the association between probability and genotype (in particular, read the comments in the code). But the ultimate answer is in the section Per-sample order of stored probabilities of bgen specification.
Hm, I see. How do you solve this problem in your projects? Do you have a method which calculates it? Otherwise, I could try to write one...
Have a look at allele_expectation
: https://bgen-reader.readthedocs.io/en/latest/expectation.html
But make sure you have unphased genotype: "This function supports unphased genotypes only."
Ah, thank you for the hint, I found what I searched for: https://github.com/limix/bgen-reader-py/blob/2651d8736c36e797b029dedc0f1ccaf65fa735f8/bgen_reader/_helper.py#L1
Would it be possible to have this function publicly exported in genotype?
Sure. Will do it for the 3.1.0 release
Hi, how do I retrieve the genotype column ID for
genotype[x].compute()["probs"]
?