bgen_reader.allele_expectation allocates memory based on unindexed genotype

jordanero commented 3 years ago

bgen_reader.allele_expectation allocates memory based on the unindexed genotype, that is, on the full samples-by-variants shape rather than on the requested index. This causes problems when indexing a large bgen file (for example, UK Biobank).

The following code attempts to allocate a 4.45 TiB array when computing the expectation for a single variant and sample:

    from bgen_reader import open_bgen

    bgen = open_bgen('ukb_imp_chr22_v3.bgen',
                     samples_filepath='ukb1404_imp_chr1_v2_s487406.sample',
                     verbose=True)
    bgen.allele_expectation(index=(1, 1))

    Traceback (most recent call last):
      File "", line 1, in
      File "/n/home12/jrossen/.conda/envs/python3/lib/python3.8/site-packages/bgen_reader/_bgen2.py", line 1381, in allele_expectation
        ploidy0 = self.read(return_probabilities=False, return_ploidies=True)[
      File "/n/home12/jrossen/.conda/envs/python3/lib/python3.8/site-packages/bgen_reader/_bgen2.py", line 563, in read
        ploidy_val = np.full(
      File "/n/home12/jrossen/.conda/envs/python3/lib/python3.8/site-packages/numpy/core/numeric.py", line 343, in full
        a = empty(shape, dtype, order)
    numpy.core._exceptions.MemoryError: Unable to allocate 4.45 TiB for an array with shape (487409, 1255683) and data type int64
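For reference, the reported size is what a dense int64 array covering every sample and every variant would need. A quick check of the arithmetic, using only the shape and dtype from the error message (the helper name `array_bytes` is made up for illustration and is not part of bgen-reader):

```python
def array_bytes(shape, itemsize=8):
    """Bytes needed for a dense array; itemsize=8 corresponds to int64."""
    total = itemsize
    for dim in shape:
        total *= dim
    return total


TIB = 2 ** 40  # bytes per tebibyte

full_shape = (487409, 1255683)  # shape reported in the MemoryError
selected_shape = (1, 1)         # one sample, one variant

full_tib = array_bytes(full_shape) / TIB
selected_bytes = array_bytes(selected_shape)

print(f"full: {full_tib:.2f} TiB, selected: {selected_bytes} bytes")
# prints: full: 4.45 TiB, selected: 8 bytes
```

So an array sized by the selection would need 8 bytes, while one sized by the whole file needs about 4.45 TiB.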

CarlKCarlK commented 3 years ago

jordanero,

Thanks for your bug report and thanks for using bgen-reader.

This looks like a possible problem in my section of the code (_bgen2.py). I'm on vacation and may not be able to look at it fully for a week. In the meantime, you may be able to work around the problem by using the Dask-inspired API (https://bgen-reader.readthedocs.io/en/latest/daskapi.html).
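A rough sketch of that workaround, assuming the `read_bgen` function and the module-level `allele_expectation(bgen, variant_idx)` function described on that documentation page. The wrapper name `expectation_for_variant` and its arguments are placeholders for illustration, and this has not been run against the UK Biobank files:

```python
def expectation_for_variant(bgen_path, variant_idx, samples_path=None):
    """Allele expectation for a single variant via the Dask-inspired API.

    Hypothetical helper; only read_bgen and allele_expectation come from
    bgen-reader.
    """
    # Imported inside the function so the sketch loads even where
    # bgen-reader is not installed.
    from bgen_reader import allele_expectation, read_bgen

    bgen = read_bgen(bgen_path, samples_filepath=samples_path, verbose=False)
    # The expectation is requested per variant, so the result should have
    # one row per sample and one column per allele.
    return allele_expectation(bgen, variant_idx)
```

Calling it as `expectation_for_variant('ukb_imp_chr22_v3.bgen', 1)` and then indexing the row for the sample of interest should give the same quantity as the failing call, without going through `open_bgen.read`.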

Yours, Carl

From: jordanero
Sent: Tuesday, August 24, 2021 2:47 PM
To: limix/bgen-reader-py
Subject: [limix/bgen-reader-py] bgen_reader.allele_expectation allocates memory based on unindexed genotype (#40)

jordanero commented 3 years ago

That's helpful. Thanks for making the package!

CarlKCarlK commented 3 years ago

This is fixed with branch "fixissue40".

@jordanero, you can install the fix early with:

    pip install git+git://github.com/limix/bgen-reader-py.git@fixissue40

@horta, when you get a chance, could you publish the fix?

Carl
