tomwhite opened this issue 4 years ago
Tom,
Danilo (the lead on this project) and I have a beta for a 2nd API that supports bulk reads. It is as fast as we could make it (reading about 4 million distributions per second). It is based on NumPy rather than Dask, but I assume you could convert any data to Dask as needed.
A beta quickstart document is here: https://fastlmm.github.io/PySnpTools/branch/bgen2/index.html.
You can install the beta for Linux Python 3.7 from here (the 3.8 wheel is where you'd expect; Windows and macOS builds are available, too): pip install https://github.com/fastlmm/PySnpTools/releases/download/v4.0.4/bgen_reader-4.0.4-cp37-cp37m-manylinux1_x86_64.whl
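If it helps, here is roughly what the new API looks like (a sketch going by the quickstart above; double-check names like open_bgen and read there, since this is a beta):

# Sketch of the bulk-read (NumPy-style) API from the beta quickstart.
# The names open_bgen and read follow the linked docs; treat details
# such as keyword arguments as assumptions until checked there.
import numpy as np
from bgen_reader import open_bgen

bgen = open_bgen("example.bgen", verbose=False)
probs = bgen.read()                    # all variants: (nsamples, nvariants, ncombinations)
first100 = bgen.read(np.s_[:, :100])   # or just a slice of variants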
Let us know if this seems useful for you.
From: Tom White, Sent: Thursday, July 23, 2020 7:01 AM, Subject: [limix/bgen-reader-py] Feature request: bulk distributed access (#30)
Thank you for all your work on this library! We are using bgen-reader-py to implement a Dask-based BGEN reader in https://github.com/pystatgen/sgkit-bgen/pull/1. There is already some support for Dask, but (as far as I can tell) the delayed object that is created is for a single row (variant), rather than a chunk of rows. (See https://github.com/limix/bgen-reader-py/blob/master/bgen_reader/_genotype.py#L46.) This is a good choice for random access, but may be prohibitive for bulk access.
Would it be possible to support bulk access with an API that returns the genotype probability array as a Dask array?
@CarlKCarlK that's great to know! That looks very useful. I will dig in and try it out soon.
That should be done by the end of this month (August).
@tomwhite I'm trying to figure out a nice way to provide bulk access.
As you probably know, the probabilities matrix of genotype 0 may have a different number of columns than that of genotype 1, since the ploidy is allowed to vary. Not only that, but a genotype is (in bgen terms) defined by its probabilities as well as its phasedness, ploidy, and missingness:
In [6]: genotype[0].compute()
Out[6]:
{'probs': array([[1., 0., 0.],
[1., 0., 0.],
[1., 0., 0.],
...,
[1., 0., 0.],
[1., 0., 0.],
[1., 0., 0.]]),
'phased': False,
'ploidy': array([2, 2, 2, ..., 2, 2, 2]),
'missing': array([False, False, False, ..., False, False, False])}
In [7]: genotype[1].compute()
Out[7]:
{'probs': array([[nan, nan, nan],
[nan, nan, nan],
[nan, nan, nan],
...,
[ 1., 0., 0.],
[ 1., 0., 0.],
[nan, nan, nan]]),
'phased': False,
'ploidy': array([2, 2, 2, ..., 2, 2, 2]),
'missing': array([ True, True, True, ..., False, False, True])}
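For unphased data, the number of probability columns is the number of possible genotypes, C(ploidy + nalleles - 1, nalleles - 1) per the bgen spec, which is why the width can vary per variant. A quick sanity check:

# Number of genotype probabilities per sample for an unphased variant
# (from the BGEN spec: combinations with repetition of the alleles).
from math import comb

def ncombinations(ploidy: int, nalleles: int) -> int:
    return comb(ploidy + nalleles - 1, nalleles - 1)

assert ncombinations(2, 2) == 3  # diploid, biallelic: AA, AB, BB
assert ncombinations(4, 2) == 5  # tetraploid, biallelic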
Without trying to be fancy, a quick way to implement bulk access is by indexing genotype by its "bulk index" instead of its genotype index. Something like this:
In [6]: bulk[0].compute().genotype[0]
Out[6]:
{'probs': array([[1., 0., 0.],
[1., 0., 0.],
[1., 0., 0.],
...,
[1., 0., 0.],
[1., 0., 0.],
[1., 0., 0.]]),
'phased': False,
'ploidy': array([2, 2, 2, ..., 2, 2, 2]),
'missing': array([False, False, False, ..., False, False, False])}
In [7]: bulk[0].compute().genotype[1]
Out[7]:
{'probs': array([[nan, nan, nan],
[nan, nan, nan],
[nan, nan, nan],
...,
[ 1., 0., 0.],
[ 1., 0., 0.],
[nan, nan, nan]]),
'phased': False,
'ploidy': array([2, 2, 2, ..., 2, 2, 2]),
'missing': array([ True, True, True, ..., False, False, True])}
Does that appeal to you?
@CarlKCarlK pointed out that he already has bulk access implemented (in the NumPy-like interface, accessed via open_bgen). I'm also interested in having that in the Dask interface, via read_bgen.
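For context, the current per-variant pattern via read_bgen looks roughly like this (a sketch; the dict keys follow the bgen-reader docs):

# Sketch of the existing per-variant Dask interface: "genotype" is a list
# of dask.delayed objects, one per variant, which is what makes bulk
# access slow compared to reading chunks of variants at once.
from bgen_reader import read_bgen

bgen = read_bgen("example.bgen", verbose=False)
geno0 = bgen["genotype"][0].compute()  # dict: probs, phased, ploidy, missing
probs = geno0["probs"]                 # (nsamples, ncombinations)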
Just want to ensure @eric-czech is in this conversation too as he and Carl have been in discussion separately.
Hey @CarlKCarlK, I tested this on a single UKB file and had some results to share.
I ran a script for pybgen, bgen_reader, and the new beta to compare times / resource usage on a single XY PAR file scan. The code and the times I recorded are at https://github.com/eric-czech/sgkit-dev/tree/master/io/bgen/ukb-test.
The file is only 4.6G and here are the times I got on a single core:
I did a little more profiling on the new beta in bgen_reader2_prof.ipynb, though it only loads 10k of the 45,906 variants to speed things up. I found that in reading these 10k variants after deleting the metafile, I see resource utilization like this:
The metafile apparently takes about 2m 40s to create, which would be approximately 8% of the overall time. It seems to take a somewhat marginal amount of time compared to the scan, and the cpu/memory usage profiles are otherwise very stable.
Is this about what you would have expected? Our use case would be to traverse the file once and convert it to another format -- is there any relatively simple way to speed up a scan like that?
A few smaller questions:
metadata -- time=0:02:41.61, step 3: part 45,000 of 45,906
Found 10000 variants
reading -- time=0:00:34.72, part 990 of 1,000 # --> Never makes it to 1,000
reading -- time=0:00:37.07, part 990 of 1,000
If the progress prints are coming from C++, https://github.com/p-ranav/indicators is an option.
Eric,
Thanks! At the bottom of this note is the inner loop of the Python Bgen2 reader. In summary, it reads one variant at a time with these steps: open the variant's genotype, read its probabilities into a buffer, and close it.
The Open, Read, and Close methods are Danilo Horta's C/C++ code.
I haven’t profiled a run (we should), but I assume most of the time is being spent in lib.bgen_genotype_read.
p.s. I fixed the logging off-by-one errors and added a TODO to look at tqdm (https://github.com/tqdm/tqdm).
From https://github.com/limix/bgen-reader-py/blob/bgen2memmap/bgen_reader/_bgen2.py
with _log_in_place("reading", self._verbose) as updater:
for out_index, vaddr0 in enumerate(vaddr):
if out_index % vaddr_per_second == 0:
updater("part {0:,} of {1:,}".format(out_index, len(vaddr)))
genotype = lib.bgen_file_open_genotype(self._bgen._bgen_file, vaddr0)
if return_probabilities:
if (
prob_buffer is None
or ncombinations[out_index] != prob_buffer.shape[-1]
):
prob_buffer = np.full( # LATER could offer an option to memmap this buffer
(self.nsamples, ncombinations[out_index]),
np.nan,
order="C",
dtype="float64",
)
lib.bgen_genotype_read(
genotype, ffi.cast("double *", prob_buffer.ctypes.data)
)
val[:, out_index, : ncombinations[out_index]] = (
prob_buffer
if (samples_index is self._sample_range)
else prob_buffer[samples_index, :]
)
if return_missings:
[…]
if return_ploidies:
[…]
lib.bgen_genotype_close(genotype)
Update:
Danilo has ideas on making it faster with C/C++ batch readers.
I did the profiling: 84% of the time was in the read, 12% in the open. Everything else was a small percentage.
Thanks for the benchmark, @eric-czech!
I have found some bottlenecks in bgen-reader. Here is a small comparison now:
% python pybgen_exec.py run --path=$HOME/carl/merged_487400x220000.bgen
0%| | 778/220000 [00:18<1:28:18, 41.37it/s]
% python bgen_reader_exec.py run --path=$HOME/carl/merged_487400x220000.bgen
0%| | 483/220000 [00:10<1:22:19, 44.44it/s]
I stopped each run after roughly 10 seconds just to get an idea.
I modified your pybgen_exec.py only to use tqdm:
import fire
import tqdm
from pybgen import PyBGEN

def run(path):
    ct = 0
    with PyBGEN(path) as bgen:
        n = bgen.nb_variants
        for info, dosage in tqdm.tqdm(bgen, total=n):
            assert dosage.ndim == 1
            ct += dosage.size
    print(f"Number of entries read: {ct}")

fire.Fire()
I will try to release version 4.0.5 of bgen-reader tonight with that speed-up. It would be great if you could find time to run the benchmarks again on that new version (I will come back here as soon as I release it).
It will come with Carl's API as well, which might still be experimental, but I think that is fine for now.
Notice that I have not used bulk reading (nor implemented it yet), although reading the variants in order using bgen_reader.read_bgen is faster than reading them in random order.
@hammer Thanks for the pointer. I didn't know about that library.
I've coded the low-level of bgen in C, and at the time I did not find a nice progress bar library for plain C. So I coded my own: https://github.com/horta/almosthere. It is very humble compared to the one you showed: ASCII based only.
Aside: I have my own, too 😊 A single 45-line Python function called "_log_in_place". Here is a usage example. Notice the "i + 1"; that fixes the off-by-one error Eric found.
with _log_in_place("metadata", self._verbose) as updater:
for i in range(self.nsamples):
if i % 1000 == 0:
updater("'sample range': part {0:,} of {1:,}".format(i+1, self.nsamples))
sample_range_memmap[i] = i
bgen-reader 4.0.5 has been released: https://pypi.org/project/bgen-reader/
It should be a lot faster now. I will be working on the bulk idea or on Carl's interface.
The Open, Read, and Close methods are Danilo Horta's C/C++ code.
Good to know, thanks @CarlKCarlK.
It would be great if you could find time to run the benchmarks again on that new version (I will come back here as soon as I release it). bgen-reader 4.0.5 has been released: https://pypi.org/project/bgen-reader/
Beautiful, appreciate the quick turnaround @horta! I'll give that a try again soon.
It should be a lot faster now. I will be working on the bulk idea or on Carl's interface.
Definitely! Here are the times I got with the new code in context with the previous times:
Here is the resource utilization profile I saw for bgen_reader 4.0.5 too:
The low memory usage there is great! I know the graph seems to imply CPU usage was also 100%, but it's actually 100% out of 400% (4 cores on the VM), so it seemed like there wasn't any real benefit to the threading via dask delayed, assuming that's how it still works. I verified by running a big random matrix multiplication, which produces a similar graph at 400% CPU utilization across the board. All the related code is in bgen_reader_prof.ipynb.
I'm assuming that the GIL is the culprit with the default Dask scheduler, though I'm curious, @horta, whether you did anything special to see higher concurrency when you switched to Dask in the past. I could try scheduler='processes', but my experience is that it usually ends up being pretty close to a wash with scheduler='threads' because of all the serialization for inter-process communication. Let me know if you have any advice.
EDIT: I tried it with the multiprocessing scheduler instead and it was very slow (on track for ~11 hrs), but that's not very fair because I'm asking it to send all the data back to the driver process rather than performing some reduction over it. It would be great if Dask didn't have this inter-process communication bottleneck -- it keeps coming up!
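For reference, switching schedulers is just a per-compute option (toy example, not the bgen code):

import dask.array as da

# Toy example of choosing a Dask scheduler per compute() call:
# "threads" shares memory but contends on the GIL; "processes" sidesteps
# the GIL but pays to serialize results between worker processes.
x = da.random.random((10_000, 1_000), chunks=(1_000, 1_000))
total = x.sum()
print(total.compute(scheduler="threads"))
print(total.compute(scheduler="processes"))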
Also, out of curiosity @horta, I was trying to find the commits related to your changes that made such a huge improvement in 4.0.5. Is there one you think would be good to look at to highlight the difference? I was just hoping to get a bit more familiar with how it all works internally. Thanks again for whatever you did!
This one: https://github.com/limix/bgen-reader-py/commit/b40fc367bef652b36703ff75e23177c194ddd076
Before, with a dataset having 500k samples, it would read one variant at a time, meaning it passed a single variant offset to the function read_genotype_partition, called from here: https://github.com/limix/bgen-reader-py/blob/eceda00c6f34fa66882c39e210407e50f5f8ebd8/bgen_reader/_genotype.py#L42
Every time the function read_genotype_partition is called, a bgen_file is opened (which opens the bgen file, reads the header, and ftells). So minimizing that call will minimize that overhead.
There are basically only two overheads that, if minimized, would make everything even faster:
- bgen_metafile
- bgen_file
If you have a single core to use and want to process the whole bgen file as fast as possible, it would suffice to do something like:
with bgen_metafile(metafile_filepath) as mf:
    with bgen_file(bgen_filepath) as bgen:
        # Loop over variants in sequence
You can do the above very easily by using this package I released last week, which will be used by bgen-reader very soon: https://github.com/limix/cbgen-py
I believe Carl's interface tries to work along those lines, but using memmap from numpy.
You will find examples of using cbgen here: https://github.com/limix/cbgen-py/blob/master/cbgen/test/test_bgen.py
Or you can wait until we make new releases of bgen-reader.
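Roughly, such a sequential scan would look like this (a sketch; see the tests linked above for the exact names, e.g. read_partition and variants.offset):

# Hypothetical single-core sequential scan with cbgen: open the metafile
# and bgen file once, then walk the variants in order. Names
# (npartitions, read_partition, variants.offset, read_genotype,
# probability) follow the linked tests and may change between releases.
import cbgen

with cbgen.bgen_metafile("example.bgen.metafile") as mf:
    with cbgen.bgen_file("example.bgen") as bgen:
        for i in range(mf.npartitions):
            part = mf.read_partition(i)
            for offset in part.variants.offset:
                g = bgen.read_genotype(offset)  # open + read + close in C
                probs = g.probability           # (nsamples, ncombinations)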
Every time the function read_genotype_partition is called, a bgen_file is opened (which opens the bgen file, reads the header, and ftells). So minimizing that call will minimize that overhead.
👍
You can do the above very easily by using this package I released last week which will be used by bgen-reader very soon: https://github.com/limix/cbgen-py
Awesome! Ok, I will give that a spin -- thanks for the examples!
Hey @horta, I tried using cbgen rather than bgen-reader and everything worked nicely!
As an experiment I was going to try to lift the file open/close logic out of _bgen_file.py#L70-L71 into my own loop but that's a bit out of my depth. I wasn't sure what the offset should be in:
def read_genotype(self, offset: int) -> Genotype:
    gt: CData = lib.bgen_file_open_genotype(self._bgen_file, offset)
    ...
    lib.bgen_genotype_close(gt)
Is there a fairly easy way to do that?
Oh, and FYI, here is a resource profile I saw when using cbgen to read a UKB contig:
This is from a cbgen_reader.py wrapper I made to potentially replace the sgkit-bgen bgen_reader.py wrapper, and I thought the sawtooth memory usage was odd. I didn't see that earlier in https://github.com/limix/bgen-reader-py/issues/30#issuecomment-674923428 when reading the variants in order, but I'm not sure what is being "built up" in this case. The wrapper pre-allocates the result arrays, so the memory usage shouldn't be so incremental.
Might there be anything in cbgen or bgen_reader that would cause this gradual increase in memory use when reading blocks of variants out of order?
Hey @eric-czech, the offset for each variant is the result of ftell at the start of the variant block in the bgen file: https://github.com/limix/bgen/blob/a311bcfd32ece93b92d9c2cd9bf964e8d55cb1a6/src/file.c#L145
There is no easy way to get all those offsets. As the bgen file specification stands, you don't know where the variant block starts for variant n without knowing where variant n-1 started (and knowing the size of variant block n-1).
That is one of the reasons index files exist for bgen files.
Regarding memory, I don't know. Isn't that just garbage collection by Python? (You can try gc.collect() to rule that out.)
Also, if that is out of order, and assuming that you are reading in a uniformly random fashion, why would that pattern be so precise? It looks odd to me too, but I have to admit that I don't have much experience with how garbage collection tends to behave. The cbgen library does not do anything fancy: allocate numpy arrays, use the C bgen library to copy the requested data into them, and that is it.
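Something like this inside your block-read loop would tell (read_block here is a hypothetical stand-in for your wrapper's read):

import gc

# Force a full collection after each out-of-order block read; if the
# sawtooth persists, the growth is real allocation rather than GC lag.
# read_block is a hypothetical stand-in for the wrapper's read function.
def scan(read_block, block_starts):
    for start in block_starts:
        read_block(start)
        gc.collect()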
There is no easy way to get all those offsets. As the bgen file specification stands, you don't know where the variant block starts for variant n without knowing where variant n-1 started (and knowing the size of variant block n-1).
Is it not possible to:
- advance the open file to where a wanted variant starts
- call lib.bgen_file_open_genotype(self._bgen_file, offset) with that offset?
I have no idea how the "Advance the open file" part would work in cbgen -- is that more difficult than it sounds?
Also, if that is out of order, and assuming that you are reading in a uniformly random fashion
Mm, well, what I mean is that Dask asks us to fetch some block of adjacent variants (say 100 of them), and there are no guarantees on the order in which it will ask for the blocks, but they do typically go in order. I.e., with two workers, process/thread 1 fetches variants 0 to 100 while process/thread 2 fetches variants 100 to 200, then worker 1 does 200 to 300, etc. "Arbitrary blocks of some size" is probably what I should have called it rather than random.
The cbgen library does not do anything fancy: allocate numpy arrays, use the C bgen library to copy the requested data into them, and that is it.
Makes sense. It's probably something in Dask or our wrapper then -- I saw a similar pattern when making a Dask wrapper around pybgen too.
I've created an issue to implement "advance the open file". I think it won't change anything: opening variant genotypes in sequence should have the same effect as "advancing the open file". The important bit of the C library for that: https://github.com/limix/bgen/blob/a311bcfd32ece93b92d9c2cd9bf964e8d55cb1a6/src/file.c#L145
I do perform an fseek each time a variant genotype is opened, but I believe the underlying implementation is smart enough to figure out that we are already at the right file position. (I just couldn't find an explicit explanation of that on Google.)
Just to clarify: after you close an opened variant genotype n, the file stream position remains at the start position of variant genotype n+1. This means that subsequently opening variant genotype n+1 and reading it should have zero overhead regarding file stream position. (I'm assuming here that performing an fseek to the current position has zero overhead.)
Just to clarify: after you close an opened variant genotype n, the file stream position remains at the start position of variant genotype n+1. This means that subsequently opening variant genotype n+1 and reading it should have zero overhead regarding file stream position.
Ah ok, thanks @horta.
The biggest killer on reading into a Dask array then, AFAICT, is the lack of sample slicing for chunks. Right now, the way we're reading the data in might say that chunks are of size, say, 1,000 variants x 10,000 samples, but that still means we have to read all the samples for each variant chunk (so I'm reading the same variant data upwards of 10 times for UKB).
Do you know if there's some relatively simple way to have cbgen/bgen-reader not deserialize all the unwanted samples into memory? In other words, is it possible to do something like this:
- Take a single variant and a subset of adjacent samples to get data for
- Seek to the start of the data for the variant
- Further seek to the start of the data for the sample block
- Scan and read all probabilities until you get to the end of the sample block
Let me know if that doesn't make sense.
The biggest killer on reading to a Dask array then, AFAICT, is the lack of sample slicing for chunks.
Another way of tackling this would be to convert to an intermediate chunked format (like zarr). This would be done once, then analysis would be performed on the intermediate format, which would have efficient, chunked read patterns. Perhaps this could be something we recommend for large numbers of samples?
This is the way I am doing it for VCF, which is the only real option as we don't have offset information for the n-th variant, so we need to convert to something like zarr anyway.
Another way of tackling this would be to convert to an intermediate chunked format (like zarr).
That's basically what I'm after: a more efficient way to do that conversion. Are you talking about some process for it that doesn't use Dask?
I'd like to think of this as a one-time conversion whose cost we simply eat, but even now, when I'm using ridiculously wide chunks (~1,000 x ~100,000) to avoid some of the repetitive sample reads, it looks like it will cost roughly $300 in vCPU + memory hours (storage aside). That isn't so bad, I suppose, but I'd like to get those chunks down to more like 10k x 10k, which would hypothetically cost ~5x as much if the write and read workloads are fairly balanced (IO/network bandwidth aside).
Are you talking about some process for it that doesn't use Dask?
No, it still uses Dask, but instead of using dask.array.from_array to create a Dask array and then writing it to Zarr, use Dask delayed to read ranges of variants in parallel (and all their samples) and write to Zarr directly. This should avoid repetitive sample reads, while still getting chunking in the samples dimension (by setting the Zarr chunks appropriately).
I do something similar for VCF, but it should be simpler for BGEN since you can read variant ranges that align with Zarr chunks, which means you can avoid the concat and rechunk operations that are needed for VCF. See https://github.com/tomwhite/sgkit-vcf/blob/master/sgkit_vcf/vcf_reader.py#L206-L221
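A minimal sketch of that pattern (read_range is a hypothetical stand-in for "read variants [start, stop) plus all their samples"; the shapes are made up):

import dask
import zarr

# Write each aligned variant range straight into the matching Zarr region.
# read_range(start, stop) is a hypothetical stand-in that returns a
# (stop - start, nsamples, ncombinations) probability array read via
# cbgen/bgen-reader.
nvariants, nsamples, ncomb, chunk = 45_906, 487_400, 3, 1_000
store = zarr.open(
    "probs.zarr", mode="w",
    shape=(nvariants, nsamples, ncomb),
    chunks=(chunk, 10_000, ncomb), dtype="float32",
)

@dask.delayed
def write_range(start, stop):
    store[start:stop] = read_range(start, stop)  # aligns with Zarr chunks

dask.compute(*[
    write_range(s, min(s + chunk, nvariants))
    for s in range(0, nvariants, chunk)
])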
use Dask delayed to read ranges of variants in parallel (and all their samples) and write to Zarr directly
Ahhh, I see what you mean. I would like to think that there is some way to seek between blocks of sample data for different variants in bgen, but I can't imagine how that's possible if compression is on. I'll try a similar bgen_to_zarr function instead.
I would like to think that there is some way to seek between blocks of sample data for different variants in bgen, but I can't imagine how that's possible if compression is on.
Exactly. The genotype blocks in BGEN are compressed using zlib or zstd, and so there's no way to avoid decompressing the whole block to only pull out a subset of samples.
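To illustrate with zlib (toy shapes):

import zlib
import numpy as np

# Each variant's genotype block sits on disk as one compressed blob, so
# pulling out even a few samples requires decompressing the whole block.
probs = np.random.rand(487_400, 3)              # all samples, one variant
blob = zlib.compress(probs.tobytes())           # roughly how it is stored
decoded = np.frombuffer(zlib.decompress(blob))  # must inflate everything
subset = decoded.reshape(-1, 3)[:10_000]        # slicing only possible after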
Do you know if there's some relatively simple way to have cbgen/bgen-reader not deserialize all the unwanted samples into memory? In other words, is it possible to do something like this:
- Take a single variant and a subset of adjacent samples to get data for
- Seek to the start of the data for the variant
- Further seek to the start of the data for the sample block
- Scan and read all probabilities until you get to the end of the sample block
If compressed, the whole structure of probabilities, ploidy, phasedness, and missingness for a single variant, including all samples, has to be decompressed first. So I think the answer to your question is no whenever the bgen file uses compression (which I believe will be the case most of the time).
If it is not compressed, then it seems possible to read only the probabilities of the samples you are interested in (https://www.well.ox.ac.uk/~gav/bgen_format/spec/latest.html).
Oops, you already figured out the answer. Sorry, didn't read before answering!