Feature request: bulk distributed access

tomwhite commented 4 years ago

Thank you for all your work on this library! We are using bgen-reader-py to implement a Dask-based BGEN reader in https://github.com/pystatgen/sgkit-bgen/pull/1. There is already some support for Dask, but (as far as I can tell) the delayed object that is created is for a single row (variant), rather than a chunk of rows. (See https://github.com/limix/bgen-reader-py/blob/master/bgen_reader/_genotype.py#L46.) This is a good choice for random access, but may be prohibitive for bulk access.

Would it be possible to support bulk access with an API that returns the genotype probability array as a Dask array?

CarlKCarlK commented 4 years ago

Tom,

Danilo (the lead on this project) and I have a beta for a 2nd API that supports bulk reads. It is as fast as we could make it (reading about 4 million distributions per second). It is based on NumPy rather than Dask, but I assume you could convert any data to Dask as needed.

A beta quickstart document is here: https://fastlmm.github.io/PySnpTools/branch/bgen2/index.html.

You can install the beta for Linux 3.7 from here (3.8 is where’d you expect. Windows and MacOS available, too.) pip install https://github.com/fastlmm/PySnpTools/releases/download/v4.0.4/bgen_reader-4.0.4-cp37-cp37m-manylinux1_x86_64.whl

Let us know if this seems useful for you.

Carl https://github.com/carlkcarlk

From: Tom White notifications@github.com Sent: Thursday, July 23, 2020 7:01 AM To: limix/bgen-reader-py bgen-reader-py@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [limix/bgen-reader-py] Feature request: bulk distributed access (#30)

Thank you for all your work on this library! We are using bgen-reader-py to implement a Dask-based BGEN reader in pystatgen/sgkit-bgen#1https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fpystatgen%2Fsgkit-bgen%2Fpull%2F1&data=02%7C01%7C%7C6632ef428cb84168572c08d82f10d6db%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637311096648889003&sdata=4NOVsKG19gUsgPJFUvAucIb6rKBxA7EY1xvF%2Fhdzogk%3D&reserved=0. There is already some support for Dask, but (as far as I can tell) the delayed object that is created is for a single row (variant), rather than a chunk of rows. (See https://github.com/limix/bgen-reader-py/blob/master/bgen_reader/_genotype.py#L46 https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flimix%2Fbgen-reader-py%2Fblob%2Fmaster%2Fbgen_reader%2F_genotype.py%23L46&data=02%7C01%7C%7C6632ef428cb84168572c08d82f10d6db%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637311096648898996&sdata=j7wcgygIfgRrUUSdUzPoK5pZbdsINUjiHrlCTVzCEkw%3D&reserved=0.) This is a good choice for random access, but may be prohibitive for bulk access.

Would it be possible to support bulk access with an API that returns the genotype probability array as a Dask array?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flimix%2Fbgen-reader-py%2Fissues%2F30&data=02%7C01%7C%7C6632ef428cb84168572c08d82f10d6db%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637311096648898996&sdata=zTNVSdCWbwfmRzy5AgyWeYEYl%2BI2UCJqGd5rPEhqdxw%3D&reserved=0, or unsubscribehttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABR65P5NIAVH5B7ISVNJAOTR5A7B7ANCNFSM4PFYCRGQ&data=02%7C01%7C%7C6632ef428cb84168572c08d82f10d6db%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637311096648908993&sdata=GeYp1%2FTheXaZNvd2siDnt7JVvHSWF%2B46bx2i2%2BZ5210%3D&reserved=0.

tomwhite commented 4 years ago

@CarlKCarlK that's great to know! That looks very useful. I will dig in and try it out soon.

horta commented 4 years ago

That should be done by end of this month (August).

horta commented 4 years ago

@tomwhite I'm trying to figure out a nice way to provide bulk access.

As you probably know, the probabilities matrix of genotype 0 maybe have different number of columns of genotype 1 as the ploidy is allowed to vary. Not only that but a genotype is (in bgen terms) defined in terms of probabilities as well as being phased, ploidy, missingness.

In [6]: genotype[0].compute()
Out[6]:
{'probs': array([[1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        ...,
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.]]),
 'phased': False,
 'ploidy': array([2, 2, 2, ..., 2, 2, 2]),
 'missing': array([False, False, False, ..., False, False, False])}

In [7]: genotype[1].compute()
Out[7]:
{'probs': array([[nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan],
        ...,
        [ 1.,  0.,  0.],
        [ 1.,  0.,  0.],
        [nan, nan, nan]]),
 'phased': False,
 'ploidy': array([2, 2, 2, ..., 2, 2, 2]),
 'missing': array([ True,  True,  True, ..., False, False,  True])}

Without trying to be fancy, a quick way to implement bulk access is by indexing genotype by its "bulk index" instead of its genotype index. Something like that:

In [6]: bulk[0].compute().genotype[0]
Out[6]:
{'probs': array([[1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        ...,
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.]]),
 'phased': False,
 'ploidy': array([2, 2, 2, ..., 2, 2, 2]),
 'missing': array([False, False, False, ..., False, False, False])}

In [7]: bulk[0].compute().genotype[1]
Out[7]:
{'probs': array([[nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan],
        ...,
        [ 1.,  0.,  0.],
        [ 1.,  0.,  0.],
        [nan, nan, nan]]),
 'phased': False,
 'ploidy': array([2, 2, 2, ..., 2, 2, 2]),
 'missing': array([ True,  True,  True, ..., False, False,  True])}

Does it appeal you?

horta commented 4 years ago

@CarlKCarlK pointed out that he already have bulk access implemented (in the numpy-like interface, accessed via open_bgen). I'm also interested in having that in the dask interface: via read_bgen

hammer commented 4 years ago

Just want to ensure @eric-czech is in this conversation too as he and Carl have been in discussion separately.

eric-czech commented 4 years ago

Hey @CarlKCarlK, I tested this on a single UKB file and had some results to share.

I ran a script for pybgen, bgen_reader, and the new beta to compare times / resource usage on a single XY PAR file scan. The code and the times I recorded are at https://github.com/eric-czech/sgkit-dev/tree/master/io/bgen/ukb-test.

The file is only 4.6G and here are the times I got on a single core:

PyBGEN: ~28 minutes (the .bgi file already exists)
bgen_reader: I stopped it after an hour but it was on track to take ~4 hours
bgen_reader2: ~34 minutes not including the metadata construction time

I did a little more profiling on the new beta in bgen_reader2_prof.ipynb, though this only loads 10k variants of 45,906 to speed things up. I found that in reading these 10k variants after deleting the metafile, I see resource utilization like this:

prof

The metafile apparently takes about 2m 40s to create, which would be approximately 8% of the overall time. It seems to take a somewhat marginal amount of time compared to the scan, and the cpu/memory usage profiles are otherwise very stable.

Is this about what you would have expected? Our use case would be to traverse the file once and convert it to another format -- is there any relatively simple way to speed up a scan like that?

A few smaller questions:

Have you ever used tqdm? Assuming the custom progress prints are coming from python and not c++, that could make things easier. I noticed a small bug with each print where they seem to be off on the end interval, i.e.
```
metadata -- time=0:02:41.61, step 3: part 45,000 of 45,906
Found 10000 variants
reading -- time=0:00:34.72, part 990 of 1,000 # --> Never makes it to 1,000 
reading -- time=0:00:37.07, part 990 of 1,000
```
I was doing my scan by grabbing row slices in batches of 1000 variants but it looked like the code was actually pulling them in smaller batches of 10. Each "part N" message was incrementing by 10 from 0 up to 1000. Does it work that way regardless of the file size?

hammer commented 4 years ago

If the progress prints are coming from C++, https://github.com/p-ranav/indicators is an option.

CarlKCarlK commented 4 years ago

Eric,

Thanks! At the bottom of this node is the inner loop of the Python Bgen2 reader. In summary, it reads one variant at a time with these steps:

Open that variant
If the buffer can be reused (usually the case), do nothing, otherwise allocate a new buffer (for ‘complex’ files)
Read the probability data
Copy results into output array
- (If samples are also being selected, do that, too)
Close the variant.

The Open, Read, and Close methods are @Danilo Hortamailto:danilo.horta@pm.me’s C/C++ code.

I haven’t profiled a run (we should), but I assume most of the time is being spent in lib.bgen_genotype_read.

Carl

p.s. I fixed the logging off-by-one errors and added a TODO to look at tqdmhttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ftqdm%2Ftqdm&data=02%7C01%7C%7C1aa27b04dcc5462cbb6508d83d60bc54%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326832965685853&sdata=2Ck9mdIa37hAV6Z4ComRBKkqFiDfeGda805JpliK1ak%3D&reserved=0..

From https://github.com/limix/bgen-reader-py/blob/bgen2memmap/bgen_reader/_bgen2.py

    with _log_in_place("reading", self._verbose) as updater:
        for out_index, vaddr0 in enumerate(vaddr):
            if out_index % vaddr_per_second == 0:
                updater("part {0:,} of {1:,}".format(out_index, len(vaddr)))

            genotype = lib.bgen_file_open_genotype(self._bgen._bgen_file, vaddr0)

            if return_probabilities:
                if (
                    prob_buffer is None
                    or ncombinations[out_index] != prob_buffer.shape[-1]
                ):
                    prob_buffer = np.full(  # LATER could offer an option to memmap this buffer
                        (self.nsamples, ncombinations[out_index]),
                        np.nan,
                        order="C",
                        dtype="float64",
                    )
                lib.bgen_genotype_read(
                    genotype, ffi.cast("double *", prob_buffer.ctypes.data)
                )
                val[:, out_index, : ncombinations[out_index]] = (
                    prob_buffer
                    if (samples_index is self._sample_range)
                    else prob_buffer[samples_index, :]
                )

            if return_missings:

[…]

            if return_ploidies:

[…]

            lib.bgen_genotype_close(genotype)

From: Jeff Hammerbacher notifications@github.com Sent: Monday, August 10, 2020 12:39 PM To: limix/bgen-reader-py bgen-reader-py@noreply.github.com Cc: Carl Kadie carlk@msn.com; Mention mention@noreply.github.com Subject: Re: [limix/bgen-reader-py] Feature request: bulk distributed access (#30)

If the progress prints are coming from C++, https://github.com/p-ranav/indicators https://eur06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fp-ranav%2Findicators&data=02%7C01%7C%7C468ab92806744667e3bd08d83d64fbea%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326851211525643&sdata=Fthje7kaMsdQmxJJPDsdpAHW0EeAooKyc5rp1xhhM60%3D&reserved=0 is an option. — You are receiving this because you were mentioned. Reply to this email directly

From: Eric Czech notifications@github.com Sent: Monday, August 10, 2020 12:08 PM To: limix/bgen-reader-py bgen-reader-py@noreply.github.com Cc: Carl Kadie carlk@msn.com; Mention mention@noreply.github.com Subject: Re: [limix/bgen-reader-py] Feature request: bulk distributed access (#30)

Hey @CarlKCarlKhttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FCarlKCarlK&data=02%7C01%7C%7C1aa27b04dcc5462cbb6508d83d60bc54%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326832965645877&sdata=sqBaai68dP2FhoJn%2BuedXbRih3MaXzElAo%2Fs35LpbR0%3D&reserved=0, I tested this on a single UKB file and had some results to share.

I ran a script for pybgen, bgen_reader, and the new beta to compare times / resource usage on a single XY PAR file scan. The code and the times I recorded are at https://github.com/eric-czech/sgkit-dev/tree/master/io/bgen/ukb-test https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Feric-czech%2Fsgkit-dev%2Ftree%2Fmaster%2Fio%2Fbgen%2Fukb-test&data=02%7C01%7C%7C1aa27b04dcc5462cbb6508d83d60bc54%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326832965655878&sdata=KFJnv0cNdP3q9d2OreRZD29XLvyewpFz%2FOYWTxF%2BnOg%3D&reserved=0.

The file is only 4.6G and here are the times I got on a single core:

PyBGEN: ~28 minutes (the .bgi file already exists)
bgen_reader: I stopped it after an hour but it was on track to take ~4 hours
bgen_reader2: ~34 minutes not including the metadata construction time

I did a little more profiling on the new beta in bgen_reader2_prof.ipynbhttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fnbviewer.jupyter.org%2Fgithub%2Feric-czech%2Fsgkit-dev%2Fblob%2F34dba75a7db899b046d779e7373c4eb2ebd258ca%2Fio%2Fbgen%2Fukb-test%2Fbgen_reader2_prof.ipynb&data=02%7C01%7C%7C1aa27b04dcc5462cbb6508d83d60bc54%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326832965665867&sdata=QgVmI7N7mrtJ1RvViuNMj4R9w5I7IXELbByS2ouLI3Y%3D&reserved=0. I found that in reading 10k variants of 45,906 after deleting the metafile, I see resource utilization like this:

[prof]https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Feric-czech%2Fsgkit-dev%2Fraw%2Fmaster%2Fio%2Fbgen%2Fukb-test%2Fbgen_reader2_prof.png&data=02%7C01%7C%7C1aa27b04dcc5462cbb6508d83d60bc54%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326832965675861&sdata=lakJ68sPqEUEoOhScoGiMPQbRgnpb65yW6i0guS2XRc%3D&reserved=0

The metafile apparently takes about 2m 40s to create, which would be approximately 8% of the overall time. It seems to take a somewhat marginal amount of time compared to the scan, and the cpu/memory usage profiles are otherwise very stable.

Is this about what you would have expected? Our use case would be to traverse the file once and convert it to another format -- is there any relatively simple way to speed up a scan like that?

A few smaller questions:

Have you ever used tqdmhttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ftqdm%2Ftqdm&data=02%7C01%7C%7C1aa27b04dcc5462cbb6508d83d60bc54%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326832965685853&sdata=2Ck9mdIa37hAV6Z4ComRBKkqFiDfeGda805JpliK1ak%3D&reserved=0? Assuming the custom progress prints are coming from python and not c++, that could make things easier. I noticed a small bug with each print where they seem to be off on the end interval, i.e.

metadata -- time=0:02:41.61, step 3: part 45,000 of 45,906

Found 10000 variants

reading -- time=0:00:34.72, part 990 of 1,000 # --> Never makes it to 1,000

reading -- time=0:00:37.07, part 990 of 1,000

I was doing my scan by grabbing row slices in batches of 1000 variants but it looked like the code was actually pulling them in smaller batches of 10. Each "part N" message was incrementing by 10 from 0 up to 1000. Does it work that way regardless of the file size?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flimix%2Fbgen-reader-py%2Fissues%2F30%23issuecomment-671534711&data=02%7C01%7C%7C1aa27b04dcc5462cbb6508d83d60bc54%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326832965695848&sdata=SsSP0Qz5tGe8ZoR3Yiar5D3a5YXAlJoES7t4SFZCMp4%3D&reserved=0, or unsubscribehttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABR65P2MKYTB4YL64D2PG2TSABAR5ANCNFSM4PFYCRGQ&data=02%7C01%7C%7C1aa27b04dcc5462cbb6508d83d60bc54%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326832965705845&sdata=dUOMyAnSB%2B8snkAI%2BpS87REdw%2FYq6ouDs5%2F3MN1%2F1zs%3D&reserved=0.

CarlKCarlK commented 4 years ago

Update:

Danilo has ideas on making it faster with C/C++ batch readers.

I did the profiling: 84% of time was in the current read, 12% in the open. Everything else was a small %.

Carl

p.s. Notes to myself:

python -m cProfile -o test.profile test_bgen2.py
snakeviz test.profile

From: Carl KADIE Sent: Monday, August 10, 2020 1:31 PM To: limix/bgen-reader-py reply@reply.github.com; limix/bgen-reader-py bgen-reader-py@noreply.github.com; Danilo Horta danilo.horta@pm.me Cc: Mention mention@noreply.github.com Subject: RE: [limix/bgen-reader-py] Feature request: bulk distributed access (#30)

Eric,

Thanks! At the bottom of this node is the inner loop of the Python Bgen2 reader. In summary, it reads one variant at a time with these steps:

Open that variant
If the buffer can be reused (usually the case), do nothing, otherwise allocate a new buffer (for ‘complex’ files)
Read the probability data
Copy results into output array
- (If samples are also being selected, do that, too)
Close the variant.

The Open, Read, and Close methods are @Danilo Hortamailto:danilo.horta@pm.me’s C/C++ code.

I haven’t profiled a run (we should), but I assume most of the time is being spent in lib.bgen_genotype_read.

Carl

p.s. I fixed the logging off-by-one errors and added a TODO to look at tqdmhttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ftqdm%2Ftqdm&data=02%7C01%7C%7C1aa27b04dcc5462cbb6508d83d60bc54%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326832965685853&sdata=2Ck9mdIa37hAV6Z4ComRBKkqFiDfeGda805JpliK1ak%3D&reserved=0..

From https://github.com/limix/bgen-reader-py/blob/bgen2memmap/bgen_reader/_bgen2.py

    with _log_in_place("reading", self._verbose) as updater:
        for out_index, vaddr0 in enumerate(vaddr):
            if out_index % vaddr_per_second == 0:
                updater("part {0:,} of {1:,}".format(out_index, len(vaddr)))

            genotype = lib.bgen_file_open_genotype(self._bgen._bgen_file, vaddr0)

            if return_probabilities:
                if (
                    prob_buffer is None
                    or ncombinations[out_index] != prob_buffer.shape[-1]
                ):
                    prob_buffer = np.full(  # LATER could offer an option to memmap this buffer
                        (self.nsamples, ncombinations[out_index]),
                        np.nan,
                        order="C",
                        dtype="float64",
                    )
                lib.bgen_genotype_read(
                    genotype, ffi.cast("double *", prob_buffer.ctypes.data)
                )
                val[:, out_index, : ncombinations[out_index]] = (
                    prob_buffer
                    if (samples_index is self._sample_range)
                    else prob_buffer[samples_index, :]
                )

            if return_missings:

[…]

            if return_ploidies:

[…]

            lib.bgen_genotype_close(genotype)

From: Jeff Hammerbacher notifications@github.com<mailto:notifications@github.com> Sent: Monday, August 10, 2020 12:39 PM To: limix/bgen-reader-py bgen-reader-py@noreply.github.com<mailto:bgen-reader-py@noreply.github.com> Cc: Carl Kadie carlk@msn.com<mailto:carlk@msn.com>; Mention mention@noreply.github.com<mailto:mention@noreply.github.com> Subject: Re: [limix/bgen-reader-py] Feature request: bulk distributed access (#30)

If the progress prints are coming from C++, https://github.com/p-ranav/indicators https://eur06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fp-ranav%2Findicators&data=02%7C01%7C%7C468ab92806744667e3bd08d83d64fbea%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326851211525643&sdata=Fthje7kaMsdQmxJJPDsdpAHW0EeAooKyc5rp1xhhM60%3D&reserved=0 is an option. — You are receiving this because you were mentioned. Reply to this email directly

From: Eric Czech notifications@github.com<mailto:notifications@github.com> Sent: Monday, August 10, 2020 12:08 PM To: limix/bgen-reader-py bgen-reader-py@noreply.github.com<mailto:bgen-reader-py@noreply.github.com> Cc: Carl Kadie carlk@msn.com<mailto:carlk@msn.com>; Mention mention@noreply.github.com<mailto:mention@noreply.github.com> Subject: Re: [limix/bgen-reader-py] Feature request: bulk distributed access (#30)

Hey @CarlKCarlKhttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FCarlKCarlK&data=02%7C01%7C%7C1aa27b04dcc5462cbb6508d83d60bc54%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326832965645877&sdata=sqBaai68dP2FhoJn%2BuedXbRih3MaXzElAo%2Fs35LpbR0%3D&reserved=0, I tested this on a single UKB file and had some results to share.

I ran a script for pybgen, bgen_reader, and the new beta to compare times / resource usage on a single XY PAR file scan. The code and the times I recorded are at https://github.com/eric-czech/sgkit-dev/tree/master/io/bgen/ukb-test https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Feric-czech%2Fsgkit-dev%2Ftree%2Fmaster%2Fio%2Fbgen%2Fukb-test&data=02%7C01%7C%7C1aa27b04dcc5462cbb6508d83d60bc54%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326832965655878&sdata=KFJnv0cNdP3q9d2OreRZD29XLvyewpFz%2FOYWTxF%2BnOg%3D&reserved=0.

The file is only 4.6G and here are the times I got on a single core:

PyBGEN: ~28 minutes (the .bgi file already exists)
bgen_reader: I stopped it after an hour but it was on track to take ~4 hours
bgen_reader2: ~34 minutes not including the metadata construction time

I did a little more profiling on the new beta in bgen_reader2_prof.ipynbhttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fnbviewer.jupyter.org%2Fgithub%2Feric-czech%2Fsgkit-dev%2Fblob%2F34dba75a7db899b046d779e7373c4eb2ebd258ca%2Fio%2Fbgen%2Fukb-test%2Fbgen_reader2_prof.ipynb&data=02%7C01%7C%7C1aa27b04dcc5462cbb6508d83d60bc54%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326832965665867&sdata=QgVmI7N7mrtJ1RvViuNMj4R9w5I7IXELbByS2ouLI3Y%3D&reserved=0. I found that in reading 10k variants of 45,906 after deleting the metafile, I see resource utilization like this:

[prof]https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Feric-czech%2Fsgkit-dev%2Fraw%2Fmaster%2Fio%2Fbgen%2Fukb-test%2Fbgen_reader2_prof.png&data=02%7C01%7C%7C1aa27b04dcc5462cbb6508d83d60bc54%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326832965675861&sdata=lakJ68sPqEUEoOhScoGiMPQbRgnpb65yW6i0guS2XRc%3D&reserved=0

The metafile apparently takes about 2m 40s to create, which would be approximately 8% of the overall time. It seems to take a somewhat marginal amount of time compared to the scan, and the cpu/memory usage profiles are otherwise very stable.

Is this about what you would have expected? Our use case would be to traverse the file once and convert it to another format -- is there any relatively simple way to speed up a scan like that?

A few smaller questions:

Have you ever used tqdmhttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ftqdm%2Ftqdm&data=02%7C01%7C%7C1aa27b04dcc5462cbb6508d83d60bc54%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326832965685853&sdata=2Ck9mdIa37hAV6Z4ComRBKkqFiDfeGda805JpliK1ak%3D&reserved=0? Assuming the custom progress prints are coming from python and not c++, that could make things easier. I noticed a small bug with each print where they seem to be off on the end interval, i.e.

metadata -- time=0:02:41.61, step 3: part 45,000 of 45,906

Found 10000 variants

reading -- time=0:00:34.72, part 990 of 1,000 # --> Never makes it to 1,000

reading -- time=0:00:37.07, part 990 of 1,000

I was doing my scan by grabbing row slices in batches of 1000 variants but it looked like the code was actually pulling them in smaller batches of 10. Each "part N" message was incrementing by 10 from 0 up to 1000. Does it work that way regardless of the file size?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flimix%2Fbgen-reader-py%2Fissues%2F30%23issuecomment-671534711&data=02%7C01%7C%7C1aa27b04dcc5462cbb6508d83d60bc54%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326832965695848&sdata=SsSP0Qz5tGe8ZoR3Yiar5D3a5YXAlJoES7t4SFZCMp4%3D&reserved=0, or unsubscribehttps://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABR65P2MKYTB4YL64D2PG2TSABAR5ANCNFSM4PFYCRGQ&data=02%7C01%7C%7C1aa27b04dcc5462cbb6508d83d60bc54%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326832965705845&sdata=dUOMyAnSB%2B8snkAI%2BpS87REdw%2FYq6ouDs5%2F3MN1%2F1zs%3D&reserved=0.

horta commented 4 years ago

Thanks for the benchmark, @eric-czech !

I have found some bootlenecks on the bgen-reader. Here is a small comparison now:

% python pybgen_exec.py run --path=$HOME/carl/merged_487400x220000.bgen
  0%|          | 778/220000 [00:18<1:28:18, 41.37it/s]

% python bgen_reader_exec.py run --path=$HOME/carl/merged_487400x220000.bgen
  0%|          | 483/220000 [00:10<1:22:19, 44.44it/s]

I have stopped more or less after 10 seconds just to have an idea.

I modified your pybgen_exec.py only to use tqdm:

import fire
import tqdm
from pybgen import PyBGEN

def run(path):
    ct = 0
    with PyBGEN(path) as bgen:
        n = bgen.nb_variants
        for info, dosage in tqdm.tqdm(bgen, total=n):
            assert dosage.ndim == 1
            ct += dosage.size
    print(f"Number of entries read: {ct}")

fire.Fire()

I will try to release tonight a version 4.0.5 for bgen-reader with that speed-up. It would be great if you could find time to run the benchmarks again on that new version (I come back here as soon as I release it).

It will come with the Carl API as well, which might be in the state of experimental but I think that is fine for now.

Notice that I've not used bulk reading (nor have implemented it yet), although reading the variants in order using bgen_reader.read_bgen is faster than in random order.

horta commented 4 years ago

@hammer Thanks for the pointer. I didnt know about that library.

I've coded the low-level of bgen in C and at that time I did not find a nice progress bar library for C only. So I coded mine: https://github.com/horta/almosthere It is very humble compared to the one you showed: ASCII based only.

CarlKCarlK commented 4 years ago

Aside:

I have my own, too 😊 A single 45-lines Python function called “_log_in_place". Here is a usage example. Notice the “i+1”. That fixes the off-by-one error Eric found.

    with _log_in_place("metadata", self._verbose) as updater:
        for i in range(self.nsamples):
            if i % 1000 == 0:
                updater("'sample range': part {0:,} of {1:,}".format(i+1, self.nsamples))
            sample_range_memmap[i] = i

Carl

From: Danilo Horta notifications@github.com Sent: Monday, August 10, 2020 3:02 PM To: limix/bgen-reader-py bgen-reader-py@noreply.github.com Cc: Carl Kadie carlk@msn.com; Mention mention@noreply.github.com Subject: Re: [limix/bgen-reader-py] Feature request: bulk distributed access (#30)

@hammerhttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fhammer&data=02%7C01%7C%7C74de41b3873447dfb6f608d83d79075c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326937302180578&sdata=WZVbsv7QfG6QSjZFnVwEgJMoXwGVOEeRYUZP4V2oS1I%3D&reserved=0 Thanks for the pointer. I didnt know about that library.

I've coded the low-level of bgen in C and at that time I did not find a nice progress bar library for C only. So I coded mine: https://github.com/horta/almosthere https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fhorta%2Falmosthere&data=02%7C01%7C%7C74de41b3873447dfb6f608d83d79075c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326937302180578&sdata=J9YQZijOG%2F%2FXAAtc1f9TEMZatyYWBvrUDloorhioLBo%3D&reserved=0 It is very humble compared to the one you showed: ASCII based only.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flimix%2Fbgen-reader-py%2Fissues%2F30%23issuecomment-671611940&data=02%7C01%7C%7C74de41b3873447dfb6f608d83d79075c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326937302190564&sdata=G0FrivExKUsOtQ1DRMMmVfaVAkJYA9xucTSHF1kTTdc%3D&reserved=0, or unsubscribehttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABR65PYV5BIHIW4FEOFXQSTSABU6BANCNFSM4PFYCRGQ&data=02%7C01%7C%7C74de41b3873447dfb6f608d83d79075c%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637326937302190564&sdata=qeFoVB0hyQg1kl0TQRHnHewAHfIZNSPAwbHn7ICpVlk%3D&reserved=0.

horta commented 4 years ago

bgen-reader 4.0.5 has been released: https://pypi.org/project/bgen-reader/

It should be a lot faster now. I will be working on the bulk idea or on the Carl's interface.

eric-czech commented 4 years ago

The Open, Read, and Close methods are @Danilo Hortamailto:danilo.horta@pm.me’s C/C++ code.

Good to know, thanks @CarlKCarlK.

It would be great if you could find time to run the benchmarks again on that new version (I come back here as soon as I release it). bgen-reader 4.0.5 has been released: https://pypi.org/project/bgen-reader/

Beautiful, appreciate the quick turnaround @horta! I'll give that a try again soon.

eric-czech commented 4 years ago

It should be a lot faster now. I will be working on the bulk idea or on the Carl's interface.

Definitely! Here are the times I got with the new code in context with the previous times:

PyBGEN: ~28 minutes (the .bgi file already exists)
bgen_reader 4.0.4: I stopped it after an hour but it was on track to take ~4 hours
bgen_reader 4.0.5: ~21 minutes including metafile construction time
bgen_reader2 in pysnptools: ~34 minutes not including the metadata construction time

Here is the resource utilization profile I saw for bgen_reader 4.0.5 too:

Screen Shot 2020-08-17 at 10 27 09 AM

The low memory usage there is great! I know that seems to imply CPU usage was also 100% but it's actually 100% out of 400% (4 cores on the VM), so it seemed like there wasn't any real benefit to the threading via dask delayed, assuming that's how it still works. I verified by running a big random matrix multiplication which produces a similar graph at 400% CPU utilization across the board. All the related code is in bgen_reader_prof.ipynb.

I'm assuming that the GIL is the culprit with the default Dask scheduler though I'm curious @horta if you did anything special to see higher concurrency when you switched to Dask in the past? I could try with scheduler='processes' but my experience with that is that it usually ends up being pretty close to a wash with scheduler='threads' because of all the serialization for inter-process communication. Let me know if you have any advice.

EDIT: I tried it with the multiprocessing scheduler instead and it was very slow (on track for ~11 hrs) but that's not very fair because I'm asking it to send all the data back to the driver process rather than some reduction over it. It would be great if Dask didn't have this inter-process communication bottleneck -- it keeps coming up!

eric-czech commented 4 years ago

Also out of curiosity @horta I was trying to find some commits related to your changes that made such a huge improvement in 4.0.5. Was there one you think would be good look at to highlight the difference? I was just hoping to get a bit more familiar with how it all works internally. Thanks again for whatever you did!

horta commented 4 years ago

This one: https://github.com/limix/bgen-reader-py/commit/b40fc367bef652b36703ff75e23177c194ddd076 Before, with a dataset having 500k samples, it would read one variant per time. Per time meaning passing a single variant offset to the function read_genotype_partition, called from here: https://github.com/limix/bgen-reader-py/blob/eceda00c6f34fa66882c39e210407e50f5f8ebd8/bgen_reader/_genotype.py#L42

Everytime the function read_genotype_partition is called, a bgen_file is opened (which opens the bgen file, read the header, and ftell). So minimizing that call will minimize that overhead.

There are basically only two overheads that, if minimized, would make everything even faster:

Opening the metafile bgen_metafile
Opening the bgen file bgen_file

If you have a single core to use and want to process the whole bgen file as fast as possible, it would suffice to do something like:

with bgen_metafile(metafile_filepath) as mf:
    with bgen_file(bgen_filepath) as bgen:
        # Loop over variants in sequence

You can do the above very easily by using this package I released last week which will be used by bgen-reader very soon: https://github.com/limix/cbgen-py

I believe Carl's interface try to work along those lines but using memmap from numpy.

horta commented 4 years ago

You will find examples of using cbgen here: https://github.com/limix/cbgen-py/blob/master/cbgen/test/test_bgen.py Or you can wait until we make new releases of bgen-reader.

eric-czech commented 4 years ago

Everytime the function read_genotype_partition is called, a bgen_file is opened (which opens the bgen file, read the header, and ftell). So minimizing that call will minimize that overhead

👍

You can do the above very easily by using this package I released last week which will be used by bgen-reader very soon: https://github.com/limix/cbgen-py

Awesome! Ok, I will give that a spin -- thanks for the examples!

eric-czech commented 4 years ago

Hey @horta I tried using cbgen rather than bgen-reader and everything worked nicely!

As an experiment I was going to try to lift the file open/close logic out of _bgen_file.py#L70-L71 into my own loop but that's a bit out of my depth. I wasn't sure what the offset should be in:

def read_genotype(self, offset: int) -> Genotype:
    gt: CData = lib.bgen_file_open_genotype(self._bgen_file, offset)
    ...
    lib.bgen_genotype_close(gt)

Is there a fairly easy way to do that?

eric-czech commented 4 years ago

Oh and FYI here is a resource profile I had in using cbgen to read a UKB contig:

This is from using a cbgen_reader.py wrapper I made to potentially replace the sgkit-bgen#bgen_reader.py wrapper and I thought the sawtooth memory usage was odd. I didn't see that earlier in https://github.com/limix/bgen-reader-py/issues/30#issuecomment-674923428 when reading the variants in order but I'm not sure what is being "built up" in this case. The wrapper is pre-allocating result arrays so the memory usage shouldn't be so incremental.

Might there be anything in cbgen or bgen_reader that would cause this gradual increase in memory used when reading blocks of variants out-of-order?

horta commented 4 years ago

Hey @eric-czech , the offset for each variant is the result of ftell on the start of the variant block in the bgen file: https://github.com/limix/bgen/blob/a311bcfd32ece93b92d9c2cd9bf964e8d55cb1a6/src/file.c#L145

There is no easy way to get all those offsets. As the bgen file specification stands, you don't know where the variant block start for variant n without knowing where the variant n-1 started (and knowning the size of variant block n-1.

That is one of the reasons index files exist for bgen files.

horta commented 4 years ago

Regarding memory, I don't know. Isn't that just garbage collection by Python? (you can try to use gc.collect() to exclude that out) Also, if that is out-of-order, and assuming that you reading in a uniformly random fashion, why that patteern would be so precise? It looks odd to me too but I've to admit that I don't have much experience on how garbage collection tend to behave. cbgen library does not do anything fancy: allocate numpy arrays, use the C bgen library to copy the requested data to them, and that is over.

eric-czech commented 4 years ago

There is no easy way to get all those offsets. As the bgen file specification stands, you don't know where the variant block start for variant n without knowing where the variant n-1 started (and knowning the size of variant block n-1.

Is it not possible to:

Use the metafile to get the offset to the first variant in a block of variants to be read (block meaning some number of sequential variants)
Do a lib.bgen_file_open_genotype(self._bgen_file, offset) with that offset
Read data for the first variant
"Advance" the open file to the next offset without closing it
Read data for the next variant
... and so on

I have no idea how the "Advance the open file" part would work in cbgen -- is that more difficult than it sounds?

Also, if that is out-of-order, and assuming that you reading in a uniformly random fashion

Mm well what I mean is Dask asks us to fetch some block of adjacent variants (say 100 of them), and there guarantees on which order it will ask for the blocks but they do typically go in order. I.e. for two workers process/thread 1 fetches variants 0 to 100 while process/thread 2 fetches variants 100 to 200 then worker 1 does 200 to 300, etc. "Arbitrary blocks of some size" is probably what I should have called it rather than random.

cbgen library does not do anything fancy: allocate numpy arrays, use the C bgen library to copy the requested data to them, and that is over.

Makes sense. It's probably something in Dask or our wrapper then -- I saw a similar pattern when making a Dask wrapper around pybgen too.

horta commented 4 years ago

I've created an issue to perform "advance the open file". I think it won't change anything: opening variant genotype in sequence should have the same effect as "advancing the open file". Important bit of the C library for that: https://github.com/limix/bgen/blob/a311bcfd32ece93b92d9c2cd9bf964e8d55cb1a6/src/file.c#L145

I indeed perform a fseek each time one opens a variant genotype but I believe the underlying implementation is smart enough to figure out that we are already at the right file position. (I just couldn't find an explicit explanation about that on google)

horta commented 4 years ago

Just to clarify: after you close an openened variant genotype n, the file stream poisition remains at the start position of variant genotype n+1. Which means that subsequently opening variant genotype n+1 and reading it, should have zero overhead regarding file stream position (I'm assuming here that performing fseek that amounts to the current position will have zero overhead.)

eric-czech commented 4 years ago

Just to clarify: after you close an openened variant genotype n, the file stream poisition remains at the start position of variant genotype n+1. Which means that subsequently opening variant genotype n+1 and reading it, should have zero overhead regarding file stream position

Ah ok, thanks @horta.

The biggest killer on reading to a Dask array then, AFAICT, is the lack of sample slicing for chunks. Right now, the way we're reading the data in might say that chunks are of size say 1,000 variants x 10,000 samples, but that still means we have to read all the samples for each variant + chunk (so I'm reading the same variant data upwards of 10 times for UKB).

Do you know if there's some relatively simple way to have cbgen/bgen-reader not deserialize all the unwanted samples into memory? In other words, is it possible to do something like this:

Take a single variant and a subset of adjacent samples to get data for
Seek to the start of the data for the variant
Further seek to the start of the data for the sample block
Scan and read all probabilities until you get to the end of the sample block

Let me know if that doesn't make sense.

tomwhite commented 4 years ago

The biggest killer on reading to a Dask array then, AFAICT, is the lack of sample slicing for chunks.

Another way of tackling this would be to convert to an intermediate chunked format (like zarr). This would be done once, then analysis would be performed on the intermediate format, which would have efficient, chunked read patterns. Perhaps this could be something we recommend for large numbers of samples?

This is the way I am doing it for VCF, which is the only real option as we don't have offset information for the n-th variant, so we need to convert to something like zarr anyway.

eric-czech commented 4 years ago

Another way of tackling this would be to convert to an intermediate chunked format (like zarr).

That's basically what I'm after, a more efficient way to do that conversion. Are you talking about some process for it that doesn't use Dask?

I'd like to think of this as a one-time conversion that we eat the cost of but even now when I'm using ridiculously wide chunks (~1,000 x ~100,000) to avoid some of the repetitive sample reads, it looks it will cost roughly $300 for the vCPU+memory hours (storage aside). That isn't so bad I suppose but I'd like to get those chunks down to more like 10k x 10k, which would hypothetically cost ~5x as much if the write and read workloads are fairly balanced (IO/network bandwidth aside).

tomwhite commented 4 years ago

Are you talking about some process for it that doesn't use Dask?

No, it still uses Dask, but instead of using dask.array.from_array to create a Dask array and then write it to Zarr, use Dask delayed to read ranges of variants in parallel (and all their samples) and write to Zarr directly. This should avoid repetitive sample reads, while still getting chunking in the samples dimension (by setting the Zarr chunks appropriately).

I do something similar for VCF, but it should be simpler for BGEN since you can read variant ranges that align with Zarr chunks, which means you can avoid the concat and rechunk operations that are needed for VCF. See https://github.com/tomwhite/sgkit-vcf/blob/master/sgkit_vcf/vcf_reader.py#L206-L221

eric-czech commented 4 years ago

use Dask delayed to read ranges of variants in parallel (and all their samples) and write to Zarr directly

Ahhh I see what you mean. I would like to think that there is some way to seek between blocks of sample data for different variants in bgen, but I can't imagine how that's possible if compression is on. I'll try a similar bgen_to_zarr function instead.

tomwhite commented 4 years ago

I would like to think that there is some way to seek between blocks of sample data for different variants in bgen, but I can't imagine how that's possible if compression is on.

Exactly. The genotype blocks in BGEN are compressed using zlib or zstd, and so there's no way to avoid decompressing the whole block to only pull out a subset of samples.

horta commented 4 years ago

Just to clarify: after you close an openened variant genotype n, the file stream poisition remains at the start position of variant genotype n+1. Which means that subsequently opening variant genotype n+1 and reading it, should have zero overhead regarding file stream position

Ah ok, thanks @horta.

The biggest killer on reading to a Dask array then, AFAICT, is the lack of sample slicing for chunks. Right now, the way we're reading the data in might say that chunks are of size say 1,000 variants x 10,000 samples, but that still means we have to read all the samples for each variant + chunk (so I'm reading the same variant data upwards of 10 times for UKB).

Do you know if there's some relatively simple way to have cbgen/bgen-reader not deserialize all the unwanted samples into memory? In other words, is it possible to do something like this:
* Take a single variant and a subset of _adjacent_ samples to get data for

* Seek to the start of the data for the variant

* Further seek to the start of the data for the sample block

* Scan and read all probabilities until you get to the end of the sample block
Let me know if that doesn't make sense.

If compressed, the whole structure of probabilities, ploidy, phasedness, missingness, of a single variant that includes all samples has to first me decompressed. So I think the answer to your question is no whenever bgen file is using compression (which I believe will be in most cases.).

If it is not compressed, then it seems possible to just read only the probabilities of the samples your are interested in (https://www.well.ox.ac.uk/~gav/bgen_format/spec/latest.html).

horta commented 4 years ago

Ops, you already figured out the answer. Sorry, didn't read before answering!

limix / bgen-reader-py

Feature request: bulk distributed access #30