Variant info performance seems slow as # of variants goes to 1M+

CarlKCarlK commented 4 years ago

Thanks for creating bgen-reader. It is great! [This "issue" might be a "won't fix". That is fine. There are workarounds. - Carl]

I'm working with a group that needs to access bgen files with 1M+ variants. Some operations that could be almost instantaneous are taking 15 to 30 seconds (and as the # of variants grows, even longer).

Here is my test file: "1x1000000.bgen". It is 1 individual x 1,000,000 variants of random values. [It was created by generating a .gen file and then using "qctool ... -bgen-bits 8 -bgen-compression zlib" to convert to .bgen. Anyone can use this file for any purpose.]

Here are some examples with timings and comments.

In:

%%time

/mnt/m/qctool/build/release/qctool_v2.0.7 -g /mnt/m/deldir/1x1000000.gen -s /mnt/m/deldir/1x1000000.sample -og /mnt/m/deldir/1x1000000.bgen -bgen-bits 8 -bgen-compression zlib

from bgen_reader import read_bgen filename = r'm:\deldir\1x1000000.bgen' bgen = read_bgen(filename,verbose=True)

Out:

We will create the metafile m:\deldir\1x1000000.bgen.metadata. This file will speed up further reads and only need to be created once. So, please, bear with me. Mapping variants: 100%|███████████████████████████████████████████████████| 1000000/1000000 [00:13<00:00, 73777.15it/s] Wall time: 18.9 s

That was as expected.

In:

%%time id_list = bgen['variants']['id'] len(id_list)

Out:

Wall time: 19.2 s 1000000

That is slower that expected. If the metadata file contained the ids as, e.g., a numpy memmap or an npz, returning 1M values would take almost no time. Asking for chrom has about the same timing.

In:

%%time chrom_list = bgen['variants']['chrom'] len(chrom_list)

Out:

Wall time: 16.3 s 1000000

Happily, getting a genotype is nice and fast!

In:

%%time bgen["genotype"][500000].compute()['probs']

Out:

Wall time: 140 ms array([[0.65490196, 0.2 , 0.14509804]])

Finally, if I reopen the bgen file. It needs to map the variants again.

In:

%%time del bgen bgen = read_bgen(filename,verbose=True)

Out:

Mapping variants: 100%|███████████████████████████████████████████████████| 1000000/1000000 [00:14<00:00, 67933.42it/s] Wall time: 15.4 s

My guess is that the variant information is not included in the metadata file. A fix would be to add it. This would allow bgen-reader to work super nicely on bigger and bigger files.

Carl

horta commented 4 years ago

Hi Carl, I've just tried to download it. It says

This item might not exist or is no longer available
This item might have been deleted, expired, or you might not have permission to view it. Contact the owner of this item for more information.

Could you have a look into it?

CarlKCarlK commented 4 years ago

Sorry about that. Please try this link: https://1drv.ms/u/s!AkoPP4cC5J64tYpDa6YSajjAP8OJ4w?e=5MisNT (I also edited the link above)

horta commented 4 years ago

Thank you! I'm working on the version 4.0.x of bgen C library: https://github.com/limix/bgen/tree/develop It is a refactoring of the code base and removal of deprecated functions/structs while I check what might be slowing it down.

horta commented 4 years ago

Hi @CarlKCarlK , I've sorted out slowness for the C library: the C bit now takes ~6% of the total time. (if it where for the C library only to do the job, it would tike 0.7 seconds on my machine, yours seems faster by the way)

~56% of the total time is spent in a very simple python for-loop:

    def read_partition(self, index: int):
        partition = lib.bgen_metafile_read_partition(self._bgen_metafile, index)
        if partition == ffi.NULL:
            raise RuntimeError(f"Could not read partition {partition}.")

        nvariants = lib.bgen_partition_nvariants(partition)
        variants = []
        for i in range(nvariants):
            variant = lib.bgen_partition_get_variant(partition, i)
            id_ = create_string(variant[0].id)
            rsid = create_string(variant[0].rsid)
            chrom = create_string(variant[0].chrom)
            pos = variant[0].position
            nalleles = variant[0].nalleles
            allele_ids = _read_allele_ids(variant[0].allele_ids, variant[0].nalleles)
            offset = variant[0].genotype_offset
            variants.append([id_, rsid, chrom, pos, nalleles, allele_ids, offset])

        df = DataFrame(
            variants,
            columns=["id", "rsid", "chrom", "pos", "nalleles", "allele_ids", "vaddr"],
            dtype=str,
        )

        df["pos"] = df["pos"].astype("uint32")
        df["nalleles"] = df["nalleles"].astype("uint16")
        df["vaddr"] = df["vaddr"].astype("uint64")

        index_offset = self.partition_size * index
        df.index = range(index_offset, index_offset + nvariants)

        lib.bgen_partition_destroy(partition)

        return df

And the rest is spent by the pandas dataframe itself (which I cannot do much about).

Sorting out the rest is not difficult: I can code that for-loop in C. I'm afraid that I won't be able to do this task this week as I have spent too many days on that and not on my paid job.

I plan to get back to this final task this weekend or the next weekend. Please, bear with me =)

CarlKCarlK commented 4 years ago

Great! Good work.

From: Danilo Hortamailto:notifications@github.com Sent: Tuesday, March 17, 2020 6:41 PM To: limix/bgen-reader-pymailto:bgen-reader-py@noreply.github.com Cc: Carl Kadiemailto:carlk@msn.com; Mentionmailto:mention@noreply.github.com Subject: Re: [limix/bgen-reader-py] Variant info performance seems slow as # of variants goes to 1M+ (#23)

Hi @CarlKCarlKhttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FCarlKCarlK&data=02%7C01%7C%7C5f298c5dc19c4d984ad908d7cadd7d25%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637200924935338421&sdata=638v%2Fl47%2B5Xd56ZsA6BYTzXPx03otbQkdaIs9nHXB%2Bw%3D&reserved=0 , I've sorted out slowness for the C library: the C bit now takes ~6% of the total time. (if it where for the C library only to do the job, it would tike 0.7 seconds on my machine, yours seems faster by the way)

~56% of the total time is spent in a very simple python for-loop:

def read_partition(self, index: int):

    partition = lib.bgen_metafile_read_partition(self._bgen_metafile, index)

    if partition == ffi.NULL:

        raise RuntimeError(f"Could not read partition {partition}.")

    nvariants = lib.bgen_partition_nvariants(partition)

    variants = []

    for i in range(nvariants):

        variant = lib.bgen_partition_get_variant(partition, i)

        id_ = create_string(variant[0].id)

        rsid = create_string(variant[0].rsid)

        chrom = create_string(variant[0].chrom)

        pos = variant[0].position

        nalleles = variant[0].nalleles

        allele_ids = _read_allele_ids(variant[0].allele_ids, variant[0].nalleles)

        offset = variant[0].genotype_offset

        variants.append([id_, rsid, chrom, pos, nalleles, allele_ids, offset])

    df = DataFrame(

        variants,

        columns=["id", "rsid", "chrom", "pos", "nalleles", "allele_ids", "vaddr"],

        dtype=str,

    )

    df["pos"] = df["pos"].astype("uint32")

    df["nalleles"] = df["nalleles"].astype("uint16")

    df["vaddr"] = df["vaddr"].astype("uint64")

    index_offset = self.partition_size * index

    df.index = range(index_offset, index_offset + nvariants)

    lib.bgen_partition_destroy(partition)

    return df

And the rest is spent by the pandas dataframe itself (which I cannot do much about).

Sorting out the rest is not difficult: I can code that for-loop in C. I'm afraid that I won't be able to do this task this week as I have spent too many days on that and not on my paid job.

I plan to get back to this final task this weekend or the next weekend. Please, bear with me =)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flimix%2Fbgen-reader-py%2Fissues%2F23%23issuecomment-600381394&data=02%7C01%7C%7C5f298c5dc19c4d984ad908d7cadd7d25%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637200924935338421&sdata=ejpf0ETBDtUdLYkEp86nZZNHnt3sBrmCVGlmx%2FzJF%2Bg%3D&reserved=0, or unsubscribehttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABR65P33RZBGO45ERBYI4Y3RIAREZANCNFSM4K66ZQWA&data=02%7C01%7C%7C5f298c5dc19c4d984ad908d7cadd7d25%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637200924935348426&sdata=0v6STUD41ncMGItfewtdySBWmJhJ3jiYOKYz1Qn0Fzo%3D&reserved=0.

CarlKCarlK commented 4 years ago

By the way, On Friday evening, I sent an email to what I think is your EBI email. (My email is carlk msn.com)

From: Carl KADIEmailto:carlk@msn.com Sent: Tuesday, March 17, 2020 7:05 PM To: limix/bgen-reader-pymailto:reply@reply.github.com; limix/bgen-reader-pymailto:bgen-reader-py@noreply.github.com Cc: Mentionmailto:mention@noreply.github.com Subject: RE: [limix/bgen-reader-py] Variant info performance seems slow as # of variants goes to 1M+ (#23)

Great! Good work.

From: Danilo Hortamailto:notifications@github.com Sent: Tuesday, March 17, 2020 6:41 PM To: limix/bgen-reader-pymailto:bgen-reader-py@noreply.github.com Cc: Carl Kadiemailto:carlk@msn.com; Mentionmailto:mention@noreply.github.com Subject: Re: [limix/bgen-reader-py] Variant info performance seems slow as # of variants goes to 1M+ (#23)

Hi @CarlKCarlKhttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FCarlKCarlK&data=02%7C01%7C%7C5f298c5dc19c4d984ad908d7cadd7d25%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637200924935338421&sdata=638v%2Fl47%2B5Xd56ZsA6BYTzXPx03otbQkdaIs9nHXB%2Bw%3D&reserved=0 , I've sorted out slowness for the C library: the C bit now takes ~6% of the total time. (if it where for the C library only to do the job, it would tike 0.7 seconds on my machine, yours seems faster by the way)

~56% of the total time is spent in a very simple python for-loop:

def read_partition(self, index: int):

    partition = lib.bgen_metafile_read_partition(self._bgen_metafile, index)

    if partition == ffi.NULL:

        raise RuntimeError(f"Could not read partition {partition}.")

    nvariants = lib.bgen_partition_nvariants(partition)

    variants = []

    for i in range(nvariants):

        variant = lib.bgen_partition_get_variant(partition, i)

        id_ = create_string(variant[0].id)

        rsid = create_string(variant[0].rsid)

        chrom = create_string(variant[0].chrom)

        pos = variant[0].position

        nalleles = variant[0].nalleles

        allele_ids = _read_allele_ids(variant[0].allele_ids, variant[0].nalleles)

        offset = variant[0].genotype_offset

        variants.append([id_, rsid, chrom, pos, nalleles, allele_ids, offset])

    df = DataFrame(

        variants,

        columns=["id", "rsid", "chrom", "pos", "nalleles", "allele_ids", "vaddr"],

        dtype=str,

    )

    df["pos"] = df["pos"].astype("uint32")

    df["nalleles"] = df["nalleles"].astype("uint16")

    df["vaddr"] = df["vaddr"].astype("uint64")

    index_offset = self.partition_size * index

    df.index = range(index_offset, index_offset + nvariants)

    lib.bgen_partition_destroy(partition)

    return df

And the rest is spent by the pandas dataframe itself (which I cannot do much about).

Sorting out the rest is not difficult: I can code that for-loop in C. I'm afraid that I won't be able to do this task this week as I have spent too many days on that and not on my paid job.

I plan to get back to this final task this weekend or the next weekend. Please, bear with me =)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flimix%2Fbgen-reader-py%2Fissues%2F23%23issuecomment-600381394&data=02%7C01%7C%7C5f298c5dc19c4d984ad908d7cadd7d25%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637200924935338421&sdata=ejpf0ETBDtUdLYkEp86nZZNHnt3sBrmCVGlmx%2FzJF%2Bg%3D&reserved=0, or unsubscribehttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABR65P33RZBGO45ERBYI4Y3RIAREZANCNFSM4K66ZQWA&data=02%7C01%7C%7C5f298c5dc19c4d984ad908d7cadd7d25%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637200924935348426&sdata=0v6STUD41ncMGItfewtdySBWmJhJ3jiYOKYz1Qn0Fzo%3D&reserved=0.

CarlKCarlK commented 4 years ago

[Just followed up in a private email.- Carl]

horta commented 4 years ago

It has improved. Lets see now.

limix / bgen-reader-py

Variant info performance seems slow as # of variants goes to 1M+ #23

/mnt/m/qctool/build/release/qctool_v2.0.7 -g /mnt/m/deldir/1x1000000.gen -s /mnt/m/deldir/1x1000000.sample -og /mnt/m/deldir/1x1000000.bgen -bgen-bits 8 -bgen-compression zlib