Closed. CarlKCarlK closed this issue 4 years ago.
Hi Carl, I've just tried to download it. It says
This item might not exist or is no longer available
This item might have been deleted, expired, or you might not have permission to view it. Contact the owner of this item for more information.
Could you have a look into it?
Sorry about that. Please try this link: https://1drv.ms/u/s!AkoPP4cC5J64tYpDa6YSajjAP8OJ4w?e=5MisNT (I also edited the link above)
Thank you! I'm working on version 4.0.x of the bgen C library: https://github.com/limix/bgen/tree/develop It refactors the code base and removes deprecated functions/structs while I check what might be slowing it down.
Hi @CarlKCarlK , I've sorted out the slowness in the C library: the C bit now takes ~6% of the total time. (If it were for the C library alone to do the job, it would take 0.7 seconds on my machine; yours seems faster, by the way.)
~56% of the total time is spent in a very simple Python for-loop:
```python
def read_partition(self, index: int):
    partition = lib.bgen_metafile_read_partition(self._bgen_metafile, index)
    if partition == ffi.NULL:
        # Note: partition is NULL here, so report the index instead.
        raise RuntimeError(f"Could not read partition {index}.")
    nvariants = lib.bgen_partition_nvariants(partition)
    variants = []
    for i in range(nvariants):
        variant = lib.bgen_partition_get_variant(partition, i)
        id_ = create_string(variant[0].id)
        rsid = create_string(variant[0].rsid)
        chrom = create_string(variant[0].chrom)
        pos = variant[0].position
        nalleles = variant[0].nalleles
        allele_ids = _read_allele_ids(variant[0].allele_ids, variant[0].nalleles)
        offset = variant[0].genotype_offset
        variants.append([id_, rsid, chrom, pos, nalleles, allele_ids, offset])
    df = DataFrame(
        variants,
        columns=["id", "rsid", "chrom", "pos", "nalleles", "allele_ids", "vaddr"],
        dtype=str,
    )
    df["pos"] = df["pos"].astype("uint32")
    df["nalleles"] = df["nalleles"].astype("uint16")
    df["vaddr"] = df["vaddr"].astype("uint64")
    index_offset = self.partition_size * index
    df.index = range(index_offset, index_offset + nvariants)
    lib.bgen_partition_destroy(partition)
    return df
```
And the rest is spent by the pandas dataframe itself (which I cannot do much about).
Sorting out the rest is not difficult: I can code that for-loop in C. I'm afraid that I won't be able to do this task this week as I have spent too many days on that and not on my paid job.
I plan to get back to this final task this weekend or the next weekend. Please, bear with me =)
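Part of the per-row cost in the loop above comes from building a list of lists, constructing the DataFrame with `dtype=str`, and then re-casting three columns back to integers. As an illustrative alternative (a sketch only, not the fix Danilo went on to implement; `build_partition_frame` and its `records` input are hypothetical stand-ins for the values the C library yields), the numeric columns can be filled into preallocated typed numpy arrays so pandas never round-trips them through strings:

```python
import numpy as np
import pandas as pd

def build_partition_frame(records, partition_size, index):
    """Sketch: build the partition DataFrame from typed numpy columns.

    `records` stands in for the variants read from the C library:
    (id, rsid, chrom, pos, nalleles, allele_ids, vaddr) tuples.
    """
    n = len(records)
    # Preallocate the numeric columns with their final dtypes.
    pos = np.empty(n, dtype=np.uint32)
    nalleles = np.empty(n, dtype=np.uint16)
    vaddr = np.empty(n, dtype=np.uint64)
    ids, rsids, chroms, allele_ids = [], [], [], []
    for i, (id_, rsid, chrom, p, na, al, off) in enumerate(records):
        ids.append(id_)
        rsids.append(rsid)
        chroms.append(chrom)
        pos[i] = p
        nalleles[i] = na
        allele_ids.append(al)
        vaddr[i] = off
    # No dtype=str detour and no .astype() re-casts afterwards.
    df = pd.DataFrame(
        {
            "id": ids,
            "rsid": rsids,
            "chrom": chroms,
            "pos": pos,
            "nalleles": nalleles,
            "allele_ids": allele_ids,
            "vaddr": vaddr,
        }
    )
    offset = partition_size * index
    df.index = range(offset, offset + n)
    return df
```

The same dict-of-columns construction would apply however the rows are produced; the remaining per-row Python work is what moving the loop into C removes.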
Great! Good work.
By the way, on Friday evening I sent an email to what I think is your EBI address. (My email is carlk@msn.com.)
[Just followed up in a private email. - Carl]
It has improved. Let's see now.
Thanks for creating bgen-reader. It is great! [This "issue" might be a "won't fix". That is fine. There are workarounds. - Carl]
I'm working with a group that needs to access bgen files with 1M+ variants. Some operations that could be almost instantaneous are taking 15 to 30 seconds (and as the # of variants grows, even longer).
Here is my test file: "1x1000000.bgen". It is 1 individual x 1,000,000 variants of random values. [It was created by generating a .gen file and then using "qctool ... -bgen-bits 8 -bgen-compression zlib" to convert to .bgen. Anyone can use this file for any purpose.]
Here are some examples with timings and comments.
In:
Out:
That was as expected.
In:
Out:
That is slower than expected. If the metadata file contained the ids as, e.g., a numpy memmap or an npz, returning 1M values would take almost no time. Asking for chrom has about the same timing.
In:
Out:
Happily, getting a genotype is nice and fast!
In:
Out:
Finally, if I reopen the bgen file, it needs to map the variants again.
In:
Out:
My guess is that the variant information is not included in the metadata file. A fix would be to add it. This would allow bgen-reader to work super nicely on bigger and bigger files.
Carl