Memory - Githubissues

DCousminer commented 4 years ago

I'm wondering if this can be correct-- I got the following R error:

Error: cannot allocate vector of size 31092.0 Gb

Does hyprcoloc really require so much memory? How do you deal with this in a cluster environment?

DCousminer commented 4 years ago

To update, I asked our systems engineer to take a look. He says:

"The program is literally trying to do just as the error explains. I ran it without a scheduler and on a node without cgroup controls so there was nothing to stop it.

Strace output from the moment before the error. Notice the mmap call:

mmap(NULL, 33384745127936, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
write(2, "Error: cannot allocate vector of"..., 49Error: cannot allocate vector of size 31092.0 Gb

In the strace, I see it read this file:

/3.6/hyprcoloc/R/hyprcoloc.rdb

Then try to do some memory mappings, then hit the error. You’ll probably need the developer to take a look."

Hope that helps. Thanks.
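
As a quick sanity check, the size of the mmap request in the strace can be converted into the units R uses in its error message. This is only a sketch of the arithmetic; the variable names are illustrative, and the assumption that the vector holds 8-byte doubles is mine, not something the strace confirms.

```r
# Size requested by the mmap call in the strace output, in bytes.
mmap_bytes <- 33384745127936

# R reports allocation sizes in units of 1024^3 bytes, labelled "Gb".
mmap_gb <- mmap_bytes / 1024^3
round(mmap_gb, 1)   # 31092, the size quoted in the error message

# Assuming 8-byte double-precision values, the approximate number of
# elements R was asked to store (about 4.17e12).
n_values <- mmap_bytes / 8
n_values
```

So the error message and the system call agree: R really did ask the operating system for roughly 31092 Gb in a single allocation.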

jrs95 commented 4 years ago

Hi,

Sorry you are running into difficulties.

To get to the bottom of what is going on here, would you be able to answer the following questions:

How many traits are you trying to analyse?
How big is the genomic region you are analysing?
Are you trying to account for trait correlation?

I'm copying in @cnfoley, as he will be able to help here.

Best wishes,

James

cnfoley commented 4 years ago

Hi,

Very glad to hear you are using HyPrColoc; hopefully we can help a little here.

First off, if possible, it would be great to get answers to James' questions.

Some guidance from experience: in situations similar to yours, the problem has (pretty much always) been that there are 'too many' SNPs in the genomic region for R to cope with. A similar outcome will be experienced if there are 'too many' traits. To give you an idea of what 'too many' might mean for HyPrColoc: the algorithm has no memory problems when computing colocalization across 1000 traits in a genomic region of 1000 SNPs (see the package vignette for more details). Hence 'too many' is somewhere beyond 1000 traits and 1000 SNPs. In this example, HyPrColoc takes in two input variables that are quite memory expensive:

- "effect.est": matrix of estimated regression coefficients, where columns denote traits and rows denote variants.
- "effect.se": matrix of standard errors corresponding to the matrix of estimated regression coefficients.

In our example, therefore, these matrices contain (a) the regression coefficients and (b) the standard errors for each of the 1000 SNPs (rows of the matrix) and 1000 traits (columns of the matrix), i.e. each is a matrix of size 1000 x 1000. In general, R is OK with this. However, if we now imagine analysing 1 million traits in a region of 1 million SNPs, we have a problem: we do not have enough memory to store such a large object. For example, typing the following into R gives an error (similar to yours):

test.matrix = matrix(1, nrow = 1e6, ncol = 1e6)
Error: cannot allocate vector of size 7450.6 Gb
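
The size in that error can be worked out in advance, without trying to allocate anything. A minimal sketch, assuming a dense numeric matrix of 8-byte doubles; `matrix_size_gb` is a hypothetical helper written for this illustration and is not part of HyPrColoc:

```r
# Estimate the memory needed for a dense numeric (double) matrix
# without allocating it. Each double takes 8 bytes; R reports sizes
# in units of 1024^3 bytes, which it labels "Gb".
matrix_size_gb <- function(n_row, n_col, bytes_per_value = 8) {
  n_row * n_col * bytes_per_value / 1024^3
}

small <- matrix_size_gb(1e3, 1e3)  # 1000 SNPs x 1000 traits: under 0.01 Gb
large <- matrix_size_gb(1e6, 1e6)  # 1 million x 1 million: about 7450.6 Gb
c(small = small, large = large)
```

Running a check like this on the dimensions of your own inputs before calling the analysis gives an early warning that a region is too large.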

My guess is that the genomic region in your analysis is quite large?

Some people have tried to analyse very large genomic regions (e.g. a whole chromosome), and I believe this is not a good idea for a couple of reasons: (i) memory issues, as above, although in theory we can find a workaround for these; and, more importantly, (ii) the larger the genomic window, the more allelic heterogeneity (i.e. more causal variants per trait) there is likely to be. On point (ii), analysing a genomic region containing multiple independent LD blocks is far more computationally expensive than chunking this large region into multiple distinct (non-overlapping) LD blocks, as we recommend in the HyPrColoc paper; the chunking can be done using the method proposed here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4731402/. We believe that chunking the region into distinct LD blocks should (hopefully) not have a big impact on the interpretation of the colocalization results. Moreover, as HyPrColoc searches for clusters of traits which share a single causal variant, it should perform better in these LD-defined regions.
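
To make the chunking idea concrete, here is a rough sketch of splitting the two input matrices into non-overlapping blocks by SNP position. It is not the method from the linked paper: `split_by_block`, `pos` (base-pair position of each SNP) and `block_starts` (sorted block start positions, e.g. taken from a precomputed LD-block table) are hypothetical names introduced for illustration, and the data are simulated.

```r
# Split SNP-by-trait matrices into non-overlapping blocks by position.
# `pos` and `block_starts` are hypothetical inputs, not HyPrColoc arguments.
split_by_block <- function(effect.est, effect.se, pos, block_starts) {
  stopifnot(nrow(effect.est) == length(pos),
            identical(dim(effect.est), dim(effect.se)))
  block_id <- findInterval(pos, block_starts)  # 0 = before the first block
  keep <- block_id > 0
  idx <- split(which(keep), block_id[keep])
  lapply(idx, function(i) {
    list(effect.est = effect.est[i, , drop = FALSE],
         effect.se  = effect.se[i, , drop = FALSE])
  })
}

# Toy data: 10 SNPs (rows), 6 traits (columns), blocks starting at 1, 400, 800.
set.seed(1)
betas <- matrix(rnorm(60), nrow = 10, ncol = 6,
                dimnames = list(paste0("snp", 1:10), paste0("T", 1:6)))
ses <- matrix(runif(60, 0.05, 0.2), nrow = 10, ncol = 6,
              dimnames = dimnames(betas))
pos <- seq(100, 1000, by = 100)
blocks <- split_by_block(betas, ses, pos, block_starts = c(1, 400, 800))

# Number of SNPs that ended up in each block.
block_sizes <- sapply(blocks, function(b) nrow(b$effect.est))
block_sizes
```

Each element of `blocks` holds a much smaller pair of matrices, which can then be analysed one block at a time by passing its `effect.est` and `effect.se` to the colocalization call.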

Please let me or James know if you would like any more detail on something, or if this does not fix the issue you are experiencing.

Good luck and best wishes,

Chris

DCousminer commented 4 years ago

Gotcha-- yes, my genomic region was very large and I'm working with 6 traits. I will cut it down and try again. Thanks!

DCousminer commented 4 years ago

Works great on a smaller region.

jrs95 / hyprcoloc

Memory #5