sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0
235 stars 32 forks source link

MemoryError: Allocation failed (probably too large) with ped.compute() #1213

Open fouerghi opened 7 months ago

fouerghi commented 7 months ago

Hi,

I am running a big simulation with a million people and I'd like to calculate the genealogical relatedness between them. I tried running sgkit and unfortunately, when I hit the ped.compute() step, I run into this error: MemoryError: Allocation failed (probably too large).

Any advice on how I can bypass this? I tried doing some stuff related to dask but didn't get anything useful.

P.S: I have a lot of unrelated pairs of individuals in the trees and I am only interested in ones that are above a certain threshold. Could that help? Any suggestions are welcome.

jeromekelleher commented 7 months ago

That's a tricky one - I think all our pedigree algorithms are "dense", and aren't really designed for these kinds of large pedigree. I think @timothymillar had some thoughts about how we might implement though?

timothymillar commented 7 months ago

Hi @fouerghi, yes we currently only support dense arrays. From memory, we decided to avoid sparse arrays until there is more standardized support in our dependencies (Xarray and Zarr).

An alternative that I have considered implementing is chunked computation of kinship/relatedness. This would allow you to compute the relationship matrix in chunks rather than all at once. Theoretically, this should let you identify closely related individuals (in each chunk) without holding the whole matrix in memory. Does this sound like a reasonable solution for your use-case?

timothymillar commented 7 months ago

P.S: I have a lot of unrelated pairs of individuals in the trees and I am only interested in ones that are above a certain threshold. Could that help? Any suggestions are welcome.

Is this a single pedigree in which all individuals are connected, or is it a single dataset with multiple independent pedigrees? If it is the latter, you may be able to split it into multiple smaller pedigrees and perform your computations on those. The pedigree_sel function should be helpful here.

fouerghi commented 7 months ago

Hi Jerome and Tim,

Thank you for getting back to me. Could you elaborate more on the chunking part? I am interested in identifying (full sibs, first cousins, second cousins, third cousins and 4th cousins). Would it work for these cases?

I have a single pedigree where all individuals are connected.

timothymillar commented 7 months ago

Could you elaborate more on the chunking part?

It's a strategy to compute large arrays/matrices in parts which can avoid needing to hold the entire matrix in memory at once. We haven't implemented it for pedigree based kinship estimation yet because it's quite complex. I have written a short script demonstrating how you could do this manually, but it will be quite slow and may require some tuning of the chunk size: https://gist.github.com/timothymillar/2bb2a4521daf948d76808d60eccb390a

I am interested in identifying (full sibs, first cousins, second cousins, third cousins and 4th cousins). Would it work for these cases?

In that case you may not actually need to compute genealogical kinship/relatedness. There are multiple relationship types that can give similar estimates of genealogical relatedness. So it might be better to try and pull these relationships directly from the pedigree. For example, you can identify sets of full sibs based on them having identical parents. But the more distant relations will be more complex.

fouerghi commented 7 months ago

Hi Tim,

For the script you linked above, it uses pedigree_sel() function, but it seems that it's in the unreleased version of sgkit. Is there another function I could use?

timothymillar commented 7 months ago

Hi @fouerghi, sorry for my slow response. I'm afraid we don't have an equivalent function in the current 0.7 release. Hopefully the next release won't be too far away.