malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License

RAM use for Af diversity_stats #498

Closed (jonbrenas closed this 7 months ago)

jonbrenas commented 7 months ago

For some reason, running af.diversity_stats uses an enormous amount of RAM (at least on Colab and Datalab). It crashes even my Pro sessions with 12GB of RAM. I tried:

    af1.diversity_stats(
        cohorts="cohort_admin2_year",
        cohort_size=10,
        region="3RL:1,000,000-2,000,000",
        sample_query="country == 'Gabon'",
        site_mask='funestus',
        site_class="CDS_DEG_4"
    )

and the same for Ghana. Gabon has 40 samples in one cohort and Ghana has 31 + 29 samples in two cohorts; both fail in the same way.

Trying the same thing with Ag works fine, although it also crashes due to RAM use if site_mask is dropped. I tested on Gabon with 5 cohorts, including 3 that have more than 60 samples.
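To put a number on "enormous", one option is to wrap the call in a small helper that reports peak traced memory. The following is only a sketch using the standard-library tracemalloc module. The helper name peak_memory_mib and the toy allocate function are hypothetical and not part of malariagen_data. tracemalloc only sees allocations made through Python's allocators, so it may undercount memory held in native buffers.

```python
import tracemalloc


def peak_memory_mib(func, *args, **kwargs):
    """Call func and return (result, peak traced memory in MiB).

    Hypothetical helper for illustration; not part of malariagen_data.
    """
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if func raises.
        tracemalloc.stop()
    return result, peak / 2**20


def allocate(n_bytes):
    """Toy stand-in for an expensive call: allocate n_bytes at once."""
    return bytearray(n_bytes)


_, small_peak = peak_memory_mib(allocate, 1 * 2**20)
_, large_peak = peak_memory_mib(allocate, 16 * 2**20)
print(f"small: {small_peak:.1f} MiB, large: {large_peak:.1f} MiB")
```

Assuming af1 is the API object used above, the same helper could in principle be called as peak_memory_mib(af1.diversity_stats, cohorts="cohort_admin2_year", ...) with the keyword arguments from the report.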

alimanfoo commented 7 months ago

Thanks Jon, that shouldn't be happening, especially when you're only providing a relatively small genome region for the diversity computation.

alimanfoo commented 7 months ago

The higher memory use is triggered by the site_class parameter:

[image]
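One way to narrow down which parameter triggers the extra memory is to repeat the call with one optional keyword argument dropped at a time and compare peaks. The sketch below illustrates the idea with a hypothetical stand-in function (fake_stats) rather than the real diversity_stats. The helper name peak_by_dropped_kwarg is also an assumption, and this is not necessarily how the parameter was identified here. If the library caches results between calls, later runs may look cheaper than they are, so a fresh session per run would be safer.

```python
import tracemalloc


def peak_by_dropped_kwarg(func, base_kwargs, optional):
    """Map each dropped kwarg name (None = nothing dropped) to peak MiB."""
    peaks = {}
    for dropped in [None, *optional]:
        kwargs = {k: v for k, v in base_kwargs.items() if k != dropped}
        tracemalloc.start()
        try:
            func(**kwargs)
            _, peak = tracemalloc.get_traced_memory()
        finally:
            tracemalloc.stop()
        peaks[dropped] = peak / 2**20
    return peaks


def fake_stats(region, site_mask=None, site_class=None):
    """Hypothetical stand-in: uses 8x more memory when site_class is set."""
    size = 1 * 2**20
    if site_class is not None:
        size *= 8
    return len(bytearray(size))


peaks = peak_by_dropped_kwarg(
    fake_stats,
    {"region": "r", "site_mask": "m", "site_class": "c"},
    optional=["site_mask", "site_class"],
)
for name, mib in peaks.items():
    print(f"dropped={name!r}: {mib:.1f} MiB")
```

In this toy run the peak falls only when site_class is dropped, which is the kind of pattern that would point at that parameter.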

alimanfoo commented 7 months ago

Hi @jonbrenas, just to mention I have a solution to this over in #501.