heche-psb / wgd

wgd v2: a suite of tools to uncover and date ancient polyploidy and whole-genome duplication
https://wgdv2.readthedocs.io/en/latest/
GNU General Public License v3.0

How to limit memory used by wgd peak? #22

Closed mankiddyman closed 3 months ago

mankiddyman commented 4 months ago

Hello,

I am able to run the wgd pipeline on my local cluster up to wgd viz, but at wgd peak the job always terminates due to high memory consumption. I have attached the .sh file I use to submit the job, the error message from LSF (job_error_msg.txt), and the stderr and stdout of the job (Drosera_aliciae_wgd - Copy.%j). files.zip

Is there a way to limit the memory consumption or otherwise troubleshoot this problem?

heche-psb commented 4 months ago

Hi, thanks for your interest in wgd v2! I did a quick test, which showed that the size of the Ks data file is the determinant of the (maximum) occupied memory, while neither the number of EM iterations nor the number of initializations affects the maximum memory usage. I used two Ks data files, of Eriobotrya japonica and Vigna angularis respectively, whose Ks distributions are embedded in the plots, and I used the memory_profiler module to sample the memory usage every 0.1 s. As you can see below, the run of Eriobotrya japonica reached a maximum occupied memory of about 1400 MiB and took around 480 s, while the run of Vigna angularis reached a maximum of around 500 MiB and took around 90 s. The second run of Vigna angularis, under the setting --em_iter 10 and --n_init 10, reached the same maximum of around 500 MiB, confirming that the number of EM iterations or initializations has a trivial impact on the maximum occupied memory.

If you are dealing with very recent WGD events (which I suppose might be your case), for which the number of anchor pairs can exceed, say, 20,000, the memory demand will indeed be very high due to factors including the size of the pandas dataframe, perhaps a few GB. One possible workaround is to delete the unused columns in the Ks dataframe, for instance "N", "S", "t", "g1", "g2", "dN/dS", so that the initial occupied memory (and later that of the function fit_apgmm_guide) is reduced to some extent. But if the number of anchor pairs is very large, the maximum occupied memory will be accordingly high anyway, because of factors such as the imported EM algorithm, one of the largest memory consumers. Could you try again with a lighter Ks data file and see how it works?

(Attached memory profiles: run 1 — Eriobotrya japonica; run 2 — Vigna angularis, default settings; run 3 — Vigna angularis with --em_iter 10 --n_init 10)
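
For reference, a minimal sketch of how the peak memory of a wgd peak run could be sampled every 0.1 s with memory_profiler, as described above. The wgd peak arguments and file names here are placeholders, not the exact commands used in the test:

```python
# Sketch: sample the memory of an external `wgd peak` run every 0.1 s
# using memory_profiler, then report the maximum observed value.
import subprocess
from memory_profiler import memory_usage

# Launch wgd peak as a child process (arguments are illustrative placeholders).
proc = subprocess.Popen(["wgd", "peak", "ks_distribution.tsv", "-o", "wgd_peak"])

# memory_usage polls the child's memory at the given interval until it exits
# and returns a list of samples in MiB.
samples = memory_usage(proc, interval=0.1)
print(f"peak memory: {max(samples):.1f} MiB over roughly {len(samples) * 0.1:.0f} s")
```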
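
And a minimal sketch of trimming the unused columns before rerunning wgd peak, assuming the Ks data file is a tab-separated table with the gene-pair identifier in the first column; the file names are placeholders:

```python
# Sketch: drop columns that are not needed for peak finding from a Ks data file
# to lower the memory footprint of the loaded dataframe.
import pandas as pd

ks_file = "ks_distribution.tsv"  # placeholder path to your Ks data file

df = pd.read_csv(ks_file, sep="\t", index_col=0)

# Columns suggested above as unused for peak finding; drop only those present.
unused = ["N", "S", "t", "g1", "g2", "dN/dS"]
df = df.drop(columns=[c for c in unused if c in df.columns])

df.to_csv("ks_distribution.trimmed.tsv", sep="\t")
```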