heche-psb / wgd

wgd v2: a suite of tools to uncover and date ancient polyploidy and whole-genome duplication
https://wgdv2.readthedocs.io/en/latest/
GNU General Public License v3.0

How to limit memory used by wgd peak? #22

Closed mankiddyman closed 3 months ago

mankiddyman commented 4 months ago

Hello,

I am able to run the wgd pipeline on my local cluster up to wgd viz, but at wgd peak the job always terminates due to high memory consumption. I have attached the .sh file I use to submit the job, the error message from LSF (job_error_msg.txt), and the stderr and stdout of the job (Drosera_aliciae_wgd - Copy.%j). files.zip

Is there a way to limit the memory consumption or otherwise troubleshoot this problem?

heche-psb commented 4 months ago

Hi, thanks for your interest in wgd v2! I did a quick test, which showed that the size of the Ks data file is the determinant of the (maximum) occupied memory, while neither the number of EM iterations nor the number of initializations affects the maximum memory usage. I used two Ks data files, of Eriobotrya japonica and Vigna angularis respectively, whose Ks distributions are embedded in the plots, and I used the memory_profiler module to sample the memory usage every 0.1 s. As you can see below, the run of Eriobotrya japonica reached a maximum occupied memory of about 1400 MiB and took around 480 s, while the run of Vigna angularis reached a maximum of around 500 MiB and took around 90 s. The second run of Vigna angularis, under the setting --em_iter 10 and --n_init 10, reached the same maximum of around 500 MiB, confirming that the number of EM iterations or initializations has a trivial impact on the maximum occupied memory.

If you are dealing with very recent WGD events (which I suppose might be your case), for which the number of anchor pairs can exceed, say, 20,000, the memory demand will indeed be very high due to factors including the size of the pandas dataframe, perhaps a few GB. One possible workaround is to delete the unused columns in the Ks dataframe, for instance "N", "S", "t", "g1", "g2", "dN/dS", so that the initial occupied memory (and later that of the function fit_apgmm_guide) is reduced to some extent. But if the number of anchor pairs is very large, the maximum occupied memory will be accordingly high anyway, because of factors such as the imported EM algorithm, one of the largest memory consumers. Could you try again with a lighter Ks data file and see how it works?

(Attached memory profiles: run 1 — Eriobotrya japonica; run 2 — Vigna angularis, default settings; run 3 — Vigna angularis with --em_iter 10 --n_init 10)
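
For reference, a minimal sketch of how the peak memory of a wgd peak run could be sampled every 0.1 s with memory_profiler, as described above. The wgd peak arguments and file names here are placeholders, not the exact commands used in the test:

```python
# Sketch: sample the memory of an external `wgd peak` run every 0.1 s
# using memory_profiler, then report the maximum observed value.
import subprocess
from memory_profiler import memory_usage

# Launch wgd peak as a child process (arguments are illustrative placeholders).
proc = subprocess.Popen(["wgd", "peak", "ks_distribution.tsv", "-o", "wgd_peak"])

# memory_usage polls the child's memory at the given interval until it exits
# and returns a list of samples in MiB.
samples = memory_usage(proc, interval=0.1)
print(f"peak memory: {max(samples):.1f} MiB over roughly {len(samples) * 0.1:.0f} s")
```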
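
And a minimal sketch of trimming the unused columns before rerunning wgd peak, assuming the Ks data file is a tab-separated table with the gene-pair identifier in the first column; the file names are placeholders:

```python
# Sketch: drop columns that are not needed for peak finding from a Ks data file
# to lower the memory footprint of the loaded dataframe.
import pandas as pd

ks_file = "ks_distribution.tsv"  # placeholder path to your Ks data file

df = pd.read_csv(ks_file, sep="\t", index_col=0)

# Columns suggested above as unused for peak finding; drop only those present.
unused = ["N", "S", "t", "g1", "g2", "dN/dS"]
df = df.drop(columns=[c for c in unused if c in df.columns])

df.to_csv("ks_distribution.trimmed.tsv", sep="\t")
```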