gordy2x / ecoCopula

R package to find species interactions from co-occurrence data

cord() function on ~3000 species takes a large amount of memory #20

Open quantitative-ecologist opened 1 year ago

quantitative-ecologist commented 1 year ago

Good day,

I am fitting a model-based ordination with Gaussian copulas using the cord() function on a stackedsdm object. I have ~3000 bacterial species. I first tried to run the model on my personal computer (32 GB of RAM, only 2 cores) and it completely froze. I am now running the model on a remote computer cluster with 20 cores and 64 GB of RAM, and I see that memory usage is at its maximum.
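
For context, a minimal sketch of the workflow being described (the data objects `abund` and `env` are hypothetical placeholders, and the argument names are as I recall them from the package documentation, so treat this as a sketch rather than my exact call):

```r
library(ecoCopula)

# abund: hypothetical sites-by-species matrix (~3000 columns)
# env:   hypothetical data frame of site covariates
fit <- stackedsdm(abund, formula_X = ~ 1, data = env,
                  family = "negative.binomial")

# ordination via Gaussian copulas; this is the step that
# exhausts memory with ~3000 species
ord <- cord(fit)
```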

Is this normal? I thought the package would perform well with large data, but it seems to struggle with this many species. Do you know if there is a performance threshold around a certain number of species?

Thank you very much.

quantitative-ecologist commented 1 year ago

I think my issue is also similar to #19.

gordy2x commented 1 year ago

Hi quantitative-ecologist,

Thank you for raising this. The package is pretty bare bones and, unfortunately, not very optimized, so I would expect memory issues at some point: to do importance sampling, we generate n.samp (default 500) sets of matrices the same size as the response. The fitting works surprisingly well with a small n.samp (even 5), so that would be my first option with large response matrices. I should probably switch to a different algorithm when the response matrix gets too big; I'll think on that.

Best, Gordana
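
To make that memory cost concrete (a back-of-envelope estimate, not a figure from the package, and the site count is assumed since it isn't given in the thread): each simulated matrix has the same dimensions as the response, so with, say, 500 sites × 3000 species stored as doubles, n.samp = 500 copies alone come to roughly 500 × 500 × 3000 × 8 bytes ≈ 6 GB, before any intermediate copies R makes. The suggested mitigation is a one-argument change:

```r
# same hypothetical `fit` as in the sketch above; n.samp controls
# the number of importance-sampling replicates cord() generates
ord_small <- cord(fit, n.samp = 5)
```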


quantitative-ecologist commented 1 year ago

Hello @gordy2x,

Thank you very much for your quick response. I will try with a smaller n.samp, then. I think the ecoCopula framework is amazing, and I am looking forward to future developments.

Cheers.

Maxime

quantitative-ecologist commented 1 year ago

Good day @gordy2x,

Just as a follow-up to your suggestion to reduce n.samp: unfortunately, it does not work for me. I tried with:

In the first three instances, I get the following error:

Error in solve.default(cv) :
  system is computationally singular: reciprocal condition number = 4.53686e-22
Calls: cord ... factor_opt -> factanal -> diag -> solve -> solve.default

For n.samp = 500, I still get huge memory usage, which kills the job on the remote cluster due to insufficient memory (> 120 GB of RAM).
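
A plausible reading of that traceback (my interpretation, not confirmed in this thread): cord() passes a covariance-like matrix to factanal(), which inverts it with solve(), and when the number of species far exceeds the effective number of observations, that matrix is rank-deficient, so the inversion fails. A self-contained illustration of the same failure mode:

```r
# Illustration only: a sample covariance estimated from far fewer
# rows than columns is rank-deficient, so inverting it fails with
# the same "computationally singular" error seen above.
set.seed(1)
n_obs <- 20                      # few observations
n_sp  <- 100                     # many "species"
X  <- matrix(rnorm(n_obs * n_sp), n_obs, n_sp)
cv <- cov(X)                     # 100 x 100, but rank <= n_obs - 1

rcond(cv)                        # ~0: reciprocal condition number
# solve(cv)                      # errors: system is computationally singular
```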