Open LamuelCH opened 6 months ago
The runtimes you have observed look somewhat too large. Given that your number of species is small, I would expect that the spatial component is the part not performing well. Based on our internal performance comparisons with various data sizes, I would say your runtimes are roughly 10x those of equivalent tasks on my laptop, which is not an HPC at all.
There are two potential reasons that come to my mind right now. Could you rerun with
nChains=1
or with
nParallel=1
and report whether they significantly differ in terms of sec/iteration? Note that the nParallel used in predictions can differ from its value in the sampling phase.
Thanks heaps!!!
After spatially sorting the observations by nearest neighbour, the processing speed gain is huge! Now with thin = 500 I can finish within 5 hours!! Thanks for keeping my PhD alive :D
But this only works on my personal Mac Studio. If I try to deploy it on an HPC with a Linux system, spatial sorting does not seem to help. Would you happen to have any ideas? I may need to increase the number of sampling units and species later on, at which point my personal Mac may become a bottleneck.
Using nParallel = 1, nChains = 1, samples = 250, thin = 1 and transient = 50, the running time on my own Mac Studio is

[1] "MODEL START: Mon May 27 23:45:53 2024"
[1] "MODEL END: Mon May 27 23:45:58 2024"

i.e. 5 seconds, but the same run on the HPC gives

[1] "MODEL START: Mon May 27 23:42:39 2024"
[1] "MODEL END: Mon May 27 23:48:22 2024"

i.e. nearly 6 minutes.
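For reference, the per-iteration cost implied by these timestamps can be computed in base R (300 iterations in total: transient = 50 plus 250 samples at thin = 1):

```r
# Wall-clock sec/iteration implied by the reported MODEL START/END stamps.
iters <- 50 + 250 * 1   # transient + samples * thin
mac <- as.numeric(difftime(as.POSIXct("2024-05-27 23:45:58", tz = "UTC"),
                           as.POSIXct("2024-05-27 23:45:53", tz = "UTC"),
                           units = "secs"))
hpc <- as.numeric(difftime(as.POSIXct("2024-05-27 23:48:22", tz = "UTC"),
                           as.POSIXct("2024-05-27 23:42:39", tz = "UTC"),
                           units = "secs"))
# About 0.017 s/iter on the Mac vs 1.143 s/iter on the HPC, a ~69x slowdown.
round(c(mac_sec_per_iter = mac / iters,
        hpc_sec_per_iter = hpc / iters,
        slowdown = hpc / mac), 3)
```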
Below is the observation row order in my data:
Hi, if I may tune in on this, as I also have problems with long running times and am seeking anything that can speed them up, I wonder about the sorting of observations suggested as one solution.
1) What exactly is sorted: the XData/studyDesign objects, the object provided as sData when constructing the random level object, or both?
2) The improvement LamuelCH reached seems to have used a Travelling Salesman Problem (TSP) "algorithm" for the nearest-neighbour sorting, and I wonder a) is this a good method to satisfy the NNGP algorithm's requirements, and b) what package/function was used to get the ordering according to TSP?
Any help is highly appreciated!
Thanks!
@MartinStjernman you need to sort the names of the sData rows so that their lexicographic order matches the desired one. Personally, I typically add a numerical prefix, like 0001_first_site_original_name, 0002_second_site_original_name. You would also need to update the corresponding column of studyDesign accordingly.
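A minimal sketch of this renaming (with made-up site names and toy coordinates; in a real model the same renaming would be applied to the rownames of sData and to the matching studyDesign column):

```r
# Toy coordinates for four sites (hypothetical names).
set.seed(42)
coords <- cbind(x = runif(4), y = runif(4))
rownames(coords) <- c("siteB", "siteD", "siteA", "siteC")

# Rank the sites along x (a stand-in for the desired spatial order) and add a
# zero-padded numeric prefix so that lexicographic order matches that ranking.
r <- rank(coords[, "x"])
rownames(coords) <- sprintf("%04d_%s", r, rownames(coords))

# The spatial column of studyDesign must use the same renamed site labels.
studyDesign <- data.frame(site = factor(rownames(coords)))

# Sorting the names now recovers the spatial order along x:
coords[sort(rownames(coords)), "x"]
```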
I am quite sceptical whether TSP is well suited for this problem. First of all, you do not need to return to the origin in the NNGP scenario. Next, it is not the total distance we worry about, but that neighbours are not too far apart in the resulting order. My guess is that in many cases you can simply order along lon/lat. Preferably, project onto the leading eigenvector (principal component) of your sites' coordinates.
N <- 100
X <- cbind(2 * runif(N), runif(N))   # simulated site coordinates
plot(X[, 1], X[, 2])
pc <- prcomp(X)                      # principal components of the coordinates
proj <- X %*% pc$rotation[, 1]       # project sites onto the leading PC
optOrder <- rank(proj)               # site order along that axis
plot(X[, 1], X[, 2], type = "n")
text(X[, 1], X[, 2], optOrder)       # label each site with its position in the order
Of course, there are exceptions - if you are studying some coastal communities, then the best way would be to order along the coast.
Thanks a lot Gleb!
I take it the reason I need the names of my sites (i.e. rownames in sData) such that their lexicographic order matches the desired one is that sData is sorted "under the hood" when constructing the random level object with HmscRandomLevel() (i.e. the step rL$pi = as.factor(sort(rownames(sData)))).
I will try this out, although I think my sites are already quite well sorted (site names are "sort of" coordinates).
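One thing worth double-checking before relying on names that are "sort of" coordinates: string sorting is lexicographic rather than numeric, so unpadded numbers in site names can sort unexpectedly, which is presumably why the zero-padded prefixes are recommended:

```r
# Lexicographic sort puts "site10" before "site2".
sort(c("site2", "site10", "site1"))     # "site1"  "site10" "site2"

# Zero-padding restores the intended numeric order.
sort(sprintf("site%03d", c(2, 10, 1)))  # "site001" "site002" "site010"
```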
I have, if I may, one additional question. My sites are aggregated in small clusters (cluster is also included as a non-spatial/unstructured random effect), and I have adjusted the alphapw prior for the site random effect to the scale of sites within clusters. With such a "local" prior, is it still beneficial (for speed) to spatially sort the clusters, or is it enough for the sites to be spatially sorted within clusters?
Thanks again for the excellent package and help!
Hi,
I was running a spatial dataset with 2,419 sampling units, 17 covariates (8 continuous covariates with 2nd-order terms enabled plus 1 intercept), 3 species, and 1 spatial random level. I found it painfully slow to run the spatially explicit model even with NNGP.
For thin = 1, samples = 1000, nChains = 4, it takes around four hours on my HPC (96 cores, 1000 GB RAM) to complete; for thin = 10 it takes 37 hours. I've read the paper "Computationally efficient joint species distribution modeling of big spatial data", which uses a much larger dataset and more restricted computational resources but still finishes within an hour.
Assuming I need thin = 100 or higher to achieve convergence, it will take insanely long for my model to complete, and knowing that n-fold cross-validation needs an extra n times the original running time is very daunting.
Just wondering if I have defined the model wrongly, or is there something I can do to significantly reduce computation time? Is it a good choice to increase the number of chains with fewer samples each (say, nChains = 16, samples = 250, thin = 1) to achieve the same number of posterior samples (4000), which I could run in parallel on the HPC to harness its power? Would that increase efficiency? Or should I use Hmsc-HPC?
I have also included my complete script and data here: untitled folder.zip
*Also, a question about making spatial predictions: does nParallel need to equal the number of chains of my model?

nParallel = 4
predY.full = predict(mFULL, Gradient = Gradient.full, predictEtaMean = TRUE, expected = TRUE, nParallel = nParallel)
Really appreciate the help !!! 🙏🏻