hmsc-r / HMSC


Spatial Model running extremely slow #189

Open LamuelCH opened 6 months ago

LamuelCH commented 6 months ago

Hi,

I was running a spatial dataset of 2,419 sampling units, 17 covariates (8 continuous covariates with 2nd-order terms, plus 1 intercept), 3 species, and 1 spatial random level. I found that the spatially explicit model is painfully slow to run, even with NNGP.

For thin = 1, samples = 1000, nChains = 4, it takes around four hours on my HPC (96 cores, 1000 GB RAM) to complete; for thin = 10 it takes 37 hours. I've read the paper "Computationally efficient joint species distribution modeling of big spatial data", which uses a much larger dataset and more restricted computational resources yet still finishes within an hour.

Assuming I need thin = 100 or higher to achieve convergence, it will take insanely long for my model to complete, and knowing that n-fold cross-validation needs an extra n times the original running time is very daunting.

I just wonder whether I have defined the model wrongly, or whether there is something I can do to significantly reduce the computation time. Also, is it a good choice to increase the number of chains with fewer samples each (say, nChains = 16, samples = 250, thin = 1) to reach the same number of posterior samples (4000), which I could run in parallel on the HPC to harness its power? Would that increase efficiency? Or should I use Hmsc-HPC?

# We are now ready to define the HMSC model.
# Here we define a full model and a single-species model.
# To define a spatial random effect at the level of the ID,
# we need to include the route IDs in the studyDesign.

studyDesign = data.frame(ID = XData$ID)

# We next define the random level object. If we were defining an unstructured random effect, we would use
# the units argument. As we wish to define a spatial random effect, we use the sData argument instead.

rL = HmscRandomLevel(sData=xy, sMethod = "NNGP", longlat = TRUE)

# Make sure longlat = TRUE is enabled if the CRS is WGS84.
# Note that the row names of xy correspond to the units of the studyDesign. This is necessary for
# Hmsc to understand how the units of the random effect (rows of xy) correspond to the sampling units
# (rows of studyDesign, Y and XData). While in this example the dimensions of xy and studyDesign match,
# this is not necessarily the case. For example, if we had included multiple surveys of the same sites,
# each survey would be a row of studyDesign, but sData (or units in the case of an unstructured random
# effect) should have each site only once.
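
# A quick optional sanity check (a small sketch, not in the original script):
# every sampling unit in studyDesign should have matching coordinates in xy.
stopifnot(all(as.character(studyDesign$ID) %in% rownames(xy)))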

# As environmental covariates, we include the second-order response to the continuous covariates.
# We use the poly function to define the second-order response. We further use raw = TRUE to make
# the model formulation consistent with the way in which predictions over environmental gradients
# are made in Hmsc.

XFormula = ~ poly(bio1, degree = 2, raw = TRUE) +
  poly(bio12, degree = 2, raw = TRUE) +
  poly(bio4, degree = 2, raw = TRUE) +
  poly(bio15, degree = 2, raw = TRUE) +
  poly(roadLength, degree = 2, raw = TRUE) +
  poly(Regolith, degree = 2, raw = TRUE) +
  poly(forestCover, degree = 2, raw = TRUE) +
  poly(foliageCover, degree = 2, raw = TRUE)

# We are now ready to define the model. 
# Note that in ranLevels = list(ID = rL), "ID" refers to a column name of the studyDesign.

mFULL = Hmsc(Y=Y, XData = XData, XFormula=XFormula,
             distr="probit", studyDesign=studyDesign, 
             ranLevels=list(ID=rL))

mSINGLE = Hmsc(Y=Y.single, XData = XData, XFormula=XFormula,
               distr="probit", studyDesign=studyDesign,
               ranLevels=list(ID=rL))

nChains = 4
nParallel = nChains 
samples = 1000

for (thin in c(1, 10, 100, 1000)){

  verbose = 50*thin
  transient = 50*thin

  #########################
  print(date())
  print(paste0("model = FULL"))
  print(paste0("thin = ",as.character(thin),"; samples = ",as.character(samples)))

  mFULL = sampleMcmc(mFULL, thin = thin, samples = samples, transient = transient, verbose = verbose,
                     nChains = nChains, initPar = "fixed effects",
                     nParallel = nParallel)

  filename = file.path(model.directory,
                       paste0("FULL_model_chains_", nChains,
                              "_samples_", samples, "_thin_", thin))

  save(mFULL,file=filename)

  ########################
  print(date())
  print(paste0("model = SINGLE"))
  print(paste0("thin = ",as.character(thin),"; samples = ",as.character(samples)))

  mSINGLE = sampleMcmc(mSINGLE, thin = thin, samples = samples, transient = transient,verbose = verbose,
                       nChains = nChains, initPar = "fixed effects",
                       nParallel = nParallel)

  filename = file.path(model.directory,
                       paste0("SINGLE_model_chains_", nChains,
                              "_samples_", samples, "_thin_", thin))
  save(mSINGLE,file=filename)
}

I have also included my complete script and data here: untitled folder.zip

Also, a question about making spatial predictions: does nParallel need to equal the number of chains of my model?

nParallel = 4
predY.full = predict(mFULL, Gradient=Gradient.full, predictEtaMean = TRUE, expected = TRUE, nParallel=nParallel)

Really appreciate the help !!! 🙏🏻

gtikhonov commented 6 months ago

The runtimes you have observed look somewhat too large. Given that your number of species is small, I would expect that the spatial component is not performing as well as it should. Based on our internal performance comparisons with various data sizes, I would dare to say that this is approximately 10x slower than equivalent tasks on my laptop, which is no HPC at all.

There are two potential reasons that I have in my mind right now:

  1. If you are running chains in parallel (which you do), some R distribution/OS combinations are known to get convoluted due to the interplay between cross-chain parallelization (intended) and the default within-chain parallelization of the called linear algebra routines (not very much intended). Can you check your short runs with nChains = 1 or with nParallel = 1 and report whether they significantly differ in terms of sec/iteration or not? (See the sketch after this list.)
  2. The NNGP approximation algorithm is sensitive to the order of the spatial units. E.g. if you rename your spatial units so that their order is perturbed, the approximation will be different. The rule of thumb is that the order should be such that no neighbours are far away from each other in the order. If the ordering is random, then technically NNGP can be just as slow as a full-covariance GP. HMSC does not handle this aspect automatically. Could you please check whether your spatial units are ordered (in terms of their factor/string values) in a somewhat reasonable way, e.g. along the longest axis of your study area?
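
One quick way to test point 1 is to pin the linear algebra libraries to a single thread before calling sampleMcmc. This is only a minimal sketch, assuming the RhpcBLASctl package (not part of Hmsc) is installed:

library(RhpcBLASctl)
blas_get_num_procs()     # how many threads BLAS is currently allowed to use
blas_set_num_threads(1)  # restrict within-chain linear algebra to a single thread
omp_set_num_threads(1)   # likewise for OpenMP-based routines

If sec/iteration improves after this, the slowdown was due to the two levels of parallelization competing for cores.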

The nParallel in predictions can be different from its value in the sampling phase.

LamuelCH commented 6 months ago

Thanks heaps!!!

After spatially sorting the observations by nearest neighbour, the speed gain is huge! Now with thin = 500 I can finish within 5 hours!! Thanks for keeping my PhD alive :D

But this only works on my personal Mac Studio. If I deploy it on the HPC with a Linux system, it seems that the spatial sorting does not help. Would you happen to have any ideas? I may need to increase the number of sampling units and species later on, for which my personal Mac may become a bottleneck.

LamuelCH commented 6 months ago

Using nParallel = 1, nChains = 1, samples = 250, thin = 1 and transient = 50, the running time on my own Mac Studio is

[1] "MODEL START: Mon May 27 23:45:53 2024"
[1] "MODEL END: Mon May 27 23:45:58 2024"

i.e. 5 seconds, but the same run on the HPC gives

[1] "MODEL START: Mon May 27 23:42:39 2024"
[1] "MODEL END: Mon May 27 23:48:22 2024"

i.e. nearly 6 minutes.

Below is the observation row order in my data:

[screenshot: observation row order]

MartinStjernman commented 5 months ago

Hi,

If I may tune in on this, as I also have problems with long running times and am seeking anything that can speed things up, I wonder about the sorting of observations suggested as one solution:

  1. What exactly is sorted: the XData/studyDesign objects, the object provided as sData when constructing the random level object, or both?
  2. It seems the improvement reached by LamuelCH when sorting with nearest neighbour used a Travelling Salesman Problem (TSP) "algorithm", and I wonder (a) whether this is a good method for satisfying the requirements of the NNGP algorithm, and (b) what package/function was used to get the ordering according to TSP?

Any help is highly appreciated!

Thanks!

gtikhonov commented 5 months ago

@MartinStjernman you need to sort the names of the sData rows so that their lexicographic order matches the desired one. Personally, I typically add a numerical prefix, like 0001_first_site_original_name, 0002_second_site_original_name. Also, you would need to update the corresponding column of studyDesign accordingly.
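
A minimal sketch of this renaming, using the xy and studyDesign objects from the code earlier in this thread (ordering along the first coordinate is just an example criterion):

oldNames = rownames(xy)
pos = rank(xy[, 1], ties.method = "first")               # desired position of each site
newNames = sprintf("%04d_%s", as.integer(pos), oldNames) # zero-padded prefix fixes the lexicographic order
rownames(xy) = newNames
studyDesign$ID = factor(newNames[match(as.character(studyDesign$ID), oldNames)])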

I am quite sceptical whether TSP is best suited for this problem. First of all, you do not need to return to the origin in the NNGP scenario. Next, it is not the total distance that we are worried about, but that neighbours are not too far apart in the resulting order. My guess is that in many cases you can simply order along the lon/lat. Preferably, you should project onto the leading eigenvector (principal component) of your sites' coordinates:

N = 100
X = cbind(2*runif(N), runif(N))  # synthetic site coordinates, twice as wide in x as in y
plot(X[,1], X[,2])
pc <- prcomp(X)                  # principal components of the coordinates
proj = X %*% pc$rotation[,1]     # project sites onto the leading principal axis
optOrder = rank(proj)            # site order along that axis
plot(X[,1], X[,2], type="n")
text(X[,1], X[,2], optOrder)     # label each site with its position in the order

Of course, there are exceptions: if you are studying some coastal communities, for example, then the best way would be to order along the coast.

MartinStjernman commented 5 months ago

Thanks a lot Gleb!

I take it that the reason I need the names of my sites (i.e. the rownames in sData) to be such that their lexicographic order matches the desired one is that sData is sorted "under the hood" when constructing the random level object using HmscRandomLevel() (i.e. the step rL$pi = as.factor(sort(rownames(sData)))). I will try this out, although I think my sites are already quite well sorted (the site names are "sort of" coordinates). If I may, I have one additional question. My sites are aggregated in small clusters (cluster is also included as a non-spatial/unstructured random effect), and I have adjusted the alphapw prior for the site random effect to the scale of sites within clusters. With such a "local" prior, is it still beneficial (for speed) to spatially sort the clusters, or is it enough for the sites to be spatially sorted within clusters?

Thanks again for the excellent package and help!