hmsc-r / HMSC

Cholmod error 'problem too large' #97

Open pyspider opened 3 years ago

pyspider commented 3 years ago

Hi, I got this error when I ran Hmsc on an HPC cluster. The code is from this issue: https://github.com/hmsc-r/HMSC/issues/95#issuecomment-829089631. Many thanks for your help in advance!

Error in checkForRemoteErrors(val) :
  4 nodes produced errors; first error: Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Calls: sampleMcmc ... clusterApplyLB -> dynamicClusterApply -> checkForRemoteErrors
Execution halted

jarioksa commented 3 years ago

The error comes from the Matrix package that we use for some calculations. "Problem too large" means that the data dimensions exceed the integer maximum. I didn't check which integer length and maximum the Matrix package uses, but in my R the integer maximum is 2147483647 (2^31 - 1). Could you please try running your code with fewer than 200,000 saved and indexed samples and see if that helps. You can run your preliminary tests with much smaller sample sizes than 200,000, say 2,000, and use those to decide on the needed transient, thin and final number of saved samples.
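For example, a small preliminary run could look something like this (only a sketch: m1 stands for whatever your unfitted Hmsc model object is called, and the thin and transient values are placeholders you need to tune):

## sketch of a small preliminary run; tune thin and transient to your data
m.test = sampleMcmc(m1, samples = 2000, thin = 10, transient = 5000, nChains = 4)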

The error can also be caused by your spatial data model. If your data set is very large, a full spatial model can become really big. Your previous issue #95 did not show what kind of spatial model you used. However, you should consider the spatial methods for large data sets (NNGP, GPP) if your data set is large.

pyspider commented 3 years ago

I don't think I use spatial methods. Here is my code for the random effects (date & spatial). Does it mean I am using a full spatial model?

> S<-rename(S,c('Collection_Date_formatted_1'='date.circular'))
> rL.date = HmscRandomLevel(units = levels(S$date.circular))
> xycoords = as.matrix(cbind(S$UTM.E,S$UTM.N))
> rownames(xycoords)=S$Site_ID
> colnames(xycoords)=c("x-coordinate","y-coordinate")
> unique(xycoords)
#           x-coordinate y-coordinate
#14001          480218.7      2798887
#14002          479622.3      2799269
#18002          478815.2      2798683
#14004          480294.0      2799440
#14003          480410.1      2799691
#18001          479534.7      2797915
#DLJ_14004      429983.9      3061447
#DLJ_14001      428240.0      3062913
#DLJ_18001      431084.4      3062462
#DLJ_18002      430933.9      3061668
#DLJ_14002      429477.3      3063368
#DLJ_14003      429607.7      3062402
#DLJ_14001M     431722.6      3061802
> rL.spatial = HmscRandomLevel(sData = unique(xycoords))
> S<-S[,-6:-7]
> mito.treepl <- read.tree(file.path(data.directory, "mitogenome_treePL_for_hmsc_V6_20210428.newick"))
> is.ultrametric(mito.treepl)
> m1 = Hmsc(Y=Y, XData = X, XFormula = ~ temperature.month.mean + True_Altitude, phyloTree = mito.treepl, studyDesign = S, ranLevels=list(Site_ID=rL.spatial,date.circular=rL.date),distr="lognormal poisson")

Actually, I only have 13 unique spatial coordinates in sData and 521 rows (dates). This data set isn't very large, right? Maybe I need to test with a smaller sample. Thanks!

jarioksa commented 3 years ago

Yes, it means that you are using a full spatial model. When you supply sData, you have a spatial model, and the default method is "Full". The distance matrix alone grows with the square of the number of spatial units, and the calculations need extra memory on top of that. If you have large data, you should use the spatial methods for large data sets, such as NNGP or GPP. See the discussion in the vignette on spatial models, accessible via browseVignettes("Hmsc") or directly with vignette("vignette_4_spatial", "Hmsc").
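Switching method only changes how the random level is defined; something along these lines (a sketch following the spatial vignette; nNeighbours = 10 is just an example, and NNGP/GPP only pay off when there are many spatial units):

## NNGP: nearest-neighbour Gaussian process instead of the full spatial model
rL.spatial.nngp = HmscRandomLevel(sData = unique(xycoords), sMethod = "NNGP",
                                  nNeighbours = 10)
## for GPP you additionally need knot locations; see constructKnots() and the vignette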

jarioksa commented 3 years ago

These numbers for spatial and temporal structures are not large, indeed. However, somewhere you do get data structures so large that the Matrix package fails. Matrix is beyond our control, and all we can do (and try to do) is to feed it stuff it can munch. I can't reproduce your problem. One thing that is surprising is your sample size of 4 × 200,000. There is no reason to use such large sample sizes; you would probably get better solutions with, say, samples=2000, thin=100, with the same amount of calculation but much more compact result objects.

pyspider commented 3 years ago

Many thanks for your help! I submitted a new hmsc run with a smaller number of thinned independent samples.

# for test
samples_list = c(2000)
thin_list = c(1, 100)
nChains = 4
models = list(m1)
modelnames = c("whole_mitogenome_tree")

for (Lst in 1:length(samples_list)) {
  samples = samples_list[Lst]
  for (i in 1:length(thin_list)) {
    thin = thin_list[i]
    print(paste0("thin = ", thin, "; samples = ", samples))
    nm = length(models)
    for (model in 1:nm) {
      print(paste0("model = ", modelnames[model]))
      m = models[[model]]
      m = sampleMcmc(m, thin = thin, samples = samples,
                     adaptNf = rep(ceiling(0.4 * thin * samples), m$nr),
                     transient = ceiling(0.5 * samples * thin),
                     nChains = nChains, nParallel = nChains)
      models[[model]] = m
    }
    filename = file.path(model.directory,
                         paste0("model_", modelnames[model],
                                "_chains_", as.character(nChains),
                                "_samples_", as.character(samples),
                                "_thin_", as.character(thin)))
    save(models, file = filename)
  }
}

jarioksa commented 3 years ago

Mixing is very poor in the lognormal Poisson model that you use, and thin=100 may not be sufficient, but you can analyse that from your test runs (convertToCodaObject and the tools in coda). A general piece of advice is to start with short runs, analyse their convergence, and then run the final analyses with good-looking transient and thin values. However, there is rarely a need to increase the total number of saved samples beyond some thousands; rather increase thin (and transient) if there are convergence problems.
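The convergence check itself is the standard one from the Hmsc vignettes; roughly (a sketch, where m is a fitted model from one of your test runs):

library(coda)
mpost = convertToCodaObject(m)                        # posterior as coda mcmc.list objects
effectiveSize(mpost$Beta)                             # effective sample sizes for the Beta parameters
gelman.diag(mpost$Beta, multivariate = FALSE)$psrf    # potential scale reduction factors (needs >= 2 chains)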

pyspider commented 3 years ago

I still got the same error for this hmsc run with a smaller sample (thin = 1; samples = 2000).

setting updater$Gamma2=FALSE due to specified phylogeny matrix
Error in checkForRemoteErrors(val) :
  4 nodes produced errors; first error: Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Calls: sampleMcmc ... clusterApplyLB -> dynamicClusterApply -> checkForRemoteErrors
Execution halted

To feed Hmsc something it can munch, I tried omitting the spatial part of the model and submitted a new hmsc run:

#convert the date into circular factor
S$Collection_Date_formatted_1<-circular(S$Collection_Date_formatted_1)
S<-rename(S,c('Collection_Date_formatted_1'='date.circular'))
rL.date = HmscRandomLevel(units = levels(S$date.circular))
S<-S[,-6:-7]

mito.treepl <- read.tree(file.path(data.directory, "mitogenome_treePL_for_hmsc_V6_20210428.newick"))
is.ultrametric(mito.treepl)

m1 = Hmsc(Y=Y, XData = X, XFormula = ~ temperature.month.mean + True_Altitude, phyloTree = mito.treepl, studyDesign = S, ranLevels=list(date.circular=rL.date),distr="lognormal poisson")

#for test
samples_list = c(2000)
thin_list = c(1,100)
nChains = 4
jarioksa commented 3 years ago

We are working blind: we do not know your data, and we do not really know your models (only the random bits that trickle in with your messages). What looks obvious is that the error happens in the Cholesky decomposition in the Matrix package, which seems to get data that exceed its capacity (this is what the error message says). You are obviously running into trouble with your models, and there is no sense in launching a full-scale analysis before you debug the case and fix the problems. So here are some hints for the trial & error search (a minimal sketch of the whole sequence follows the list):

  1. Do not use parallel processing in your trials: it obscures the error messages, as these are not reported directly from the workers. Only use parallel processing after you get your model working.
  2. After you get an error, write traceback(). This gives the sequence of commands that led to the error, tells us which function is failing, and possibly also shows which of your data structures causes the problem. Now I assume it was a Cholesky decomposition, but of what? We use it in at least twelve functions, and with sampleMcmc the first cases may be in computing the data parameters from the phylogenetic tree, and then come many more.
  3. Before sampling, call dump <- computeDataParameters(m1) for your model m1 to see if the initial setting is acceptable.
  4. You can quite well use shorter runs. For instance, samples = 1 is a good starter to help you see whether the error is deterministic and happens at the first step, or only occurs when sampling proceeds into alien territory.
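Put together, the sequence could look something like this (only a sketch, assuming your unfitted model object is called m1 as in your earlier message):

## 3. check that the data parameters (e.g. from the phylogenetic tree) can be computed
dump <- computeDataParameters(m1)

## 1. & 4. a single-sample, single-chain, non-parallel run
m.test <- sampleMcmc(m1, samples = 1, thin = 1, transient = 0,
                     nChains = 1, nParallel = 1)

## 2. if either call above fails, run traceback() at the prompt to see
##    which function raised the CHOLMOD error
# traceback()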
pyspider commented 3 years ago

> 1. Do not use parallel processing in your trials …

Yes. I submitted a traceback run with nChains/nParallel = 1 and the full spatial model.

> 2. After you get an error, write traceback() …

I wrote traceback() for the above test run.

> 3. Before sampling, call dump <- computeDataParameters(m1) …

Here are the results of computeDataParameters (hmsc_mito_traceback.zip). I don't understand the initial setting. Is it acceptable?

> 4. You can quite well use shorter runs. For instance, samples = 1 …

It looks like the test run is working smoothly (see the attached).

jarioksa commented 3 years ago

Does this mean that it ran smoothly? I did not find any error messages in the output you sent. The point of running computeDataParameters first was only to see that these parameters can be computed without error, and that seemed to be the case. It also seems that the short test run went smoothly, as there were no errors. Now run a longer test (but still non-parallel, so that we can trace a possible error and monitor the progress). If there are no errors, you should be safe with the same model.
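For instance (a sketch; the numbers are arbitrary, the point is a longer run that is still single-chain and non-parallel, with progress reporting):

## a somewhat longer, still non-parallel test run; verbose controls the reporting interval
m.test = sampleMcmc(m1, samples = 100, thin = 1, transient = 50,
                    nChains = 1, nParallel = 1, verbose = 10)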

pyspider commented 3 years ago

Yes, it ran smoothly. I got the fitted test model (samples=1). I ran another non-parallel test (with traceback()) and samples=2000. It is still running.

pyspider commented 3 years ago

Although I added traceback(), I did not find any error messages in the output except the same error (i.e., Cholmod error 'problem too large') after I ran the tests (samples=1000 or samples=2000). These tests encountered the same error during the transient sampling. However, the test with samples=100 ran smoothly. I am not sure how to fix this. I sent you an email with the data and code attached. Many thanks for your help in advance.

jarioksa commented 3 years ago

It seems that you have some large matrices. I think you have something like 2718 taxonomic units, and this can give some very large matrices (the phylogenetic correlation matrix alone is 2718 × 2718 = 7,387,524 elements). This is still safe for indexing. However, you may run out of the memory that R has reserved for you. For instance, the function mem.maxVSize() tells you the maximal size of the vector heap in R and allows you to reserve more space. The help page ?Memory tells more about memory handling in R.

Currently, Hmsc algorithms use full matrix algebra, including frequent matrix inverses, and it seems that you finally end up with this CHOLMOD error in exactly one of these matrix inverses. If you cannot configure your R to use a sufficient amount of heap memory, your problem may just be too big for Hmsc in your environment.

pyspider commented 3 years ago

Here is my R environment. Does it mean that R does not set a memory limit here?

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.5 (Maipo)

Matrix products: default
BLAS/LAPACK: /gpfs/software/OpenBLAS-0.3.9/lib/libopenblas_haswellp-r0.3.9.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ulimit_0.0-3

loaded via a namespace (and not attached):
 [1] rstudioapi_0.13   magrittr_2.0.1    usethis_2.0.1     devtools_2.4.0
 [5] pkgload_1.2.1     R6_2.5.0          rlang_0.4.10      fastmap_1.1.0
 [9] tools_3.6.3       pkgbuild_1.2.0    sessioninfo_1.1.1 cli_2.4.0
[13] withr_2.4.1       ellipsis_0.3.1    remotes_2.3.0     rprojroot_2.0.2
[17] lifecycle_1.0.0   crayon_1.4.1      processx_3.5.1    purrr_0.3.4
[21] callr_3.6.0       fs_1.5.0          ps_1.6.0          curl_4.3
[25] testthat_3.0.2    memoise_2.0.0     glue_1.4.2        cachem_1.0.4
[29] compiler_3.6.3    desc_1.3.0        prettyunits_1.1.1
> mem.maxVSize()
[1] Inf
> ulimit::memory_limit()
soft hard
 Inf  Inf
jarioksa commented 3 years ago

Inf means that no hard limit is set. In my system there was a finite mem.maxVSize, and I ran into trouble already in computeDataParameters. However, I don't have the hardware to test your problem, so you will have to find the core of the problem yourself, and it may be that your problem is too large for analysis. It very much looks like a conflict between problem size and computer capacity (and the error message in cholmod_dense.c hints that the limit for 32-bit integers may have been hit).

pyspider commented 3 years ago

Thanks! I got the same error when I ran this test with 12 TB of RAM. I guess it is not an out-of-memory issue, because the test run had an exit status of zero, which indicates successful completion.

[1] "thin = 1; samples = 1000"
[1] "model = whole_mitogenome_tree_12T"
Chain 1, iteration 20 of 1500 (transient)
Chain 1, iteration 40 of 1500 (transient)
Chain 1, iteration 60 of 1500 (transient)
Chain 1, iteration 80 of 1500 (transient)
Chain 1, iteration 100 of 1500 (transient)
Error in asMethod(object) :
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Calls: sampleMcmc ... chol.default -> as.matrix -> as.matrix.Matrix -> as -> asMethod
Execution halted

Here is the job information on fat nodes with 12T RAM.

jobname=hmsc_trace queue=mem12T ctime=1620369744 qtime=1620369744 etime=1620369744 start=1620369755 ... exec_host=cu54/0 Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1 Resource_List.nodes=1:ppn=1 session=318811 total_execution_slots=1 unique_node_count=1 end=1620392838 Exit_status=0 resources_used.cput=151717 resources_used.energy_used=0 resources_used.mem=201789956kb resources_used.vmem=212644600kb resources_used.walltime=06:24:40

The test without the spatial part is still running (nParallel=4, samples=2000, thin=c(1,100)). The full spatial model seems to cause a very large computational problem, so I will omit the spatial part of the model from here on.

pyspider commented 3 years ago

After the test (nParallel=4, samples=2000, thin=c(1,100)) had run for about one week, I encountered this error (the same error?). I would appreciate any suggestions.

[1] "thin = 1; samples = 2000"
[1] "model = whole_mitogenome_tree_non_spatial_model"
Error in checkForRemoteErrors(val) :
  3 nodes produced errors; first error: Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Calls: sampleMcmc ... clusterApplyLB -> dynamicClusterApply -> checkForRemoteErrors
Execution halted

jarioksa commented 3 years ago

I am afraid I can't help. The message says that the problem is too large: some matrix of yours simply exceeds the capacity of the function. The error comes from the Matrix package, which is a Recommended (i.e., standard) R package, and it seems to come from this piece of its C code:

if (!ok || nrow > Int_max || ncol > Int_max || nzmax > Int_max)
    {
        ERROR (CHOLMOD_TOO_LARGE, "problem too large") ;
        return (NULL) ;
    }

saying there are too many rows, too many columns, too many non-zero values, or it is otherwise not ok. Here "too many" means that the integer maximum is exceeded. It also looks like ok only checks that there is no integer overflow.
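In R terms the relevant limit is presumably the ordinary 32-bit integer maximum (a rough illustration, not taken from the Hmsc code):

.Machine$integer.max        # 2147483647 = 2^31 - 1
sqrt(.Machine$integer.max)  # ~46341: a dense matrix with more than about
                            # 46341 x 46341 cells has more entries than the
                            # integer maximum and trips the check above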

As this error does not occur deterministically at the first sample, it is really difficult to know which matrix causes it.

jarioksa commented 3 years ago

There is one thing you could try: set a limit for the maximum number of latent factors in the random levels. You have a very large number of OTUs, and the number of latent factors can become very high; together these can produce very large matrices that fail in the Cholesky decomposition.

You can set the limit with the setPriors function for the random level object with, say, rL.date <- setPriors(rL.date, nfMax = 10), and then use this modified rL.date in your Hmsc() definition. If you have managed to run some preliminary runs, you can look at the number of latent factors there with dim(getPostEstimate(model, "Lambda")$mean). The first dimension is the number of latent factors, and the second is the number of OTUs. The number 10 that I suggested was just a wild guess, but you can experiment, and if you start with a lower number you can see whether this helps.
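Concretely, with the object names from your earlier messages the change would look something like this (a sketch; nfMax = 10 is only a starting guess and is applied to both random levels here):

## cap the number of latent factors in both random levels (nfMax is a guess)
rL.date    = setPriors(rL.date, nfMax = 10)
rL.spatial = setPriors(rL.spatial, nfMax = 10)

m1 = Hmsc(Y = Y, XData = X, XFormula = ~ temperature.month.mean + True_Altitude,
          phyloTree = mito.treepl, studyDesign = S,
          ranLevels = list(Site_ID = rL.spatial, date.circular = rL.date),
          distr = "lognormal poisson")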

pyspider commented 3 years ago

Many thanks for your help again. Here are the numbers of latent factors from the models of the preliminary runs (samples=10 and samples=100). Can the model increase the number of latent factors as sampling proceeds? So the test with samples=1000 failed in the Cholesky decomposition because of a large number of latent factors, right?

> load("models/model_whole_mitogenome_tree_chains_4_samples_10_thin_1")
> m<-models[[1]]
> dim(getPostEstimate(m, "Lambda")$mean)
[1]    2 2718
> load("models/model_whole_mitogenome_tree_traceback_chains_1_samples_100_thin_1")
> m<-models[[1]]
> dim(getPostEstimate(m, "Lambda")$mean)
[1]    5 2718 
jarioksa commented 3 years ago

The model can increase the number of latent factors as sampling goes on. The number of latent factors is found during the transient and fixed when sampling starts (after the transient). However, different chains can end up with different numbers of latent factors. The number of samples used to fix the number of latent factors is controlled by the argument adaptNf in sampleMcmc (it defaults to the length of the transient), and nfMax in the random level (setPriors.HmscRandomLevel) gives the maximum value each random level can reach.
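In code, the two knobs sit in different places; roughly (a sketch with placeholder numbers, m1 being a model defined with the nfMax-limited random levels):

## adaptNf: how long the number of latent factors is adapted (here shorter than the transient);
## nfMax (set via setPriors on the random levels) caps how high it can go
m = sampleMcmc(m1, samples = 1000, thin = 10, transient = 5000,
               adaptNf = rep(2000, m1$nr), nChains = 4)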

pyspider commented 3 years ago

Thanks for your help. Alternatively, I am trying to reduce the number of OTUs by combining them into higher classification levels (i.e., a smaller number of phylogenetic bins instead of the full set of OTUs). I will continue the hmsc tests once I have the reduced dataset, and I will try setting a limit on the maximum number of latent factors in the random levels if I still get the same error.