output csvs not found when using nlrx and clustermq on remote computing cluster

jaymwin commented 3 years ago

Hello,

I've been trying to get one of the example NetLogo models (Wolf Sheep) to run in parallel on our university's computing cluster before attempting the same for my own NetLogo model. I can run models with nlrx with no problem, but when I follow the clustermq example on the Advanced Configuration page, I get errors like this when I run the clustermq::Q() function:

Submitting 5 worker jobs (ID: nlrx) ...
Running 5 calculations (2 objs/0.3 Mb common; 1 calls/chunk) ...
Master: [2.1s 1.0% CPU]; Worker: [avg 75.3% CPU, max 286.4 Mb]                
Error in summarize_result(job_result, n_errors, n_warnings, cond_msgs,  : 
  5/5 jobs failed (0 warnings). Stopping.
(Error #1) Temporary output file /bsuscratch/jwiniarski/nlrx/temp/nlrx_seed_-581380792_row_1_1b6fefcb74e.csvnot found. On unix systems this can happen if the default system temp folder is used.
                Try reassigning the default temp folder for this R session (unixtools package).
(Error #2) Temporary output file /bsuscratch/jwiniarski/nlrx/temp/nlrx_seed_-581380792_row_2_1b78131ad8f3.csvnot found. On unix systems this can happen if the default system temp folder is used.
                Try reassigning the default temp folder for this R session (unixtools package).
(Error #3) Temporary output file /bsuscratch/jwiniarski/nlrx/temp/nlrx_seed_-581380792_row_3_1b82112baf79.csvnot found. On unix systems this can happen if the default system temp folder is used.
                Try reassigning the default temp folder for this R session (unixtools package).
(Error #4) Temporary output file /bsuscratch/jwiniarski/nlrx/temp/nlrx_seed_-5

I've tried changing the location of the temp directory, but end up with similar errors. I'm connecting to the university's computing cluster and running the following code using the command line (the terminal on my macbook):

# install unixtools
install.packages('unixtools', repos = 'http://www.rforge.net/')

# set java path
Sys.setenv(JAVA_HOME= "/usr/")

# Load packages
library(nlrx)
library(here)

# create output folder
dir.create(here::here('out'))

# create temp folder
dir.create(here::here('temp'))

# set netlogo, output, and model paths
netlogopath <- file.path("/cm/shared/apps/netlogo/6.1.1/")
outpath <- file.path(here::here("out"))
modelpath <- file.path(here::here("Wolf Sheep Predation_nlrx.nlogo"))

# latin hypercube simulation design ---------------------------------------------------------

# set up nl object
nl <- nl(
  nlversion = "6.1.1",
  nlpath = netlogopath,
  modelpath = modelpath,
  jvmmem = 1024 # memory
)

# create experiment
nl@experiment <- experiment(
  expname = "wolf-sheep",
  outpath = outpath,
  repetition = 1,
  tickmetrics = "true",
  idsetup = "setup", 
  idgo = "go",        
  runtime = 500,
  metrics = c(
    "count sheep", 
    "count wolves", 
    "count patches with [pcolor = green]"
  ),
  variables = list(
    "initial-number-sheep" = list(min = 50, max = 150, step = 10, qfun = "qunif"),
    "initial-number-wolves" = list(min = 50, max = 150, step = 10, qfun = "qunif"),
    "grass-regrowth-time" = list(min = 0, max = 100, step = 10, qfun = "qunif"),
    "sheep-gain-from-food" = list(min = 0, max = 50, step = 10, qfun = "qunif"),
    "wolf-gain-from-food" = list(min = 0, max = 100, step = 10, qfun = "qunif"),
    "sheep-reproduce" = list(min = 0, max = 20, step = 5, qfun = "qunif"),
    "wolf-reproduce" = list(min = 0, max = 20, step = 5, qfun = "qunif")
  ),
  constants = list(
    "model-version" = "\"sheep-wolves-grass\"",
    "show-energy?" = "false"
  )
)

# create latin hypercube parameter set
# just do a handful of simulations for this example
nl@simdesign <- simdesign_lhs(nl, samples = 5, nseeds = 1, precision = 3)

# does the model work before trying clustermq?
results_sequential <- run_nl_all(nl)

setsim(nl, "simoutput") <- results_sequential

# Write output to outpath of experiment within nl
write_simoutput(nl)

# now try in parallel ---------------------------------------------------------

library(clustermq)

# set up number of jobs
njobs <- nrow(nl@simdesign@siminput) * length(nl@simdesign@simseeds)

# Second, we generate vectors for looping through model runs.
# We generate a vector for siminputrows by repeating the sequence of parameterisations for each seed.
# Then, we generate a vector of random seeds by repeating each seed n times, where n is the number of siminputrows.
siminputrows <- rep(seq_len(nrow(nl@simdesign@siminput)), length(nl@simdesign@simseeds))
rndseeds <- rep(nl@simdesign@simseeds, each=nrow(nl@simdesign@siminput))

# Third, we define our simulation function
# Please adjust the path to the temporary file directory
simfun <- function(nl, siminputrow, rndseed, writeRDS=FALSE) {
  unixtools::set.tempdir(here::here('temp')) # what to set here for temp directory?
  library(nlrx)
  res <- run_nl_one(
    nl = nl, 
    siminputrow = siminputrow, 
    seed = rndseed, 
    writeRDS = writeRDS # pass the argument through instead of hard-coding TRUE
    )
  return(res)
}

# does the simfun work at all?
simfun(nl = nl, siminputrow = siminputrows[1], rndseed = rndseeds[1], writeRDS = TRUE)

# Fourth, use the Q function from the clustermq package to run the jobs on the HPC:
# The Q function loops through our siminputrows and rndseeds vectors.
# The function creates njobs jobs and distributes corresponding chunks of the input vectors to these jobs for executing the simulation function simfun.

# As constants we provide our nl object and the function parameter writeRDS.
# If writeRDS is TRUE, an *.rds file will be written on the HPC after each job has finished.
# This can be useful to recover results from completed runs after a crash has occurred.
results <- clustermq::Q(
  fun = simfun, 
  siminputrow = siminputrows,
  rndseed = rndseeds,
  const = list(
    nl = nl,
    writeRDS = TRUE
  ),
  export = list(), 
  seed = 42, 
  n_jobs = njobs, 
  template = list(
    job_name = "nlrx",
    log.file = "nlrx.log",
    queue = "bsudfq", # name of borah cluster queue
    service = "normal",
    walltime = "00:30:00",
    mem_cpu = "4000"
  )
)  

# The Q function reports the individual results of each job as a list
# Thus, we convert the list results to tibble format:
results <- dplyr::bind_rows(results)

I can't seem to figure out why those csvs can't be located...is this a clustermq issue?

Thanks, and thanks for creating this great R package,

Jay

bitbacchus commented 3 years ago

Dear Jay,

Your code is quite hard to read; you probably lost a couple of backticks (`) somewhere.

So, just to clarify:

install.packages('unixtools', repos = 'http://www.rforge.net/')
unixtools::set.tempdir("/bsuscratch/jwiniarski/nlrx/temp")

as suggested in #10 did not work?

Best, Sebastian

jaymwin commented 3 years ago

Hi Sebastian,

My apologies, the code is tidied up now. On the second point, I was able to install unixtools successfully on the cluster, but regardless of the path passed to unixtools::set.tempdir, we get errors similar to the one shown above.

bitbacchus commented 3 years ago

Hmm, I don't see anything super obvious here, so I am poking around a bit in the dark: can you log in to your cluster frontend node and double-check that you have the correct permissions for the temp folder?

# check permissions of the temp folder
ls -lad /bsuscratch/jwiniarski/nlrx/temp

# are there files in there?
ls -la /bsuscratch/jwiniarski/nlrx/temp
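
The same check can also be made from inside R, which is closer to what an nlrx worker process sees. The following is only a minimal sketch in base R; the helper name can_write_to is made up for illustration and is not part of nlrx, unixtools, or clustermq.

```r
# Hypothetical helper (not part of any package): returns TRUE if the current
# R process can create, read back, and delete a file inside `dir`.
can_write_to <- function(dir) {
  if (!dir.exists(dir)) {
    return(FALSE)
  }
  # Build a unique file name inside `dir`, similar to nlrx's temporary csv files
  probe <- tempfile(pattern = "nlrx_probe_", tmpdir = dir, fileext = ".csv")
  ok <- tryCatch({
    writeLines("probe", probe)
    identical(readLines(probe), "probe")
  }, error = function(e) FALSE, warning = function(w) FALSE)
  unlink(probe) # clean up; does nothing if the file was never created
  ok
}
```

Calling can_write_to("/bsuscratch/jwiniarski/nlrx/temp") once on the frontend node and once inside simfun should show whether both processes can actually use that directory.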

nldoc commented 3 years ago

Dear @jaymwin, thank you for reporting this issue, and thanks @bitbacchus for handling it. I had a look at your example code and it looks fine to me. I think the main problem here is the unspecific error message that I implemented in nlrx.

Let me explain: the error message you see is printed whenever nlrx tries to collect the output of a simulation but does not find the file. This can have many different causes. One of them is the temp folder problem, which is currently mentioned in the error message. Over time, however, I realized that this message can be misleading, because other problems that lead to the same error are far more common than the temp folder issue. It is also not possible to list all potential failures in the error message itself, because that would be too much to print to the console.

Thus, I decided to change the error message in the upcoming version of nlrx (I already implemented and pushed some changes to the GitHub version today). Instead of suggesting the temp folder approach, the error now only tells you that a) some simulation output was not found and b) the two possible reasons are that either the simulation did not start or the simulation crashed. c) It also suggests looking at ??run_nl_all() for further help. I added the following tips on debugging this error message to ??run_nl_all():

Debugging "Temporary simulation output file not found" error message:

Whenever this error message appears it means that the simulation did not produce any output. Two main reasons can lead to this problem, either the simulation did not even start or the simulation crashed during runtime. Both can happen for several reasons and here are some hints for debugging this:

Missing software: Make sure that java is installed and available from the terminal (java -version). Make sure that NetLogo is installed and available from the terminal.

Wrong path definitions: Make sure your nlpath points to a folder containing NetLogo. Make sure your modelpath points to a *.nlogo model file. Make sure that the nlversion within your nl object matches the NetLogo version of your nlpath. Use the convenience functions of nlrx for checking your nl object (print(nl), eval_variables_constants(nl)).

Temporary files cleanup: Due to automatic temp file cleanup on unix systems temporary output might be deleted. Try reassigning the default temp folder for this R session (the unixtools package has a neat function).

NetLogo runtime crashes: It can happen that your NetLogo model started but failed to produce output because of a NetLogo runtime error. Make sure your model is working correctly or track progress using print statements. Sometimes the java virtual machine crashes due to memory constraints.
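
The first two tips can be bundled into a quick pre-flight check. This is a rough sketch in base R under stated assumptions: check_nl_setup is a hypothetical helper, not an nlrx function, and it only tests whether the paths exist and whether a java executable is on the PATH. It does not test whether NetLogo actually starts.

```r
# Hypothetical helper (not part of nlrx): basic checks for the "missing
# software" and "wrong path definitions" tips. `nlpath` and `modelpath`
# are the values passed to nl().
check_nl_setup <- function(nlpath, modelpath) {
  c(
    java_on_path   = nzchar(Sys.which("java")[[1]]), # is a java executable found?
    nlpath_exists  = dir.exists(nlpath),             # NetLogo folder present?
    model_exists   = file.exists(modelpath),         # model file present?
    model_is_nlogo = grepl("\\.nlogo$", modelpath)   # expected file extension?
  )
}
```

With the objects from the original post this would be called as check_nl_setup(netlogopath, modelpath). Any FALSE entry points to the tip worth revisiting first.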

So in summary, the error message appears whenever the simulation wrote no output. This happens either when NetLogo did not start correctly or when NetLogo crashed. In your case, as you are using the Wolf Sheep model, I would assume the model itself runs correctly. So you probably have a problem related to Java, NetLogo, or the modelpath. Please verify that your path definitions are correct (in the context of the cluster executing your job).
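
One way to see the paths "in the context of the cluster" is to let a worker report its own environment. The function below is a sketch only: worker_env is a hypothetical name, and which of these values differ between the submitting session and a worker depends on how the cluster and scheduler are configured. Relative paths, here::here()-based paths, and environment variables such as JAVA_HOME may or may not carry over.

```r
# Hypothetical diagnostic function: report what the calling R process sees.
# `i` is only a dummy index so the function can be iterated over by a scheduler.
worker_env <- function(i, nlpath, modelpath) {
  list(
    index        = i,
    node         = Sys.info()[["nodename"]], # machine the process runs on
    working_dir  = getwd(),                  # base for relative paths
    temp_dir     = tempdir(),                # temp folder of this R session
    java_home    = Sys.getenv("JAVA_HOME"),  # empty string if not set
    nlpath_ok    = dir.exists(nlpath),       # NetLogo folder visible here?
    modelpath_ok = file.exists(modelpath)    # model file visible here?
  )
}
```

Running worker_env(0, nlpath = netlogopath, modelpath = modelpath) in the submitting session gives a baseline. Submitting the same function through clustermq::Q(fun = worker_env, i = 1:2, const = list(nlpath = netlogopath, modelpath = modelpath), n_jobs = 2, template = ...), with the template from the original post, returns one list per job that can be compared against that baseline.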

One thing you could try is to verify that NetLogo runs from the command line on your cluster. Just log in to a terminal on your cluster and try to start a NetLogo simulation manually, as explained in the NetLogo User Manual. If this works, chances are good that it also works from nlrx, as long as you define nlpath and modelpath correctly.

Hope this helps to track down the issue. nldoc

jaymwin commented 3 years ago

Thanks for elaborating on the possible issues underlying the error message; this is all very helpful. I believe the paths to NetLogo, Java, and the Wolf Sheep model are all set properly, as I can run simulations from the command line without clustermq.

My guess is that the error is specific to the cluster, and maybe to the temporary folder we are using there. @bitbacchus I assume I have permissions for the temp folder, but I have not tried to log in to the cluster frontend node yet. That is something I could try with the help of the university's research computing folks. If I figure the error out I'll be sure to mention it here.

Thanks, Jay

edmizac commented 1 year ago

Complementing this post...

I got the same error when running my model with run_nl_dyn(), but not with run_nl_all(). I circumvented this issue by running RStudio as admin (Windows 10).

Thanks,

ropensci/nlrx, issue #48