stan-dev / rstan

RStan, the R interface to Stan
https://mc-stan.org

rstan segfaults constantly #444

Open jpiironen opened 7 years ago

jpiironen commented 7 years ago

Summary:

I constantly run into segfaults when calling rstan programs from my shell scripts.

Description:

I'm running Stan programs on a computing cluster and need to do this by calling the R programs using shell scripts. I constantly run into segfaults when refitting a compiled model using different data. The example below shows a reproducible case (at least on my laptop).

Reproducible Steps:

Here's a toy Stan program called simple.stan:

data {
  int<lower=0> n;
  vector[n] y;
}

parameters {
  real<lower=0> sigma;
  real mu;
}

model {
  y ~ normal(mu, sigma);
}

The model above is being called repeatedly from simpletest.R:

library(rstan)

set.seed(1)
n <- 10
y <- rnorm(n)

for (j in 1:20) {
  print(sprintf('Fit number %d:', j))
  fit <- stan('simple.stan', data=list(n=n,y=y), cores=4)
}

The R-code is being called by the following shell script simplerun.sh:

Rscript simpletest.R

I run the shell script from the command line simply as source simplerun.sh. On my laptop, the program segfaults before refit number 8:

"Fit number 8:"

 *** caught segfault ***
address 0x7f663eac3780, cause 'memory not mapped'

Traceback:
 1: .Call(Module__get_class, pointer, name)
 2: .get_Module_Class(module, demangled_name, xp)
 3: Module(module, mustStart = TRUE)
 4: .getModulePointer(x)
 5: <S4 object of class "Module">$stan_fit4model351e44f524a8_simple
 6: eval(call("$", mod, paste("stan_fit4", model_cppname, sep = "")))
 7: eval(call("$", mod, paste("stan_fit4", model_cppname, sep = "")))
 8: object@mk_cppmodule(object)
 9: .local(object, ...)
10: (function (object, ...) {    standardGeneric("sampling")})(algorithm = "NUTS", chains = 1L, check_data = TRUE, control = NULL,     cores = 1L, data = list(n = 10L, y = c(-0.626453810742332,     0.183643324222082, -0.835628612410047, 1.59528080213779,     0.329507771815361, -0.820468384118015, 0.487429052428485,     0.738324705129217, 0.575781351653492, -0.305388387156356)),     diagnostic_file = NA, include = TRUE, init = "random", iter = 2000,     object = <S4 object of class "stanmodel">, open_progress = FALSE,     pars = NA, sample_file = NA, seed = 821171885L, show_messages = TRUE,     thin = 1, verbose = FALSE, warmup = 1000, check_unknown_args = FALSE,     chain_id = 1L)
11: (function (object, ...) {    standardGeneric("sampling")})(algorithm = "NUTS", chains = 1L, check_data = TRUE, control = NULL,     cores = 1L, data = list(n = 10L, y = c(-0.626453810742332,     0.183643324222082, -0.835628612410047, 1.59528080213779,     0.329507771815361, -0.820468384118015, 0.487429052428485,     0.738324705129217, 0.575781351653492, -0.305388387156356)),     diagnostic_file = NA, include = TRUE, init = "random", iter = 2000,     object = <S4 object of class "stanmodel">, open_progress = FALSE,     pars = NA, sample_file = NA, seed = 821171885L, show_messages = TRUE,     thin = 1, verbose = FALSE, warmup = 1000, check_unknown_args = FALSE,     chain_id = 1L)
12: do.call(rstan::sampling, args = .dotlist)
13: FUN(X[[i]], ...)
14: eval(expr, env)
15: doTryCatch(return(expr), name, parentenv, handler)
16: tryCatchOne(expr, names, parentenv, handlers[[1L]])
17: tryCatchList(expr, classes, parentenv, handlers)
18: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call)[1L]        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        msg <- conditionMessage(e)        sm <- strsplit(msg, "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && identical(getOption("show.error.messages"),         TRUE)) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
19: try(eval(expr, env), silent = TRUE)
20: sendMaster(try(eval(expr, env), silent = TRUE))
21: mcparallel(FUN(X[[i]], ...), name = names(X)[i], mc.set.seed = mc.set.seed,     silent = mc.silent)
22: FUN(X[[i]], ...)
23: lapply(seq_along(X), function(i) mcparallel(FUN(X[[i]], ...),     name = names(X)[i], mc.set.seed = mc.set.seed, silent = mc.silent))
24: parallel::mclapply(1:chains, FUN = callFun, mc.preschedule = FALSE,     mc.cores = min(chains, cores))
25: .local(object, ...)
26: sampling(sm, data, pars, chains, iter, warmup, thin, seed, init,     check_data = TRUE, sample_file = sample_file, diagnostic_file = diagnostic_file,     verbose = verbose, algorithm = match.arg(algorithm), control = control,     check_unknown_args = FALSE, cores = cores, open_progress = open_progress,     include = include, ...)
27: sampling(sm, data, pars, chains, iter, warmup, thin, seed, init,     check_data = TRUE, sample_file = sample_file, diagnostic_file = diagnostic_file,     verbose = verbose, algorithm = match.arg(algorithm), control = control,     check_unknown_args = FALSE, cores = cores, open_progress = open_progress,     include = include, ...)
28: stan("simple.stan", data = list(n = n, y = y), cores = 4)
An irrecoverable exception occurred. R is aborting now ...

The above gives me a segfault only when calling simpletest.R from the shell script; I have not encountered it when running the R code in RStudio. It also seems to be related to the use of multiple cores, because it does not segfault when I remove the argument cores=4.

Any help is greatly appreciated.

RStan Version:

2.15.1 (tested also on 2.16.2 but it segfaults too)

R Version:

3.4.1 (2017-06-30)

Operating System:

Ubuntu 16.04.2 LTS (Xenial Xerus) 64-bit

bgoodri commented 7 years ago

You need to install Rcpp (possibly to your home folder) with the same C++ compiler and flags as for your models.

jpiironen commented 7 years ago

Any pointers for instructions about how to do this?

bgoodri commented 7 years ago

usually just

install.packages("Rcpp")

if your ~/.R/Makevars file is already configured for Stan.

jpiironen commented 7 years ago

My ~/.R/Makevars is configured as instructed on this page https://github.com/stan-dev/rstan/wiki/Installing-RStan-on-Mac-or-Linux. I tried installing Rcpp again using install.packages but it does not remove the segfaults.

bgoodri commented 7 years ago

What compiler version?

jpiironen commented 7 years ago

g++ 5.4.0

bgoodri commented 7 years ago

Ahh. I overlooked that you were doing

Rscript simpletest.R

In that case, you have to add --default-packages=Rcpp or put library(Rcpp) into simpletest.R before calling stan().

jpiironen commented 7 years ago

I actually tried both, but to no effect; the problem still persists.

bgoodri commented 7 years ago

I can reproduce it on Ubuntu but I have no idea why it is happening. When cores > 1, it calls mclapply() which calls fork(), which apparently causes something bad to happen with some probability.
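
For readers unfamiliar with the forking step, here is a minimal base-R sketch of it. It only illustrates that mclapply() evaluates each element in a forked child process; it does not reproduce the crash. The arguments mirror the mclapply() call in the traceback, but the worker function is a made-up stand-in for the real per-chain call, and the example assumes a Unix-alike system (on Windows, mc.cores > 1 is an error).

```r
library(parallel)  # ships with R; mclapply() forks only on Unix-alikes

parent_pid <- Sys.getpid()

# One forked child per element, mirroring the mclapply() call in the
# traceback (mc.preschedule = FALSE). Each child reports its own process ID.
child_pids <- unlist(mclapply(1:4, function(i) Sys.getpid(),
                              mc.preschedule = FALSE, mc.cores = 2))

print(parent_pid)
print(child_pids)
```

None of the child process IDs should match the parent's, since every element runs in a forked copy of the R session, including whatever shared objects were loaded at fork time.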

jpiironen commented 7 years ago

Ok, good to hear that you can reproduce it at least.

avehtari commented 7 years ago

I get random failures also with the latest CmdStan using MatlabStan, but MatlabStan is not giving the whole error and it's random so I can't easily verify that it's the same problem...

syclik commented 7 years ago

I haven't seen CmdStan segfault. Do you have an example you can share? I can try to see if there's a way to reproduce it in CmdStan.

ahartikainen commented 7 years ago

Is this the same problem?

https://stackoverflow.com/questions/43050763/weird-segfault-in-r-when-using-mclapply-in-linux

bgoodri commented 7 years ago

Quite possibly

raerickson commented 6 years ago

I have the same problem when I use parallel rstan within a for loop. Would it be helpful if I post my Docker image and a reproducible example?

Here are my versions:

Docker Ubuntu: Linux a3a11dc4039c 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 13 10:46:25 EDT 2017 x86_64 GNU/Linux
R: version 3.4.3
rstan: rstan (Version 2.17.3, GitRev: 2e1f913d3ca3)

bgoodri commented 6 years ago

We have been able to reproduce it; we just don't know why it occurs with some low probability.

raerickson commented 6 years ago

@bgoodri Thank you for responding. For me, it occurs every time I try to run the rstan model on a cluster. It occurs around the 6th time through the loop.

bgoodri commented 6 years ago

Yeah, what we don't know is why it works the first 5 times and then eventually does not work.

ahartikainen commented 6 years ago

Hi, could you follow your process count?

Open in bash (prints process count and unixtime once every second):

watch -n1 "ps -e | wc -l && date +%s"

To also get the number of threads, the -T option could work on Linux (it did not work on macOS, at least):

watch -n1 "ps -e | wc -l && ps -e -T | wc -l && date +%s"

Also, did anyone try the answer in StackOverflow?

library(inline)
includes <- '#include <sys/wait.h>'
code <- 'int wstat; while (waitpid(-1, &wstat, WNOHANG) > 0) {};'
wait <- cfunction(body=code, includes=includes, convention='.C')

for (j in 1:20) {
    print(sprintf('Fit number %d:', j))
    fit <- stan('simple.stan', data=list(n=n,y=y), cores=4)
    wait()
}

raerickson commented 6 years ago

@ahartikainen I tried the StackOverflow wait fix and it did not work.

Also, I am happy to "watch" the output, but what do I need to install to get the watch command in bash? It's not part of my Docker image.

jpiironen commented 6 years ago

The StackOverflow suggestion did not fix this problem for me either, unfortunately.

xlirpu commented 5 years ago

When I run stan calls in a tight loop in the main R environment, I get the same problem. But wrapping the loop in a function and executing it allows me to use multiple cores without error.

sakrejda commented 5 years ago

I don't know if this issue is useful: on a cluster you can get segfaults because your program is not guaranteed to execute on the same machine as it did the first time and many clusters are made from a mix of machines so there's no guarantee you'll get something that's binary compatible. You either need to recompile every time, request from a homogeneous set of nodes, or something similar. There is some stuff about threads, some stuff about an R problem, etc... @bgoodri I'm suggesting we close this issue and try to get more specific details from current use (including the issue @xlirpu brings up if they can include more detail about the environment/model etc...)

jpiironen commented 5 years ago

To clarify, this problem is not related to running the thing on a cluster as it crashes also on my laptop (as I pointed out in my original post). I mentioned the cluster only because that was the reason I had to execute the program the way I described.

This program still gives the same segfault on my laptop with R version 3.4.4 and rstan version 2.18.2.

xlirpu commented 5 years ago

The problem appears during interactive R sessions launched in bash via ssh, so any thread migration should be on the same physical machine. I've worked up a minimal case showing that there is no problem when the stan call loop is wrapped in a function, but that it crashes when the loop is executed in R's global environment. This happens only when multiple cores are used. -update: added readable text file for windows-

minimal.R.crash.txt minimal.R.crash.wind.txt

sakrejda commented 5 years ago

@jpiironen thanks for the update/clarification, I missed the laptop bit.

@xlirpu would you be willing to try this with one call outside the loop to compile the model and then within the loop just call "rstan::sampling"?

xlirpu commented 5 years ago

@sakrejda The following ran without error (4 chains crisscrossing in the output so 4 threads I assume). Additionally I extracted the values of the fits and verified that they made sense.

library("Rcpp")
library("rstan")

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

Repeats <- 1000

N <- 10
data <- rnorm(n=N, mean=10, sd=0.1)

# compile if needed
model <- stan_model(file="normal.stan", model_name = "normal")

for (i in 1:Repeats) {
  print(paste("in loop", i))
  rstan::sampling(object=model, data = list(N=N, x=data))
}

sakrejda commented 5 years ago

@xlirpu thanks so much for running that! I'm running your examples now, I think the issue might be that if recompilation is triggered rstan might clobber a dll it's trying to use or something similar... I'll see.

sakrejda commented 5 years ago

tl;dr: @bgoodri when a fit is replaced by another fit under the same name with the same DSO path, StanBuild's finalize isn't called immediately. It's called when the GC collects the old fit. If that happens too late it'll wipe out the current 'fit' object's DSO b/c the DSO is specified by a path shared by the old and the new fit.

My rstan journey: this works fine if you do the loop and repeatedly call stan_model and sampling; it's only the single-call 'fit' version that fails.

Here's the bit that's (apparently) unsafe, since it causes the segfault, it's in mk_cppmodule:

 *** caught segfault ***
address 0x7efdf3587cc7, cause 'memory not mapped'

Traceback:
 1: .Call(Module__get_class, pointer, name)
 2: .get_Module_Class(module, demangled_name, xp)
 3: Module(module, mustStart = TRUE)
 4: .getModulePointer(x)
 5: <S4 object of class "Module">$stan_fit4model34655a3348f_normal
 6: eval(call("$", mod, paste("stan_fit4", model_cppname, sep = "")))
 7: eval(call("$", mod, paste("stan_fit4", model_cppname, sep = "")))
 8: object@mk_cppmodule(object)
 9: .local(object, ...)

If you recover from that segfault and poke it manually you get a segfault (it should just return NULL):

> str(fit@stanmodel@dso@.CXXDSOMISC$module$pointer)

 *** caught segfault ***
address 0x7efdf381d418, cause 'memory not mapped'

Traceback:
 1: .Call(list(name = "CppObject__finalize", address = <pointer: 0x55d734611a50>,     dll = list(name = "Rcpp", path = "/home/krzysztof/R/x86_64-pc-linux-gnu-library/3.4/Rcpp/libs/Rcpp.so",         dynamicLookup = TRUE, handle = <pointer: 0x55d735275fe0>,         info = <pointer: 0x55d7339817f0>), numParameters = 2L),     <pointer: 0x55d739fe71e0>, .pointer)
 2: x$.self$finalize()
 3: (function (x) x$.self$finalize())(<environment>)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

I still can't tell why the module becomes invalid. I wonder if Rcpp's module handling is really safe to call from a multi-core R program; maybe the whole thing would work if we set up the temporary directories ourselves per-core. I haven't explored that at all. One thing we could do is make mk_cppmodule check validity some other way without touching that pointer... but I don't have a solid way of doing that either.

... it bugs me that 'finalize' crops up in here since one of the things we're doing is replacing the 'fit' object it's not surprising that the 'finalize'

Turns out if you don't save the 'fit' object, so the call is just:

 stan(file = 'normal.stan', data = list(N=N, x=data))

rather than

fit <- stan(file = 'normal.stan', data = list(N=N, x=data))

you don't get a segfault so this is a GC-related thing that leads us to

setRefClass("StanBuild",
            contains = "VIRTUAL",
            fields = list(program.name = "character", 
                          shared.object = "character",
                          seed = "integer"),
            # in child classes there is one field for the C++ class and 
            # fields for each element of the data block in a Stan program
            methods = list(
finalize = function() {
  try(dyn.unload(.self$shared.object), silent = TRUE)
  return(invisible(NULL))
},

So when a fit is replaced by a fit under the same name with the same DSO path finalize isn't called immediately but only when the GC collects the old fit. If that happens too late it'll wipe out the current 'fit' object's DSO b/c the DSO is specified by a path.
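
The timing problem described above can be sketched in base R without rstan. This is only an illustration under stated assumptions: the `loaded` registry, `make_fit()`, and `unload_by_path()` are hypothetical stand-ins for loaded DSOs, a fit object, and the StanBuild finalize method; no real shared object is loaded or unloaded.

```r
# Hypothetical registry standing in for the set of loaded shared objects,
# keyed by file path (real code would use dyn.load()/dyn.unload()).
loaded <- new.env()
is_loaded <- function(path) exists(path, envir = loaded, inherits = FALSE)

# Finalizer in the spirit of StanBuild's finalize(): "unload" by path.
unload_by_path <- function(obj) {
  if (is_loaded(obj$path)) rm(list = obj$path, envir = loaded)
}

make_fit <- function(path) {
  assign(path, TRUE, envir = loaded)   # "load" the DSO at this path
  obj <- new.env()
  obj$path <- path
  reg.finalizer(obj, unload_by_path)   # runs only when the GC collects obj
  obj
}

path <- file.path(tempdir(), "model.so")  # made-up path; no file is created
old_fit <- make_fit(path)
new_fit <- make_fit(path)                 # same path, as with a reused model

before <- is_loaded(path)  # old_fit is still referenced: nothing finalized yet

rm(old_fit)                # stands in for overwriting `fit` in the loop
invisible(gc())            # collecting the old object runs its finalizer
after <- is_loaded(path)   # the entry that new_fit relies on is gone

print(c(before = before, after = after))
```

The point of the sketch is that the cleanup is keyed by path rather than by object, so a late finalizer for the old object removes what the new object still needs. In a real session the collector runs at unpredictable times, which would fit the observation that the crash appears only after several iterations.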

bob-carpenter commented 5 years ago

Awesome detective work! I think this may be behind other reported segfaults in RStan.

My own preference would be to deprecate the stan() function and have people use stan_model() and sampling() as standard operating procedure. I find it clearer to have explicit rather than implicit operations. On my Mac, I find that using stan() just tells me it's recompiling to avoid segfaults---it never actually reuses a model, so I've just stopped using it.

sakrejda commented 5 years ago

Thanks Bob, I'm just waiting to hear from @bgoodri how he'd like to fix it. Even with separate stan_model() and sampling() calls this same issue could be triggered so we still need a fix. If I'm right about this it could be fixed by generating a new path for the DSO each time a model is compiled and checking that it doesn't conflict with an existing path but Ben might be familiar with simpler solutions.

ahartikainen commented 5 years ago

PyStan adds a random string to the model name by default. Maybe RStan could do something similar?

sakrejda commented 5 years ago

Does PyStan generate a new string even if nothing about the model changes?

bgoodri commented 5 years ago

We don't have a lot of control over the compilation process because it is being handed off to inline::cxxfunction, which generates a random file in the temporary directory rather than taking a filepath argument. We have the avoid_crash function that doesn't dereference an invalid pointer, but maybe we need to be calling it in more circumstances.

sakrejda commented 5 years ago

We could switch to Rcpp::sourceCpp so that we can specify the cache directory.

bgoodri commented 5 years ago

I've been using Rcpp::sourceCpp in rstan3 for a while. It is better in some ways. But I am not sure yet how it would help in this situation. If we didn't try to avoid recompilation, then it wouldn't crash because it would just randomly get a different filename. But we want to avoid recompilation when we can, so we have to look up an existing path to the shared object. Somehow that is becoming unsafe when called in a loop due to GC timing.

sakrejda commented 5 years ago

The way it's becoming unsafe is that both the just-deleted fit and the new fit share a path to the same DSO, so when deletion of the old fit is finalized the new fit's DSO gets deleted. One solution is to separate the cache from the objects. A fit object could check if its model is in the cache using a hash and then copy it out if it finds it. Then when the fit is deleted it can delete its copy but not the copy in the cache. If you want to limit how much memory the cache takes then you can limit how many models it will hold and kick out the oldest model. Since fit objects keep their own copy that's not a problem; at worst it will trigger a recompile. The benefit of using sourceCpp is that it's easy to direct the outputs to subdirectories controlled by the model cache.
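
A rough base-R sketch of that cache idea follows. It is not how rstan works; `cache_dir`, `model_key()`, `private_copy()`, and `fake_compile()` are all hypothetical names, the "compiler" just writes the program text to a file, and eviction of the oldest model is left out for brevity.

```r
cache_dir <- tempfile("model_cache")  # hypothetical cache location
dir.create(cache_dir)

# Hash of the program text, used as the cache key (MD5 via the tools package).
model_key <- function(code) {
  f <- tempfile()
  on.exit(unlink(f))
  writeLines(code, f)
  unname(tools::md5sum(f))
}

# Return a private copy of the compiled model, compiling only on a cache miss.
# `compile` is a placeholder: function(code, out_path) that writes the binary.
private_copy <- function(code, compile) {
  cached <- file.path(cache_dir, paste0(model_key(code), ".so"))
  if (!file.exists(cached)) compile(code, cached)
  own <- tempfile(fileext = ".so")
  file.copy(cached, own)
  own
}

# Stand-in compiler: counts calls and writes the code to the output file.
compile_count <- 0L
fake_compile <- function(code, out_path) {
  compile_count <<- compile_count + 1L
  writeLines(code, out_path)
}

code <- "parameters { real mu; } model { mu ~ normal(0, 1); }"
a <- private_copy(code, fake_compile)
b <- private_copy(code, fake_compile)  # cache hit: no second compile

unlink(a)  # "finalizing" the first fit removes only its own copy
print(c(compiles = compile_count, b_still_there = file.exists(b)))
```

Because each fit owns a distinct file, cleaning up one fit cannot invalidate another, while the hash lookup still avoids recompiling an unchanged program.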

riddell-stan commented 5 years ago

@sakrejda PyStan uses the same DSO filename and model_name for the same model_code. Every model DSO gets a unique temporary directory though.

sakrejda commented 5 years ago

Ah ok, that would avoid the problem. That would be the simplest solution for rstan as well, but it would be a bummer to lose the caching.

ahartikainen commented 5 years ago

Yes, my mistake, I was thinking of module name.

Thanks for the correction.

omsai commented 5 years ago

For what it's worth, I was able to work around my memory not mapped issue by not changing the setting for options(mc.cores = ...). Thanks to @ahartikainen for helping me think along those lines; the suggested wait() also didn't work for me.

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)

Matrix products: default
BLAS: /gpfs/gpfs1/apps2/r/3.5.1-gcc540/lib64/R/lib/libRblas.so
LAPACK: /gpfs/gpfs1/apps2/r/3.5.1-gcc540/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] inline_0.3.15      tictoc_1.0         dplyr_0.8.0.1      readr_1.3.1
[5] rstan_2.18.2       StanHeaders_2.18.1 ggplot2_3.1.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0         magrittr_1.5       hms_0.4.2          tidyselect_0.2.5
 [5] munsell_0.5.0      colorspace_1.4-0   R6_2.4.0           rlang_0.3.1
 [9] plyr_1.8.4         parallel_3.5.1     pkgbuild_1.0.2     grid_3.5.1
[13] gtable_0.2.0       loo_2.0.0          cli_1.0.1          withr_2.1.2
[17] matrixStats_0.54.0 lazyeval_0.2.1     assertthat_0.2.0   tibble_2.0.1
[21] crayon_1.3.4       processx_3.2.1     gridExtra_2.3      purrr_0.3.1
[25] callr_3.1.1        ps_1.3.0           glue_1.3.1         compiler_3.5.1
[29] pillar_1.3.1       prettyunits_1.0.2  scales_1.0.0       stats4_3.5.1

AndersMoelbjerg commented 5 years ago

Any updates on this issue? We often experience the 'memory not mapped' error from CppObject__finalize. Our case is a little different, because our code uses rstan::log_prob and rstan::grad_log_prob on a fit object made with

fit = rstan::sampling(
    object  = object,
    data    = data,
    chains  = 0,
    ...
  )

However, it looks like the same fundamental issue is the cause.

idontgetoutmuch commented 4 years ago

I am getting this problem also and https://github.com/stan-dev/rstan/issues/444#issuecomment-472604654 did not solve it for me.

Adding a blank line to the Stan program to force recompilation did work.