Irrecoverable exceptions during worker startups

pat-s commented 4 years ago

I am spawning > 100 workers via drake.

For some of them, I consistently get the following:

worker log

```r
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C"
2: Setting LC_COLLATE failed, using "C"
3: Setting LC_TIME failed, using "C"
4: Setting LC_MESSAGES failed, using "C"
5: Setting LC_MONETARY failed, using "C"
6: Setting LC_PAPER failed, using "C"
7: Setting LC_MEASUREMENT failed, using "C"
> clustermq:::worker("tcp://edi:7848")
2019-11-16 11:19:49.663881 | Master: tcp://edi:7848
2019-11-16 11:19:49.880703 | WORKER_UP to: tcp://edi:7848
2019-11-16 11:19:51.230083 | > DO_SETUP (1.084s wait)
2019-11-16 11:19:51.234778 | token from msg: set_common_data_token
2019-11-16 11:20:45.251626 | > DO_CALL (5.778s wait)
.rgdal: version: 1.4-4, (SVN revision 833)
Geospatial Data Abstraction Library extensions to R successfully loaded
Loaded GDAL runtime: GDAL 2.4.2, released 2019/06/28
Path to GDAL shared files: /opt/spack/opt/spack/linux-centos7-x86_64/gcc-9.2.0/gdal-2.4.2-henhg265ta2tkqlfzzk3eukoet6sywpy/share/gdal
GDAL binary built with GEOS: TRUE
Loaded PROJ.4 runtime: Rel.
5.2.0, September 15th, 2018, [PJ_VERSION: 520] Path to PROJ.4 shared files: (autodetected) Linking to sp version: 1.3-1 Attaching package: 'signal' The following object is masked from 'package:raster': resample The following objects are masked from 'package:stats': filter, poly Attaching package: 'caret' The following object is masked from 'package:drake': progress ################################### This is hsdar 0.5.2 To get citation entry type 'citation("hsdar")' ################################### Attaching package: 'hsdar' The following object is masked from 'package:raster': nbands Attaching package: 'dplyr' The following object is masked from 'package:signal': filter The following objects are masked from 'package:raster': intersect, select, union The following objects are masked from 'package:stats': filter, lag The following objects are masked from 'package:base': intersect, setdiff, setequal, union Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0 Attaching package: 'purrr' The following object is masked from 'package:caret': lift Attaching package: 'glue' The following object is masked from 'package:dplyr': collapse The following object is masked from 'package:raster': trim Registered S3 method overwritten by 'R.oo': method from throw.default R.methodsS3 Attaching package: 'R.oo' The following object is masked from 'package:glue': trim The following objects are masked from 'package:raster': extend, trim The following object is masked from 'package:rgdal': getDescription The following object is masked from 'package:drake': check The following objects are masked from 'package:methods': getClasses, getMethods The following objects are masked from 'package:base': attach, detach, gc, load, save Attaching package: 'R.utils' The following objects are masked from 'package:signal': resample, unwrap The following objects are masked from 'package:raster': extract, resample The following object is masked from 'package:drake': evaluate The following object is masked from 
'package:utils': timestamp The following objects are masked from 'package:base': cat, commandArgs, getOption, inherits, isOpen, nullfile, parse, warnings Attaching package: 'future' The following object is masked from 'package:caret': cluster The following object is masked from 'package:raster': values The following object is masked from 'package:drake': plan Attaching package: 'magrittr' The following object is masked from 'package:R.utils': extract The following object is masked from 'package:R.oo': equals The following object is masked from 'package:purrr': set_names The following object is masked from 'package:raster': extract Attaching package: 'data.table' The following object is masked from 'package:purrr': transpose The following objects are masked from 'package:dplyr': between, first, last The following object is masked from 'package:raster': shift Attaching package: 'ParamHelpers' The following object is masked from 'package:R.utils': isVector The following object is masked from 'package:raster': getValues Attaching package: 'mlr' The following objects are masked from 'package:R.utils': resample, setThreshold The following object is masked from 'package:hsdar': train The following object is masked from 'package:caret': train The following object is masked from 'package:signal': resample The following object is masked from 'package:raster': resample Attaching package: 'BBmisc' The following objects are masked from 'package:R.utils': getRelativePath, insert, isDirectory, setValue The following object is masked from 'package:glue': collapse The following objects are masked from 'package:dplyr': coalesce, collapse The following object is masked from 'package:base': isFALSE Attaching package: 'checkmate' The following object is masked from 'package:R.utils': asInt Attaching package: 'smoof' The following objects are masked from 'package:R.oo': getDescription, getName The following object is masked from 'package:rgdal': getDescription Attaching package: 'emoa' 
The following object is masked from 'package:BBmisc': coalesce The following object is masked from 'package:dplyr': coalesce ## rgenoud (Version 5.8-3.0, Build Date: 2019-01-22) ## See http://sekhon.berkeley.edu/rgenoud for additional documentation. ## Please cite software as: ## Walter Mebane, Jr. and Jasjeet S. Sekhon. 2011. ## ``Genetic Optimization Using Derivatives: The rgenoud package for R.'' ## Journal of Statistical Software, 42(11): 1-26. ## Attaching package: 'gdalUtils' The following object is masked from 'package:sf': gdal_rasterize Attaching package: 'ggpubr' The following object is masked from 'package:raster': rotate here() starts at /home/patrick/papers/2019-feature-selection This is workflowr version 1.5.0 Run ?workflowr for help getting started Attaching package: 'survival' The following object is masked from 'package:future': cluster The following object is masked from 'package:caret': cluster Attaching package: 'igraph' The following object is masked from 'package:BBmisc': normalize The following object is masked from 'package:fs': path The following objects are masked from 'package:future': %->%, %<-% The following object is masked from 'package:R.oo': hierarchy The following objects are masked from 'package:purrr': compose, simplify The following objects are masked from 'package:dplyr': as_data_frame, groups, union The following object is masked from 'package:raster': union The following object is masked from 'package:drake': read_graph The following objects are masked from 'package:stats': decompose, spectrum The following object is masked from 'package:base': union Attaching package: 'mRMRe' The following object is masked from 'package:drake': target Attaching package: 'kernlab' The following object is masked from 'package:purrr': cross The following object is masked from 'package:ggplot2': alpha The following objects are masked from 'package:raster': buffer, rotated ... 
*** caught illegal operation *** address 0x28f71bd6, cause 'unknown' Traceback: 1: crossprod(m) 2: psmall.svd(m, tol) 3: fast.svd(xsw) 4: estimate.lambda(cbind(Ytrain, Xtrain), verbose = verbose) 5: care::carscore(Xtrain = data$data, Ytrain = data$target, verbose = FALSE, ...) 6: (function (task, nselect, ...) { data = getTaskData(task, target.extra = TRUE) y = care::carscore(Xtrain = data$data, Ytrain = data$target, verbose = FALSE, ...)^2 setNames(as.double(y), names(y))})(task = list(type = "regr", env = , weights = NULL, blocking = c(4L, 1L, 4L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 4L, 1L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 1L, 1L, 4L, 4L, 4L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 1L, 4800006.123, 4799983.79455825, 4790891, 4790878, 4790891, 4790884, 4790895, 4790847, 4790851, 4800018.549, 4799991.09, 4790850, 4790868, 4800003.949, 4799977.83623136, 4800011.192, 4790874, 4799992.928, 4790849, 4790872, 4790866, 4800007.214, 4790871, 4790887, 4800007.242, 4790898, 4799988.226, 4800018.534, 4790888, 4790865, 4800003.934, 4799997.893, 4790876, 4790854, 4799986.58674491, 4790859, 4790886, 4790896, 4799982.09930207, 4790898, 4800016.575, 4799977.46227779, 4790849, 4790884, 4799985.638, 4800012.07813886, 4800019.927, 4799990.505, 4799986.264, 4799976.86395208, 4790869, 4790891, 4790888, 4790878, 4800000.008, 4790856, 4799988.98004777, 4799981.655, 4790902, 4790863, 4790877, 4800011.524, 4800010.44, 4800008.156, 4799979.406, 4790846, 4799993.102, 4790869, 4790880, 4800018.0356031, 4790887, 4799984.56739563, 4800004.887, 4790884, 4790862, 4790866, 4799994.051, 4800015.855, 4790874, 4799988.058, 4799977.752, 4790891, 4790850, 4790842, 4799984.141, 4799986.58674491, 4799983.097, 4799975.7670216, 4800012.06, 4790854, 4799997.63084037, 4800016.885, 4799980.3541854, 4799982.854, 4790884, 4799999.899, 4799974.81967256, 4790842, 4799980.05502255, 4800007.623, 4790868, 4799978.70878969, 4799994.247, 4800003.127, 4799996.069, 4790826, 4799994.451, 4790854, 4790900, 
4799995.299, 4799995.211, 4790863, 4799991.2, 4790875, 4799996.069, 4790835, 4790856, 4790830, 4790846, 4800022.44714308, 4790881, 4799988.43158253, 4790835, 4790896, 4800008.766, 4799979.70599921, 4800006.492, 4800008.629, 4800004.358, 4790844, 4790891, 4790877, 4790862, 4790891, 4799995.204, 4799996.65856108, 4800001.712, 4800012.283, 4790897, 4790860, 4790861, 4799977.06339398, 4790837, 4799979.568 )), task.desc = list(id = "defoliation-all-plots-HR", type = "regr", target = "defoliation", size = 1008L, n.feat = c(numerics = 122L, factors = 0L, ordered = 0L, functionals = 0L), has.missings = FALSE, has.weights = FALSE, has.blocking = TRUE, has.coordinates = TRUE)), nselect = 122L) 7: do.call(x$fun, c(list(task = task, nselect = nselect), more.args[[x$name]])) 8: FUN(X[[i]], ...) 9: lapply(filter, function(x) { x = do.call(x$fun, c(list(task = task, nselect = nselect), more.args[[x$name]])) missing.score = setdiff(fn, names(x)) x[missing.score] = NA_real_ x[match(fn, names(x))]}) 10: `_f`(task = task, method = method, nselect = nselect, ... = ..., more.args = more.args) 11: withVisible(`_f`(task = task, method = method, nselect = nselect, ... = ..., more.args = more.args)) 12: generateFVData(task = task, method = method, nselect = getTaskNFeats(task), ...) 13: (function (task, method = "randomForestSRC_importance", fval = NULL, perc = NULL, abs = NULL, threshold = NULL, mandatory.feat = NULL, select.method = NULL, base.methods = NULL, cache = FALSE, ...) { assertClass(task, "SupervisedTask") if (is.list(base.methods)) { base.methods = as.character(base.methods) } assertChoice(method, choices = append(ls(.FilterRegister), ls(.FilterEnsembleRegister))) if (method %in% ls(.FilterEnsembleRegister) && !is.null(base.methods)) { if (length(base.methods) == 1) { warningf("You only passed one base filter method to an ensemble filter. 
Please use at least two base filter methods to have a voting effect.") } method = list(method, base.methods) } select = checkFilterArguments(perc, abs, threshold) p = getTaskNFeats(task) nselect = switch(select, perc = round(perc * p), abs = min(abs, p), threshold = p) if (is.null(fval)) { if (!isFALSE(cache)) { requirePackages("memoise", why = "caching of filter features", default.method = "load") if (is.character(cache)) { assertString(cache) if (!dir.exists(cache)) { dir.create(cache, recursive = TRUE) } cache.dir = cache } else { assertFlag(cache) if (!dir.exists(rappdirs::user_cache_dir("mlr", "mlr-org"))) { dir.create(rappdirs::user_cache_dir("mlr", "mlr-org")) } cache.dir = rappdirs::user_cache_dir("mlr", "mlr-org") } cache.dir = memoise::cache_filesystem(cache.dir) generateFVData = memoise::memoise(generateFilterValuesData, cache = cache.dir) } else { generateFVData = generateFilterValuesData } Sys.sleep(sample(seq(3, 6, 0.1), 1)) fval = generateFVData(task = task, method = method, nselect = getTaskNFeats(task), ...)$data } else { assertClass(fval, "FilterValues") if (!is.null(fval$method)) { colnames(fval$data)[which(colnames(fval$data) == "val")] = fval$method method = fval$method fval = fval$data[, c(1, 3, 2)] } else { methods = unique(fval$data$method) if (length(methods) > 1) { assert(method %in% methods) } else { method = methods fval = fval$data } } } if (all(is.na(fval$value))) { stopf("Filter method returned all NA values!") } if (!is.null(mandatory.feat)) { assertCharacter(mandatory.feat) if (!all(mandatory.feat %in% fval$name)) { stop("At least one mandatory feature was not found in the task.") } if (select != "threshold" && nselect < length(mandatory.feat)) { stop("The number of features to be filtered cannot be smaller than the number of mandatory features.") } fval[fval$name %in% mandatory.feat, "value"] = Inf } if (select == "threshold") { nselect = sum(fval[["value"]] >= threshold, na.rm = TRUE) } if (length(levels(as.factor(fval$method))) 
>= 2) { if (is.null(select.method) && !method[[1]] %in% ls(.FilterEnsembleRegister)) { stopf("You supplied multiple filters. Please choose which should be used for the final subsetting of the features.") } if (is.null(select.method)) { fval = fval[fval$method == fval$method, ] } else { assertSubset(select.method, choices = unique(fval$method)) fval = fval[fval$method == select.method, ] } } if (nselect > 0L) { features = fval[with(fval, order(method, -value)), ] features = features[1:nselect, ]$name } else { features = NULL } allfeats = getTaskFeatureNames(task) j = match(features, allfeats) features = allfeats[sort(j)] subsetTask(task, features = features)})(task = list(type = "regr", env = , weights = NULL, blocking = c(4L, 1L, 4L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 4L, 1L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 1L, 1L, 4L, 4L, 4L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 1L, 4L, 4L, 4L, 4L, 1L, 4L, 1L, 4L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 4L, 1L, 4L, 4L, 4L, 1L, 4L, 1L, 1L, 4L, 1L, 4L, 1L, 4L, 4L, 4L, 4L, 1L, 4L, 1L, 1L, 4L, 1L, 1L, 4L, 1L, 1L, 4L, 1L, 1L, 4L, 4L, 4L, 1L, 4L, 1L, 1L, 4L, 1L, 4L, 1L, 4L, 4L, 4L, 1L, 1L, 4L, 1L, 4L, 1L, 1L, 4L, 1L, 4L, 1L, 4790897, 4799993.102, 4800002.857, 4790893, 4790876, 4800006.97685815, 4800013.6617686, 4790858, 4771426.845, 4799998.54, 4800010.219, 4790877, 4799992.057, 4790882, 4771418.434, 4790890, 4771432.395, 4799984.141, 4799998.412, 4790847, 4771422.806, 4799981.30153445, 4771413.558, 4790891, 4771432.742, 4800016.575, 4771412.433, 4790894, 4790863, 4799987.974, 4790891, 4790858, 4771456.45427037, 4800021.148, 4800000.14879441, 4799978.70878969, 4800011.11809915, 4799975.663, 4790860, 4771415.288, 4800003.956, 4771435.643, 4790869, 4790860, 4771445.658, 4771430.317, 4799994.439, 4790842, 4799999.08, 4799991.757, 4790866, 4800018.554, 4790861, 4771417.675, 4771430.011, 4790897, 4790886, 4790878, 4800012.04043339, 4799999.567, 4790835, 4790861, 4790895, 4800000.008, 4800008.09, 4800003.949, 4771468.40790618, 
4799983.02172087, 4790846, 4790889, 4771464.51309507, 4790900, 4799995.636, 4771414.516, 4800007.153, 4790871, 4771439.091, 4799983.024, 4799988.907, 4799977.8611616, 4790838, 4799980.22953421, 4799992.469, 4800013.169, 4790852, 4790891, 4799972.42636971, 4800006.179, 4790875, 4771426.869, 4799984.953, 4790894, 4790891, 4800004.1376325, 4799991.07418776, 4799985.967, 4771404.083, 4771408.753, 4799987.032, 4799998.65499159, 4799983.745, 4790882, 4800006.459, 4800015.855, 4799986.016, 4771442.523, 4790904, 4800008.57521487, 4790851, 4799989.306, 4800014.49128894, 4800002.174, 4790876, 4799988.058, 4771406.11, 4790891, 4790884, 4790851, 4771429.776, 4790878, 4771404.55, 4771400.163, 4799993.614, 4790901, 4790880, 4790898, 4771414.018, 4790891, 4800010.068, 4799984.56739563, 4799994.247, 4790872, 4790831, 4800007.015, 4771431.37, 4771410.39729718, 4790874, 4800007.06331137, 4799986.58674491, 4799989.858, 4800007.355, 4790859, 4799974.81967256, 4790895, 4771465.98446815, 4771414.658, 4799999.362, 4790868, 4790894, 4799986.768, 4771410.234, 4790891, 4799991.143, 4790855, 4800000.97149227, 4790884, 4799976.8140916, 4790886, 4799987.251, 4800001.554, 4790840, 4771420.73, 4800005.23456297, 4800013.128, 4800002.602, 4771418.447, 4790847, 4800014.842, 4790860, 4790861, 1648L, 1474L, 1276L, 868L, 848L, 1732L, 1635L, 1161L, 924L, 1067L, 565L, 1204L, 1465L, 735L, 1619L, 1422L, 1352L, 1565L, 715L, 1046L, 1378L, 1366L, 1467L, 869L, 1628L, 885L, 651L, 526L, 1326L, 898L, 596L, 1298L, 1723L, 1739L, 1080L, 641L, 1381L, 824L, 930L, 1499L, 484L, 788L, 509L, 1092L, 718L, 1143L, 1138L, 536L, 644L, 1001L)), test.inds = list( 480:930, 931:1230, 1231:1759, 1:479), group = integer(0))), measures = list(list(id = "rmse", minimize = TRUE, properties = c("regr", "req.pred", "req.truth"), fun = function (task, model, pred, feats, extra.args) { measureRMSE(pred$data$truth, pred$data$response) }, extra.args = list(), best = 0, worst = Inf, name = "Root mean squared error", note = "The RMSE is 
aggregated as sqrt(mean(rmse.vals.on.test.sets^2)). If you don't want that, you could also use `test.mean`.", aggr = list(id = "test.mean", name = "Test mean", fun = function (task, perf.test, perf.train, measure, group, pred) mean(perf.test), properties = "req.test")), list(id = "rsq", minimize = FALSE, properties = c("regr", "req.pred", "req.truth"), fun = function (task, model, pred, feats, extra.args) { measureRSQ(pred$data$truth, pred$data$response) }, extra.args = list(), best = 1, worst = -Inf, name = "Coefficient of determination", note = "Also called R-squared, which is 1 - residual_sum_of_squares / total_sum_of_squares.", aggr = list(id = "test.mean", name = "Test mean", fun = function (task, perf.test, perf.train, measure, group, pred) mean(perf.test), properties = "req.test")), list(id = "expvar", minimize = FALSE, properties = c("regr", "req.pred", "req.truth"), fun = function (task, model, pred, feats, extra.args) { measureEXPVAR(pred$data$truth, pred$data$response) }, extra.args = list(), best = 1, worst = 0, name = "Explained variance", note = "Similar to measure rsq (R-squared). 
Defined as explained_sum_of_squares / total_sum_of_squares.", aggr = list(id = "test.mean", name = "Test mean", fun = function (task, perf.test, perf.train, measure, group, pred) mean(perf.test), properties = "req.test"))), keep.pred = TRUE, models = FALSE, show.info = TRUE, keep.extract = FALSE) 60: mapply(fun2, ..., MoreArgs = more.args, SIMPLIFY = FALSE, USE.NAMES = FALSE) 61: parallelMap(benchmarkParallel, task = grid$task, learner = grid$learner, more.args = list(learners = learners, tasks = tasks, resamplings = resamplings, measures = measures, keep.pred = keep.pred, models = models, show.info = show.info, keep.extract = keep.extract), level = plevel) 62: benchmark(learners = learner, tasks = task, models = FALSE, keep.pred = TRUE, resamplings = makeResampleDesc("CV", fixed = TRUE), show.info = TRUE, measures = list(setAggregation(rmse, test.mean), setAggregation(rsq, test.mean), setAggregation(expvar, test.mean))) 63: benchmark_custom_no_models(learner = svm_carscore_mbo, task = hr_task) 64: eval(quote(benchmark_custom_no_models(learner = svm_carscore_mbo, task = hr_task)), new.env()) 65: eval(quote(benchmark_custom_no_models(learner = svm_carscore_mbo, task = hr_task)), new.env()) 66: eval(expr, p) 67: eval(expr, p) 68: eval.parent(substitute(eval(quote(expr), envir))) 69: local(benchmark_custom_no_models(learner = svm_carscore_mbo, task = hr_task)) 70: eval(expr = tidy_expr, envir <- config$eval) 71: eval(expr = tidy_expr, envir <- config$eval) 72: withCallingHandlers(eval(expr = tidy_expr, envir <- config$eval), error = capture_calls) 73: doTryCatch(return(expr), name, parentenv, handler) 74: tryCatchOne(expr, names, parentenv, handlers[[1L]]) 75: tryCatchList(expr, classes, parentenv, handlers) 76: tryCatch(withCallingHandlers(eval(expr = tidy_expr, envir <- config$eval), error = capture_calls), error = identity) 77: with_call_stack(target = target, config = config) 78: withCallingHandlers(value <- with_call_stack(target = target, config = config), 
warning = function(w) { config$logger$minor(paste("Warning:", w$message), target = target) warnings <<- c(warnings, w$message) invokeRestart("muffleWarning")}, message = function(m) { msg <- gsub(pattern = "\n$", replacement = "", x = m$message) config$logger$minor(msg, target = target) messages <<- c(messages, msg) invokeRestart("muffleMessage")}) 79: with_handling(target = target, meta = meta, config = config) 80: with_preserve_seed({ set.seed(seed) code}) 81: with_seed(meta$seed, with_handling(target = target, meta = meta, config = config)) 82: eval(expr, envir = envir) 83: eval(expr, envir = envir) 84: with_timeout(with_seed(meta$seed, with_handling(target = target, meta = meta, config = config)), cpu = timeouts[["cpu"]], elapsed = timeouts[["elapsed"]]) 85: with_seed_timeout(target = target, meta = meta, config = config) 86: try_build(target = target, meta = meta, config = config) 87: drake::cmq_build(target = target, meta = meta, deps = deps, layout = layout, config = config) 88: eval(msg$expr, envir = msg$env) 89: eval(msg$expr, envir = msg$env) 90: doTryCatch(return(expr), name, parentenv, handler) 91: tryCatchOne(expr, names, parentenv, handlers[[1L]]) 92: tryCatchList(expr, classes, parentenv, handlers) 93: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L sm <- strsplit(conditionMessage(e), "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") if (w > LONG) prefix <- paste0(prefix, "\n ") } else prefix <- "Error : " msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L])) if (!silent && isTRUE(getOption("show.error.messages"))) { cat(msg, file = outFile) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = 
e))})
94: try(eval(msg$expr, envir = msg$env))
95: clustermq:::worker("tcp://edi:7848")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/slurmd/job10023/slurm_script: line 13: 13741 Illegal instruction CMQ_AUTH=znxjm R -q --no-save --no-restore -e 'clustermq:::worker("tcp://edi:7848")'
```

These exceptions seem to occur randomly: not on a specific partition, not for specific targets.

slurm template:

#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --partition=all
#SBATCH --output={{ log_file | /dev/null }} # you can add .%a for array index
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --cpus-per-task={{ n_cpus }}
#SBATCH --mem={{ memory }}
#SBATCH --array=1-{{ n_jobs }}

cd /home/patrick/papers/2019-feature-selection/
module load r-3.6.1-gcc-9.2.0-jzptgax pandoc-2.7.3-gcc-9.2.0-7hxzwvt

CMQ_AUTH={{ auth }} R -q --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

template arg of drake:

template = list(log_file = "log/worker%a.log", n_cpus = 1,
  memory = 2000, job_name = "paper2"),

As you can see I am not using internal parallelization and every worker just operates sequentially.

clustermq v0.8.8

mschubert commented 4 years ago

This looks like an error in compiled code on your cluster and not something that clustermq is responsible for.

Most likely:

1. Your R package was compiled on a node with a certain CPU instruction set enabled.
2. With many nodes in use, some of your calculations were sent to a node that does not support this instruction set.
3. Trying to execute the compiled package function there crashes the process with an illegal instruction (the "caught illegal operation" in your log).

To confirm, separate your log files per worker (%a on slurm), and check /proc/cpuinfo for a node that works vs. a node that fails.
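One way to do the /proc/cpuinfo comparison is sketched below. It assumes you have saved a copy of /proc/cpuinfo from one working and one failing node; the function names and file names are made up for illustration and are not part of clustermq or Slurm.

```shell
#!/bin/sh
# Print the CPU flags found in a saved copy of /proc/cpuinfo, one per line, sorted.
# Only the first "flags" line is read; the cores of one node normally report the same set.
cpu_flags() {
  grep -m1 '^flags' "$1" | cut -d: -f2 | tr ' ' '\n' | sed '/^$/d' | sort -u
}

# Print the flags that the working node ($1) has and the failing node ($2) lacks.
missing_flags() {
  good_list=$(mktemp) && bad_list=$(mktemp) || return 1
  cpu_flags "$1" > "$good_list"
  cpu_flags "$2" > "$bad_list"
  comm -23 "$good_list" "$bad_list"
  rm -f "$good_list" "$bad_list"
}

# Hypothetical usage: on each node run
#   cat /proc/cpuinfo > "cpuinfo-$(hostname).txt"
# and then compare the two files:
#   missing_flags cpuinfo-goodnode.txt cpuinfo-badnode.txt
```

If the output lists vector extensions (for example avx2 or fma) that the failing node lacks, that would be consistent with the explanation above, although it does not prove which library uses them.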

Please reopen if you think clustermq is still somehow responsible.

pat-s commented 4 years ago

Thanks. In the meantime I found out that it might be related to OpenBLAS, which was compiled on the main node and appears to be incompatible with the CPU architecture of some nodes for certain numeric calculations.

I am back to using the default R BLAS/LAPACK for compatibility, even though this slows things down a bit.
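For anyone checking the same thing, here is a small sketch for seeing which BLAS/LAPACK a given R installation is actually using. The function name `r_blas_info` is made up; `extSoftVersion()` and `La_library()` are base R functions, and the sketch assumes `Rscript` from the loaded R module is on the PATH.

```shell
#!/bin/sh
# Print the BLAS and LAPACK libraries used by the R installation found on PATH.
# Returns non-zero if Rscript is not available.
r_blas_info() {
  if ! command -v Rscript >/dev/null 2>&1; then
    echo "Rscript not found on PATH" >&2
    return 1
  fi
  # extSoftVersion()["BLAS"] and La_library() report the library files in use.
  Rscript -e 'print(extSoftVersion()["BLAS"]); print(La_library())'
}

# Hypothetical usage inside the Slurm template, after the "module load" line:
#   r_blas_info
```

Calling it from the worker script would write the result into each worker log, which makes it easier to see whether every node picks up the same library.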

mschubert / clustermq

Irrecoverable exceptions during worker startups #180