pat-s closed this issue 4 years ago
This looks like an error in compiled code on your cluster and not something that `clustermq` is responsible for.
Most likely:
To confirm, separate your log files per worker (`%a` on slurm), and check `/proc/cpuinfo` for a node that works vs. a node that fails.
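A minimal sketch of the comparison suggested above, assuming x86 nodes and that `/proc/cpuinfo` from one working and one failing node has been saved to two files. The helper names `cpu_flags` and `missing_flags` and the file names in the usage comment are hypothetical. It prints the CPU feature flags that the working node has and the failing node lacks; a missing instruction-set flag would be consistent with an "illegal operation" crash in code compiled for a newer CPU.

```shell
#!/bin/sh
# Sketch: compare CPU feature flags between two saved /proc/cpuinfo dumps.
# Assumes x86 dumps, where the feature line starts with "flags".

# Print the flags of the first CPU in a cpuinfo dump, one per line, sorted.
cpu_flags() {
  awk -F: '/^flags/ { print $2; exit }' "$1" | tr -s ' ' '\n' | sed '/^$/d' | sort -u
}

# Print flags present in $1 (working node) but absent from $2 (failing node).
missing_flags() {
  bad_list=$(mktemp)
  cpu_flags "$2" > "$bad_list"
  cpu_flags "$1" | comm -23 - "$bad_list"
  rm -f "$bad_list"
}

# Usage (hypothetical file names):
#   missing_flags good_node.cpuinfo bad_node.cpuinfo
```

An empty result means the failing node supports every flag of the working node, in which case the CPU architecture is probably not the cause.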
Please reopen if you think `clustermq` is still somehow responsible.
Thanks.
Meanwhile I found out that it might be related to `openblas`, which was compiled on the main node and is not compatible with the architecture of some nodes in certain numeric calculations.
I am back to using the default R BLAS/LAPACK for compatibility, even though this slows things down a bit.
I am spawning > 100 workers via drake.
For some, I constantly get the following
worker log
```r
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C"
2: Setting LC_COLLATE failed, using "C"
3: Setting LC_TIME failed, using "C"
4: Setting LC_MESSAGES failed, using "C"
5: Setting LC_MONETARY failed, using "C"
6: Setting LC_PAPER failed, using "C"
7: Setting LC_MEASUREMENT failed, using "C"
> clustermq:::worker("tcp://edi:7848")
2019-11-16 11:19:49.663881 | Master: tcp://edi:7848
2019-11-16 11:19:49.880703 | WORKER_UP to: tcp://edi:7848
2019-11-16 11:19:51.230083 | > DO_SETUP (1.084s wait)
2019-11-16 11:19:51.234778 | token from msg: set_common_data_token
2019-11-16 11:20:45.251626 | > DO_CALL (5.778s wait)
.rgdal: version: 1.4-4, (SVN revision 833)
 Geospatial Data Abstraction Library extensions to R successfully loaded
 Loaded GDAL runtime: GDAL 2.4.2, released 2019/06/28
 Path to GDAL shared files: /opt/spack/opt/spack/linux-centos7-x86_64/gcc-9.2.0/gdal-2.4.2-henhg265ta2tkqlfzzk3eukoet6sywpy/share/gdal
 GDAL binary built with GEOS: TRUE
 Loaded PROJ.4 runtime: Rel. 5.2.0, September 15th, 2018, [PJ_VERSION: 520]
 Path to PROJ.4 shared files: (autodetected)
 Linking to sp version: 1.3-1

Attaching package: 'signal'
The following object is masked from 'package:raster': resample
The following objects are masked from 'package:stats': filter, poly

Attaching package: 'caret'
The following object is masked from 'package:drake': progress

###################################
This is hsdar 0.5.2
To get citation entry type 'citation("hsdar")'
###################################

Attaching package: 'hsdar'
The following object is masked from 'package:raster': nbands

Attaching package: 'dplyr'
The following object is masked from 'package:signal': filter
The following objects are masked from 'package:raster': intersect, select, union
The following objects are masked from 'package:stats': filter, lag
The following objects are masked from 'package:base': intersect, setdiff, setequal, union

Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0

Attaching package: 'purrr'
The following object is masked from 'package:caret': lift

Attaching package: 'glue'
The following object is masked from 'package:dplyr': collapse
The following object is masked from 'package:raster': trim

Registered S3 method overwritten by 'R.oo':
  method        from
  throw.default R.methodsS3

Attaching package: 'R.oo'
The following object is masked from 'package:glue': trim
The following objects are masked from 'package:raster': extend, trim
The following object is masked from 'package:rgdal': getDescription
The following object is masked from 'package:drake': check
The following objects are masked from 'package:methods': getClasses, getMethods
The following objects are masked from 'package:base': attach, detach, gc, load, save

Attaching package: 'R.utils'
The following objects are masked from 'package:signal': resample, unwrap
The following objects are masked from 'package:raster': extract, resample
The following object is masked from 'package:drake': evaluate
The following object is masked from 'package:utils': timestamp
The following objects are masked from 'package:base': cat, commandArgs, getOption, inherits, isOpen, nullfile, parse, warnings

Attaching package: 'future'
The following object is masked from 'package:caret': cluster
The following object is masked from 'package:raster': values
The following object is masked from 'package:drake': plan

Attaching package: 'magrittr'
The following object is masked from 'package:R.utils': extract
The following object is masked from 'package:R.oo': equals
The following object is masked from 'package:purrr': set_names
The following object is masked from 'package:raster': extract

Attaching package: 'data.table'
The following object is masked from 'package:purrr': transpose
The following objects are masked from 'package:dplyr': between, first, last
The following object is masked from 'package:raster': shift

Attaching package: 'ParamHelpers'
The following object is masked from 'package:R.utils': isVector
The following object is masked from 'package:raster': getValues

Attaching package: 'mlr'
The following objects are masked from 'package:R.utils': resample, setThreshold
The following object is masked from 'package:hsdar': train
The following object is masked from 'package:caret': train
The following object is masked from 'package:signal': resample
The following object is masked from 'package:raster': resample

Attaching package: 'BBmisc'
The following objects are masked from 'package:R.utils': getRelativePath, insert, isDirectory, setValue
The following object is masked from 'package:glue': collapse
The following objects are masked from 'package:dplyr': coalesce, collapse
The following object is masked from 'package:base': isFALSE

Attaching package: 'checkmate'
The following object is masked from 'package:R.utils': asInt

Attaching package: 'smoof'
The following objects are masked from 'package:R.oo': getDescription, getName
The following object is masked from 'package:rgdal': getDescription

Attaching package: 'emoa'
The following object is masked from 'package:BBmisc': coalesce
The following object is masked from 'package:dplyr': coalesce

## rgenoud (Version 5.8-3.0, Build Date: 2019-01-22)
## See http://sekhon.berkeley.edu/rgenoud for additional documentation.
## Please cite software as:
## Walter Mebane, Jr. and Jasjeet S. Sekhon. 2011.
## ``Genetic Optimization Using Derivatives: The rgenoud package for R.''
## Journal of Statistical Software, 42(11): 1-26.
##

Attaching package: 'gdalUtils'
The following object is masked from 'package:sf': gdal_rasterize

Attaching package: 'ggpubr'
The following object is masked from 'package:raster': rotate

here() starts at /home/patrick/papers/2019-feature-selection
This is workflowr version 1.5.0
Run ?workflowr for help getting started

Attaching package: 'survival'
The following object is masked from 'package:future': cluster
The following object is masked from 'package:caret': cluster

Attaching package: 'igraph'
The following object is masked from 'package:BBmisc': normalize
The following object is masked from 'package:fs': path
The following objects are masked from 'package:future': %->%, %<-%
The following object is masked from 'package:R.oo': hierarchy
The following objects are masked from 'package:purrr': compose, simplify
The following objects are masked from 'package:dplyr': as_data_frame, groups, union
The following object is masked from 'package:raster': union
The following object is masked from 'package:drake': read_graph
The following objects are masked from 'package:stats': decompose, spectrum
The following object is masked from 'package:base': union

Attaching package: 'mRMRe'
The following object is masked from 'package:drake': target

Attaching package: 'kernlab'
The following object is masked from 'package:purrr': cross
The following object is masked from 'package:ggplot2': alpha
The following objects are masked from 'package:raster': buffer, rotated

...

 *** caught illegal operation ***
address 0x28f71bd6, cause 'unknown'

Traceback:
 1: crossprod(m)
 2: psmall.svd(m, tol)
 3: fast.svd(xsw)
 4: estimate.lambda(cbind(Ytrain, Xtrain), verbose = verbose)
 5: care::carscore(Xtrain = data$data, Ytrain = data$target, verbose = FALSE, ...)
 6: (function (task, nselect, ...) { data = getTaskData(task, target.extra = TRUE) y = care::carscore(Xtrain = data$data, Ytrain = data$target, verbose = FALSE, ...)^2 setNames(as.double(y), names(y))})(task = list(type = "regr", env =
```

This exception seems to occur randomly to me: not for a specific partition, not for specific targets.
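One way to check whether the crashes are really random is to list which worker logs contain the crash marker and then look up the node each of those workers ran on. A minimal sketch, assuming one log file per worker; the helper name `crashed_logs` and the `worker_*.log` naming in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch: list the worker log files that contain the crash marker.
# Pass the log files to inspect as arguments.
crashed_logs() {
  grep -lF 'caught illegal operation' "$@" 2>/dev/null
}

# Usage (hypothetical log naming):
#   crashed_logs worker_*.log
# The node of each array task can then be looked up with Slurm accounting,
# for example: sacct -j <jobid> --format=JobID,NodeList,State
```

If the crashing workers cluster on a few nodes, the failure is tied to the hardware of those nodes rather than to particular targets.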
slurm template:
`template` arg of drake:

As you can see, I am not using internal parallelization and every worker just operates sequentially.
clustermq v0.8.8