mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

clusterFunctionsSSH doesn't work with remote system #182

Closed ja-thomas closed 6 years ago

ja-thomas commented 6 years ago

Hi,

Maybe I'm just too stupid to get it to work, but when I try to use a makeshift SSH cluster with remote systems I get the following error:

> submitJobs(1)
Error in submitJobs(1) : 
  Assertion on 'sys.cmd' failed: Must have length 1, but has length 0.

The problem is that w$script is empty for the Worker w. When I set the worker to localhost, w$script contains a path to /batchtools/bin/linux-helper and it works.

Sorry, I can't really give a reproducible example, since the systems I'm using are only available in our VPN and I don't have another server to test on.

The code looks like this.

Config (.batchtools.conf.R):

cluster.functions = makeClusterFunctionsSSH(list(Worker$new("10.155.47.223", 1)))

Script:

library(batchtools)

unlink("asdf", recursive = TRUE)
reg = makeRegistry("asdf", packages = c("mxnet"), conf.file = ".batchtools.conf.R")

batchMap(function(x) {
  a <- mx.nd.ones(c(2,3), ctx = mx.gpu())
  b <- a * x + 1
  b
}, x = 1:10)

submitJobs(1)

mllg commented 6 years ago

Should work now. Re-open if you still have problems.

ja-thomas commented 6 years ago

Hm, I think we got a bit further, but I still get the same error.

EDIT: No, I think it's still the same problem, i.e. w$script is empty.

Here's the traceback:

Error in submitJobs(1) :
 Assertion on 'sys.cmd' failed: Must have length 1, but has length 0.
> traceback()
18: stop(simpleError(sprintf(msg, ...), sys.call(1L)))
17: mstop("Assertion on '%s' failed: %s.", var.name, res)
16: makeAssertion(x, res, .var.name, add)
15: assertCharacter(sys.cmd, any.missing = FALSE, len = 1L)
14: runOSCommand(self$script, args, nodename = self$nodename)
13: private$filter_output(runOSCommand(self$script, args, nodename = self$nodename))
12: private$run(c("list-jobs", reg$file.dir))
11: stri_trim_both(private$run(c("list-jobs", reg$file.dir))$output)
10: stri_join(self$nodename, "#", stri_trim_both(private$run(c("list-jobs",
       reg$file.dir))$output))
9: w$list(reg)
8: FUN(X[[i]], ...)
7: lapply(workers, function(w) w$list(reg))
6: unlist(lapply(workers, function(w) w$list(reg)), use.names = FALSE)
5: cf$listJobsRunning(reg)
4: unique(cf$listJobsRunning(reg))
3: getBatchIds(reg, status = status)
2: .findOnSystem(reg = reg, cols = c("job.id", "batch.id"))
1: submitJobs(1)

mllg commented 6 years ago

More quoting issues... :cry: -> 02f83b167fcb6aa38a89b7149d1c4258e5e1638b

I hope https://github.com/ropensci/ssh will be released soon ...

mllg commented 6 years ago

Can you try again?

ja-thomas commented 6 years ago

Hm, now I can't spawn the worker anymore:

> Worker$new("10.155.47.223", 1)
Error in stop(simpleError(sprintf(...), call = sys.call(sys.parent()))) : 
  bad error message

Traceback:

5: stop(simpleError(sprintf(...), call = sys.call(sys.parent())))
4: stopf("runOSCommand failed: Expected BOF+EOF markers for '%s %s', but got:\n %s",
       res$sys.cmd, stri_flatten(res$sys.args, " "), stri_flatten(res$output, 
           "\n"))
3: private$filter_output(res)
2: .subset2(public_bind_env, "initialize")(...)
1: Worker$new("10.155.47.223", 1)

Worker$new("localhost", 1)

works, though.

mllg commented 6 years ago

Have you re-installed batchtools on all nodes?

ja-thomas commented 6 years ago

So after reinstalling on all nodes we can submit jobs, but they expire. The log file does not exist.

Here is the script, followed by the debugme output:

Sys.setenv(DEBUGME = "batchtools")
library(batchtools)

unlink("asdf", recursive = TRUE)
reg = makeRegistry("asdf", conf.file = ".batchtools.conf.R")

batchMap(function(x) {
  return(x)
}, x = 1:10)

submitJobs(1)

> reg = makeRegistry("asdf", conf.file = ".batchtools.conf.R")
Sourcing configuration file '.batchtools.conf.R' ...
batchtools [runOSCommand]: cmd:  
batchtools [runOSCommand]: cmd:  +33ms 
batchtools [runOSCommand]: cmd:  +2ms 
batchtools [runOSCommand]: cmd:  +1ms 
batchtools [runOSCommand]: cmd:  +1ms 
batchtools [runOSCommand]: cmd:  +1ms 
batchtools [runOSCommand]: cmd:  +1ms 
batchtools [runOSCommand]: cmd:  +1ms 
batchtools [runOSCommand]: cmd: ssh -q ubuntu@10.155.47.232 "Rscript -e 'message(\"[bt] --BOF--\\n\", \"[bt] \", system.file(\"bin/linux-helper\", package = \"batchtools\"), \"\\n[bt] --EOF--\\n\")'" +1ms 
batchtools +-[runOSCommand]: OS result (stdin ' +623ms 
batchtools +-[runOSCommand]: OS result (stdin ' +7ms 
batchtools +-[runOSCommand]: OS result (stdin ' +3ms 
batchtools +-[runOSCommand]: OS result (stdin ' +3ms 
batchtools +-[runOSCommand]: OS result (stdin ' +2ms 
batchtools +-[runOSCommand]: OS result (stdin ' +2ms 
batchtools +-[runOSCommand]: OS result (stdin ' +2ms 
batchtools +-[runOSCommand]: OS result (stdin ' +2ms 
batchtools [runOSCommand]: OS result (stdin '', exit code 0): +2ms 
batchtools +-[runOSCommand]:  +2ms 
batchtools +-[runOSCommand]:  +2ms 
batchtools +-[runOSCommand]:  +2ms 
batchtools +-[runOSCommand]:  +2ms 
batchtools +-[runOSCommand]:  +2ms 
batchtools +-[runOSCommand]:  +2ms 
batchtools +-[runOSCommand]:  +2ms 
batchtools +-[runOSCommand]:  +2ms 
batchtools [runOSCommand]: During startup - Warning messages:
 +2ms batchtools [runOSCommand]: 1: Setting LC_TIME failed, using "C" 
 +2ms batchtools [runOSCommand]: 2: Setting LC_MONETARY failed, using "C" 
 +2ms batchtools [runOSCommand]: 3: Setting LC_PAPER failed, using "C" 
 +2ms batchtools [runOSCommand]: 4: Setting LC_MEASUREMENT failed, using "C" 
 +2ms batchtools [runOSCommand]: [bt] --BOF--
 +2ms batchtools [runOSCommand]: [bt] /usr/local/lib/R/site-library/batchtools/bin/linux-helper
 +2ms batchtools [runOSCommand]: [bt] --EOF--
 +2ms batchtools [runOSCommand]: 
 +2ms 
batchtools [makeRegistry]: Creating directories in ' +27ms 
batchtools [makeRegistry]: Creating directories in ' +2ms 
batchtools [makeRegistry]: Creating directories in ' +2ms 
batchtools [makeRegistry]: Creating directories in ' +2ms 
batchtools [makeRegistry]: Creating directories in ' +2ms 
batchtools [makeRegistry]: Creating directories in ' +2ms 
batchtools [makeRegistry]: Creating directories in ' +2ms 
batchtools [makeRegistry]: Creating directories in ' +2ms 
batchtools [makeRegistry]: Creating directories in 'asdf' +2ms 
batchtools +-[loadRegistryDependencies]: Starting ... +8ms 
batchtools +-[loadRegistryDependencies]: Starting ... +2ms 
batchtools +-[loadRegistryDependencies]: Starting ... +2ms 
batchtools +-[loadRegistryDependencies]: Starting ... +2ms 
batchtools +-[loadRegistryDependencies]: Starting ... +2ms 
batchtools +-[loadRegistryDependencies]: Starting ... +2ms 
batchtools +-[loadRegistryDependencies]: Starting ... +2ms 
batchtools +-[loadRegistryDependencies]: Starting ... +2ms 
batchtools +-[loadRegistryDependencies]: Starting ... +2ms 
batchtools +-[saveRegistry]: Saving Registry +3ms 
batchtools +-[saveRegistry]: Saving Registry +2ms 
batchtools +-[saveRegistry]: Saving Registry +2ms 
batchtools +-[saveRegistry]: Saving Registry +2ms 
batchtools +-[saveRegistry]: Saving Registry +2ms 
batchtools +-[saveRegistry]: Saving Registry +2ms 
batchtools +-[saveRegistry]: Saving Registry +2ms 
batchtools +-[saveRegistry]: Saving Registry +4ms 
batchtools +-[saveRegistry]: Saving Registry +2ms 
Created registry in '/home/janek/test_batchtools/asdf' using cluster functions 'SSH'
> 
> 
> batchMap(function(x) {
+   return(x)
+ }, x = 1:10)
Adding 10 jobs ...
batchtools [saveRegistry]: Saving Registry +2381ms 
batchtools [saveRegistry]: Saving Registry +2ms 
batchtools [saveRegistry]: Saving Registry +2ms 
batchtools [saveRegistry]: Saving Registry +2ms 
batchtools [saveRegistry]: Saving Registry +2ms 
batchtools [saveRegistry]: Saving Registry +2ms 
batchtools [saveRegistry]: Saving Registry +2ms 
batchtools [saveRegistry]: Saving Registry +2ms 
batchtools [saveRegistry]: Saving Registry +2ms 
> 
> submitJobs(1)
batchtools [syncRegistry]: Triggered syncRegistry +1280ms 
batchtools [syncRegistry]: Triggered syncRegistry +2ms 
batchtools [syncRegistry]: Triggered syncRegistry +1ms 
batchtools [syncRegistry]: Triggered syncRegistry +1ms 
batchtools [syncRegistry]: Triggered syncRegistry +1ms 
batchtools [syncRegistry]: Triggered syncRegistry +2ms 
batchtools [syncRegistry]: Triggered syncRegistry +1ms 
batchtools [syncRegistry]: Triggered syncRegistry +1ms 
batchtools [syncRegistry]: Triggered syncRegistry +1ms 
batchtools [castIds]: Casting ids from vector to data.table +2ms 
batchtools [castIds]: Casting ids from vector to data.table +1ms 
batchtools [castIds]: Casting ids from vector to data.table +1ms 
batchtools [castIds]: Casting ids from vector to data.table +19ms 
batchtools [castIds]: Casting ids from vector to data.table +1ms 
batchtools [castIds]: Casting ids from vector to data.table +1ms 
batchtools [castIds]: Casting ids from vector to data.table +1ms 
batchtools [castIds]: Casting ids from vector to data.table +1ms 
batchtools [castIds]: Casting ids from vector to data.table +1ms 
batchtools [getBatchIds]: Getting running Jobs +7ms 
batchtools [getBatchIds]: Getting running Jobs +2ms 
batchtools [getBatchIds]: Getting running Jobs +1ms 
batchtools [getBatchIds]: Getting running Jobs +1ms 
batchtools [getBatchIds]: Getting running Jobs +1ms 
batchtools [getBatchIds]: Getting running Jobs +1ms 
batchtools [getBatchIds]: Getting running Jobs +2ms 
batchtools [getBatchIds]: Getting running Jobs +2ms 
batchtools [getBatchIds]: Getting running Jobs +2ms 
batchtools +-[runOSCommand]: cmd:  +34ms 
batchtools +-[runOSCommand]: cmd:  +2ms 
batchtools +-[runOSCommand]: cmd:  +2ms 
batchtools +-[runOSCommand]: cmd:  +2ms 
batchtools +-[runOSCommand]: cmd:  +2ms 
batchtools +-[runOSCommand]: cmd:  +2ms 
batchtools +-[runOSCommand]: cmd:  +2ms 
batchtools +-[runOSCommand]: cmd:  +2ms 
batchtools +-[runOSCommand]: cmd: ssh -q ubuntu@10.155.47.232 '/usr/local/lib/R/site-library/batchtools/bin/linux-helper list-jobs /home/janek/test_batchtools/asdf' +2ms 
batchtools   +-[runOSCommand]: OS result (stdin ' +373ms 
batchtools   +-[runOSCommand]: OS result (stdin ' +6ms 
batchtools   +-[runOSCommand]: OS result (stdin ' +4ms 
batchtools   +-[runOSCommand]: OS result (stdin ' +3ms 
batchtools   +-[runOSCommand]: OS result (stdin ' +3ms 
batchtools   +-[runOSCommand]: OS result (stdin ' +2ms 
batchtools   +-[runOSCommand]: OS result (stdin ' +2ms 
batchtools   +-[runOSCommand]: OS result (stdin ' +2ms 
batchtools +-[runOSCommand]: OS result (stdin '', exit code 0): +2ms 
batchtools   +-[runOSCommand]:  +2ms 
batchtools   +-[runOSCommand]:  +2ms 
batchtools   +-[runOSCommand]:  +2ms 
batchtools   +-[runOSCommand]:  +2ms 
batchtools   +-[runOSCommand]:  +2ms 
batchtools   +-[runOSCommand]:  +2ms 
batchtools   +-[runOSCommand]:  +2ms 
batchtools   +-[runOSCommand]:  +2ms 
batchtools +-[runOSCommand]: [bt] --BOF--
 +2ms batchtools +-[runOSCommand]: [bt] --EOF--
 +2ms 
Submitting 1 jobs in 1 chunks using cluster functions 'SSH' ...
batchtools [Worker]: Updating Worker ' +82ms 
batchtools [Worker]: Updating Worker ' +2ms 
batchtools [Worker]: Updating Worker ' +2ms 
batchtools [Worker]: Updating Worker ' +1ms 
batchtools [Worker]: Updating Worker ' +1ms 
batchtools [Worker]: Updating Worker ' +1ms 
batchtools [Worker]: Updating Worker ' +2ms 
batchtools [Worker]: Updating Worker 'ubuntu@10.155.47.232' +1ms 
batchtools +-[runOSCommand]: cmd:  +2ms 
batchtools +-[runOSCommand]: cmd:  +2ms 
batchtools +-[runOSCommand]: cmd:  +2ms 
batchtools +-[runOSCommand]: cmd:  +2ms 
batchtools +-[runOSCommand]: cmd:  +2ms 
batchtools +-[runOSCommand]: cmd:  +2ms 
batchtools +-[runOSCommand]: cmd:  +2ms 
batchtools +-[runOSCommand]: cmd:  +1ms 
batchtools +-[runOSCommand]: cmd: ssh -q ubuntu@10.155.47.232 '/usr/local/lib/R/site-library/batchtools/bin/linux-helper status /home/janek/test_batchtools/asdf' +1ms 
batchtools   +-[runOSCommand]: OS result (stdin ' +418ms 
batchtools   +-[runOSCommand]: OS result (stdin ' +5ms 
batchtools   +-[runOSCommand]: OS result (stdin ' +2ms 
batchtools   +-[runOSCommand]: OS result (stdin ' +2ms 
batchtools   +-[runOSCommand]: OS result (stdin ' +3ms 
batchtools   +-[runOSCommand]: OS result (stdin ' +3ms 
batchtools   +-[runOSCommand]: OS result (stdin ' +3ms 
batchtools   +-[runOSCommand]: OS result (stdin ' +2ms 
batchtools +-[runOSCommand]: OS result (stdin '', exit code 0): +3ms 
batchtools   +-[runOSCommand]:  +3ms 
batchtools   +-[runOSCommand]:  +2ms 
batchtools   +-[runOSCommand]:  +2ms 
batchtools   +-[runOSCommand]:  +5ms 
batchtools   +-[runOSCommand]:  +3ms 
batchtools   +-[runOSCommand]:  +3ms 
batchtools   +-[runOSCommand]:  +2ms 
batchtools   +-[runOSCommand]:  +3ms 
batchtools +-[runOSCommand]: [bt] --BOF--
 +2ms batchtools +-[runOSCommand]: [bt] 0.01 0 0 0
 +2ms batchtools +-[runOSCommand]: [bt] --EOF--
 +2ms 
batchtools [runOSCommand]: cmd:  +7ms 
batchtools [runOSCommand]: cmd:  +2ms 
batchtools [runOSCommand]: cmd:  +2ms 
batchtools [runOSCommand]: cmd:  +2ms 
batchtools [runOSCommand]: cmd:  +2ms 
batchtools [runOSCommand]: cmd:  +2ms 
batchtools [runOSCommand]: cmd:  +2ms 
batchtools [runOSCommand]: cmd:  +2ms 
batchtools [runOSCommand]: cmd: ssh -q ubuntu@10.155.47.232 '/usr/local/lib/R/site-library/batchtools/bin/linux-helper start-job /home/janek/test_batchtools/asdf/jobs/job1b087a98bff4077d9033a78a4e84ed51.rds /home/janek/test_batchtools/asdf/logs/job1b087a98bff4077d9033a78a4e84ed51.log' +3ms 
batchtools +-[runOSCommand]: OS result (stdin ' +399ms 
batchtools +-[runOSCommand]: OS result (stdin ' +7ms 
batchtools +-[runOSCommand]: OS result (stdin ' +3ms 
batchtools +-[runOSCommand]: OS result (stdin ' +3ms 
batchtools +-[runOSCommand]: OS result (stdin ' +3ms 
batchtools +-[runOSCommand]: OS result (stdin ' +3ms 
batchtools +-[runOSCommand]: OS result (stdin ' +2ms 
batchtools +-[runOSCommand]: OS result (stdin ' +2ms 
batchtools [runOSCommand]: OS result (stdin '', exit code 0): +2ms 
batchtools +-[runOSCommand]:  +2ms 
batchtools +-[runOSCommand]:  +2ms 
batchtools +-[runOSCommand]:  +2ms 
batchtools +-[runOSCommand]:  +2ms 
batchtools +-[runOSCommand]:  +2ms 
batchtools +-[runOSCommand]:  +2ms 
batchtools +-[runOSCommand]:  +2ms 
batchtools +-[runOSCommand]:  +2ms 
batchtools [runOSCommand]: [bt] --BOF--
 +2ms batchtools [runOSCommand]: [bt] 7004
 +2ms batchtools [runOSCommand]: [bt] --EOF--
 +2ms batchtools [runOSCommand]: /usr/local/lib/R/site-library/batchtools/bin/linux-helper: line 53: /home/janek/test_batchtools/asdf/logs/job1b087a98bff4077d9033a78a4e84ed51.log: No such file or directory
 +2ms 
batchtools [makeSubmitJobResult]: Result for batch.id ' +3ms 
batchtools [makeSubmitJobResult]: Result for batch.id ' +4ms 
batchtools [makeSubmitJobResult]: Result for batch.id ' +2ms 
batchtools [makeSubmitJobResult]: Result for batch.id ' +2ms 
batchtools [makeSubmitJobResult]: Result for batch.id ' +2ms 
batchtools [makeSubmitJobResult]: Result for batch.id ' +2ms 
batchtools [makeSubmitJobResult]: Result for batch.id ' +2ms 
batchtools [makeSubmitJobResult]: Result for batch.id ' +2ms 
batchtools [makeSubmitJobResult]: Result for batch.id 'ubuntu@10.155.47.232#7004,': 0 (OK) +2ms 
batchtools [saveRegistry]: Saving Registry +4ms 
batchtools [saveRegistry]: Saving Registry +2ms 
batchtools [saveRegistry]: Saving Registry +2ms 
batchtools [saveRegistry]: Saving Registry +2ms 
batchtools [saveRegistry]: Saving Registry +2ms 
batchtools [saveRegistry]: Saving Registry +2ms 
batchtools [saveRegistry]: Saving Registry +2ms 
batchtools [saveRegistry]: Saving Registry +2ms 
batchtools [saveRegistry]: Saving Registry +2ms 

mllg commented 6 years ago

You need the same (relative) file system layout on your machine and the node. You are logging in as user ubuntu but trying to access /home/janek.

Try creating the registry with a file.dir that is not relative to your current working directory. ~ will not be resolved to a complete path (or at least it should not be resolved; this is hard to test), so it can be used in your case:

reg = makeRegistry("~/asdf", conf.file = "~/.batchtools.conf.R")
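
The distinction between the two kinds of path can be illustrated with a small check. This is only a sketch: is_home_relative() is a hypothetical helper, not part of batchtools, and it only inspects the path string. It does not verify that the directory exists on the node.

```r
# Hypothetical helper (not part of batchtools): TRUE if a registry path is
# written relative to the home directory, i.e. it is "~" or starts with "~/".
is_home_relative <- function(file.dir) {
  grepl("^~(/|$)", file.dir)
}

# Relative to the working directory: ends up below the local user's tree,
# such as /home/janek/test_batchtools/asdf in the log above.
is_home_relative("asdf")                              # FALSE

# Written with "~": each machine can expand it to its own home directory.
is_home_relative("~/asdf")                            # TRUE

# Absolute path tied to the local user; the log above shows it was not
# found when logged in as ubuntu on the node.
is_home_relative("/home/janek/test_batchtools/asdf")  # FALSE
```

A check like this could be run on file.dir before calling makeRegistry() when the local and remote user names differ.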

mllg commented 6 years ago

Did this work for you?

NB: if you install the devel version of debugme, the output is not cluttered.

ja-thomas commented 6 years ago

Sorry for the late reply.

It works now, thanks for your help!