Closed shikokuchuo closed 2 weeks ago
5a7a73a removes additional shell quoting around the Rscript -e '...' command, which might be what's causing the issue for srun.
I've tested with SSH and it works fine. So this may have been required at some earlier point in time, but is no longer the case.
@michaelmayer2 can you let me know if this works (with build 9002 from R-universe)? The difference with respect to your examples is that I retain the use of system2 vs system, hence it is working for SSH.
If this is not the issue, then it might be a special case for srun, as I see it allows passing multiple commands separated by :, hence it may be doing its own custom parsing rather than just following the standard execve practice. I'll have to think about handling SLURM differently if that's the case.
I can confirm that build 9002 works with SLURM / srun, which is great. As suspected, however, I get the old error when running via ssh:
> sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3; LAPACK version 3.9.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8 LC_PAPER=C.UTF-8
[8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] parallel stats graphics grDevices datasets utils methods base
other attached packages:
[1] slurmR_0.5-4 furrr_0.3.1 future.mirai_0.2.1 future_1.33.2 mirai_1.1.0.9002
loaded via a namespace (and not attached):
[1] digest_0.6.35 codetools_0.2-19 magrittr_2.0.3 nanonext_1.1.0.9001 lifecycle_1.0.4 cli_3.6.2 parallelly_1.37.1 vctrs_0.6.5
[9] renv_1.0.7 compiler_4.3.2 purrr_1.0.2 globals_0.16.3 rstudioapi_0.16.0 tools_4.3.2 listenv_0.9.1 rlang_1.1.2
> mirai::daemons(compute_cores,
+ url = host_url(tls = FALSE),
+ remote = ssh_config(
+ remotes = "ssh://localhost",
+ timeout = 1,
+ rscript = file.path(R.home("bin"), "Rscript")
+ )
+ )
[1] 1
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
bash: -c: line 0: syntax error near unexpected token `('
bash: -c: line 0: `/opt/R/4.3.2/lib/R/bin/Rscript -e mirai::daemon("tcp://interactive-st-rstudio-1:44937",rs=c(10407,-1181744049,-321227196,959850613,-1193191758,-729057653,-1086456752))'
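The syntax error above is consistent with the expression after Rscript -e reaching bash unquoted, so that bash tries to interpret the parentheses itself. A minimal sketch in base R shows the difference one level of shell quoting makes to the command string. It uses a shortened, hypothetical stand-in for the real launch expression (the host and port are made up):

```r
# Hypothetical, shortened stand-in for the daemon launch expression.
expr <- 'mirai::daemon("tcp://host:5555")'

# Unquoted: a shell would see the bare parentheses and report a syntax error.
unquoted <- paste("Rscript -e", expr)

# Quoted once for a POSIX shell: the expression stays a single argument.
quoted <- paste("Rscript -e", shQuote(expr, type = "sh"))

cat(unquoted, quoted, sep = "\n")
```

Passing type = "sh" explicitly keeps the result the same on every platform, since the default quoting style of shQuote() differs on Windows.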
Thanks Michael, I'm not sure how I tested SSH as working earlier. I am getting the same result as you.
In build 9003, I have SSH now working by double shell-quoting that argument (the part after Rscript -e). Hopefully you can confirm that this still works for srun, as the whole expression remains unquoted.
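One reading of why SSH needs the argument quoted twice: the command string is parsed first by the local shell and then again by the shell that ssh starts on the remote host, and each pass strips one layer of quoting. A rough sketch with base R's shQuote(), again using a hypothetical shortened expression; it illustrates the idea only and is not mirai's actual implementation:

```r
# Hypothetical, shortened stand-in for the daemon launch expression.
expr <- 'mirai::daemon("tcp://host:5555")'

# One layer: survives a single round of shell parsing.
once <- shQuote(expr, type = "sh")

# Two layers: the first shell strips the outer layer and hands the
# still-quoted expression on to the second shell.
twice <- shQuote(once, type = "sh")

cat(once, twice, sep = "\n")
```

A launcher such as srun, which receives the program and its arguments without a second shell in between, would not need the extra layer under this reading.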
Thanks for all your efforts, Charlie.
While the behaviour with 9003 is different, it still does not seem to work. I have set a browser() statement just after the cmds[] are created to see what is going on.
> daemons(compute_cores,
+ url = host_url(ws=TRUE, tls = TRUE),
+ remote = remote_config(
+ command="srun",
+ args=c("--mem 512", "-n 1", "."),
+ rscript = paste0(Sys.getenv("R_HOME"),"/bin/Rscript")
+ ),
+ dispatcher=TRUE
+ )
Generating key + certificate [done]
Called from: launch_remote(url = envir[["urls"]], remote = remote, tls = envir[["tls"]],
..., .compute = .compute)
Browse[1]> cmds
[1] "/opt/R/4.3.2/lib/R/bin/Rscript -e \"\\\"mirai::daemon(\\\\\\\"wss://interactive-st-rstudio-1:34773\\\\\\\",tls=c('-----BEGIN CERTIFICATE-----\nMIIFVDCCAzygAwIBAgIBATANBgkqhkiG9w0BAQsFADBDMSEwHwYDVQQDDBhpbnRl\ncmFjdGl2ZS1zdC1yc3R1ZGlvLTExETAPBgNVBAoMCE5hbm9uZXh0MQswCQYDVQQG\nEwJKUDAeFw0wMTAxMDEwMDAwMDBaFw0zMDEyMzEyMzU5NTlaMEMxITAfBgNVBAMM\nGGludGVyYWN0aXZlLXN0LXJzdHVkaW8tMTERMA8GA1UECgwITmFub25leHQxCzAJ\nBgNVBAYTAkpQMIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAlAKAsPEA\nmU685D40gGGS7gpkX/148fJyQXyP9iAnfFFBk0+oc4d77nEAsjsZNqyf0U0+BbXK\n37szO11mFXgzvwlTODh+U9KzIAJ+7Ki2rMM4blsAt8vVzXl9+sQb17hyBDntj5np\nxo7CHdwSdoHkA7SA8ZAt7avwI28uIWY0Ti7Hq9Wmc5bdLE2CkM9v4Qu2Bwj8lbGP\nUD7L17bcNQRofyxiJmT4kJKqNilPwhoAe7Y3O6VCRwDn0sH0aUAPuQR1EchGLnT1\nIYVRji5julw0xVAK3MR4qp2gmgMvLAVdCUVibEEdRevpiwotp5BbE5fTm+iC1P+v\ni472OXQ904Zqxda7umfbaxXdyC050ui+Gp88LRGDPamszXLRiQ8MaZ5/VNEIbXGR\ni091LsQSfGeQR1nDAyGcVF5ab8N0Fi11XVHTNdaRWIcE2+jfyRg13SpsVupuTTPQ\nuvkLL+Fe3FX2jbpm3ouzpF7OgtRpltacOV7GjA6F8aJJHKGc5Ita614AVL8JbuV0\na36H54ABNxXdMFOcLxrN9ZRbsQWt1R6tZ43LeIBe39iMnLM2hYRLXop9C1oagWMa\nWgt2Eg75DkgC/CuH605rctLCBWHWr1O7NozINjecA6WvfO22V4C+N3VwkYUxTeK1\nIO6/yaLGyd9XP00a+wKYmvGepmVB7sTWK4ECAwEAAaNTMFEwDwYDVR0TBAgwBgEB\n/wIBADAdBgNVHQ4EFgQUV80EGqIC+c8uZFiw3BzGDJKxaXEwHwYDVR0jBBgwFoAU\nV80EGqIC+c8uZFiw3BzGDJKxaXEwDQYJKoZIhvcNAQELBQADggIBADYOfnlMLYNv\nySANqxMwYVfXtB0OV3DWPgxTUJ0/uahM/potWMZfAm1IvNz5g4rto1LgTPxyDODg\nAF2naQLT9mG5G4Tgjc1oLEDluzr/eWKqTKJMU2Ms/eLomgexk6pseS1fNYVc1YLV\nyfyWVcmEi1oR8Y5yvgDsYLtX091aIU00ASNfUX0ZuTSJaMK9DFlVYKbAHa1i+sDG\nA2kOhVtpmxcWz38hKY94/HmJas5k7Iij+eNjU43FZTv4pZsNx0P4b045imniZs8G\nLludi2JQWChaxFNCOEsny9HANzBmnP9TbuL0h8/xeKcyVahxmS3DKB+QWyLo0Awx\nZoc5nUZI3rWBKI1VtnBkrXLmjWSrTx40hGhcc4gZSefnAO9j7OdKNui9xbTjNI3g\nZolXjAW7Ab+FaWK/1UXAhwRelOqO4qFxkc3LeapRcazihG/l5AXu6vzWRo+gQnmZ\nQzJO01c3BvYur0XREx3U66QBfkskfZJoNgDTMsXk2o3l1l5B1M0E7xGr+16DupDP\nAekUfEM9AtxVz56vhLgcPBmk6LRcjIovVrxq6y5juuUKYYy042U/W59ZRuXAILXY\nS72Yk5StZOp0YmdwYLFxxNJBaSYoCK0ml4gqBlufQPViI5TQmSsMNA9qQcfc8L+c\na5OG
i9XyMqgt3Wd/BoRcqTiOXrEu+RgJ\n-----END CERTIFICATE-----\n',''),rs=c(10407,-228603134,-1090082149,-107250400,1140708001,680843246,-434765929))\\\"\""
Browse[1]> c
[1] 1
[1] "mirai::daemon(\"wss://interactive-st-rstudio-1:34773\",tls=c('-----BEGIN CERTIFICATE-----\nMIIFVDCCAzygAwIBAgIBATANBgkqhkiG9w0BAQsFADBDMSEwHwYDVQQDDBhpbnRl\ncmFjdGl2ZS1zdC1yc3R1ZGlvLTExETAPBgNVBAoMCE5hbm9uZXh0MQswCQYDVQQG\nEwJKUDAeFw0wMTAxMDEwMDAwMDBaFw0zMDEyMzEyMzU5NTlaMEMxITAfBgNVBAMM\nGGludGVyYWN0aXZlLXN0LXJzdHVkaW8tMTERMA8GA1UECgwITmFub25leHQxCzAJ\nBgNVBAYTAkpQMIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAlAKAsPEA\nmU685D40gGGS7gpkX/148fJyQXyP9iAnfFFBk0+oc4d77nEAsjsZNqyf0U0+BbXK\n37szO11mFXgzvwlTODh+U9KzIAJ+7Ki2rMM4blsAt8vVzXl9+sQb17hyBDntj5np\nxo7CHdwSdoHkA7SA8ZAt7avwI28uIWY0Ti7Hq9Wmc5bdLE2CkM9v4Qu2Bwj8lbGP\nUD7L17bcNQRofyxiJmT4kJKqNilPwhoAe7Y3O6VCRwDn0sH0aUAPuQR1EchGLnT1\nIYVRji5julw0xVAK3MR4qp2gmgMvLAVdCUVibEEdRevpiwotp5BbE5fTm+iC1P+v\ni472OXQ904Zqxda7umfbaxXdyC050ui+Gp88LRGDPamszXLRiQ8MaZ5/VNEIbXGR\ni091LsQSfGeQR1nDAyGcVF5ab8N0Fi11XVHTNdaRWIcE2+jfyRg13SpsVupuTTPQ\nuvkLL+Fe3FX2jbpm3ouzpF7OgtRpltacOV7GjA6F8aJJHKGc5Ita614AVL8JbuV0\na36H54ABNxXdMFOcLxrN9ZRbsQWt1R6tZ43LeIBe39iMnLM2hYRLXop9C1oagWMa\nWgt2Eg75DkgC/CuH605rctLCBWHWr1O7NozINjecA6WvfO22V4C+N3VwkYUxTeK1\nIO6/yaLGyd9XP00a+wKYmvGepmVB7sTWK4ECAwEAAaNTMFEwDwYDVR0TBAgwBgEB\n/wIBADAdBgNVHQ4EFgQUV80EGqIC+c8uZFiw3BzGDJKxaXEwHwYDVR0jBBgwFoAU\nV80EGqIC+c8uZFiw3BzGDJKxaXEwDQYJKoZIhvcNAQELBQADggIBADYOfnlMLYNv\nySANqxMwYVfXtB0OV3DWPgxTUJ0/uahM/potWMZfAm1IvNz5g4rto1LgTPxyDODg\nAF2naQLT9mG5G4Tgjc1oLEDluzr/eWKqTKJMU2Ms/eLomgexk6pseS1fNYVc1YLV\nyfyWVcmEi1oR8Y5yvgDsYLtX091aIU00ASNfUX0ZuTSJaMK9DFlVYKbAHa1i+sDG\nA2kOhVtpmxcWz38hKY94/HmJas5k7Iij+eNjU43FZTv4pZsNx0P4b045imniZs8G\nLludi2JQWChaxFNCOEsny9HANzBmnP9TbuL0h8/xeKcyVahxmS3DKB+QWyLo0Awx\nZoc5nUZI3rWBKI1VtnBkrXLmjWSrTx40hGhcc4gZSefnAO9j7OdKNui9xbTjNI3g\nZolXjAW7Ab+FaWK/1UXAhwRelOqO4qFxkc3LeapRcazihG/l5AXu6vzWRo+gQnmZ\nQzJO01c3BvYur0XREx3U66QBfkskfZJoNgDTMsXk2o3l1l5B1M0E7xGr+16DupDP\nAekUfEM9AtxVz56vhLgcPBmk6LRcjIovVrxq6y5juuUKYYy042U/W59ZRuXAILXY\nS72Yk5StZOp0YmdwYLFxxNJBaSYoCK0ml4gqBlufQPViI5TQmSsMNA9qQcfc8L+c\na5OGi9XyMqgt3Wd/BoRcqTiOXrEu+RgJ\n-----END 
CERTIFICATE-----\n',''),rs=c(10407,-228603134,-1090082149,-107250400,1140708001,680843246,-434765929))"
> daemons()
$connections
[1] 1
$daemons
i online instance assigned complete
wss://interactive-st-rstudio-1:34773 1 0 0 0 0
What I can see on the SLURM side is that the SLURM job has completed successfully with a runtime of zero seconds. If I take the output string and run it against srun, the mirai daemon comes online, i.e.
srun --mem 512 /opt/R/4.3.2/bin/Rscript -e "mirai::daemon(\"wss://interactive-st-rstudio-1:39613\",tls=c('-----BEGIN CERTIFICATE-----\nMIIFVDCCAzygAwIBAgIBATANBgkqhkiG9w0BAQsFADBDMSEwHwYDVQQDDBhpbnRl\ncmFjdGl2ZS1zdC1yc3R1ZGlvLTExETAPBgNVBAoMCE5hbm9uZXh0MQswCQYDVQQG\nEwJKUDAeFw0wMTAxMDEwMDAwMDBaFw0zMDEyMzEyMzU5NTlaMEMxITAfBgNVBAMM\nGGludGVyYWN0aXZlLXN0LXJzdHVkaW8tMTERMA8GA1UECgwITmFub25leHQxCzAJ\nBgNVBAYTAkpQMIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAkeXSYgHO\ntE7nNPpinWIfJ7u4uPR8SKraYYUkNjI0VFBwlVIz3cXFTiYZ5QDpar613zwKzjWh\nFj10oA9xP/c2pOikE8ODelOBAscvE6ldBv1FbDeMDH+nJbnSMPEyVXCQc84eGYZo\nx81d+t8meHgMQV1iaufp1fH92Xj00iL/PPp3dO37XJ+SPk2eTVp/f9kvo24ckNir\nPy0CQcNjqz5qrmSRDFulB35/gDHW+asJ/YhDDFyhMxLfXnBrtRjpIygme3YL4WP6\ntrq3GfQK7AyAkOE6b3p/cQ8A/vbvZ2cYQH1tox9/ILNIiN+0KoB/qirLT4sg0oAt\ni8s1Kf8Nb0Kpe/4xnHelpxEf5sKmu61YS87XTyZKMYyXusGwoowdmj5ro73l/Z9g\nZn+tiT6PXPtSgUwf6fJugBF9mVTKjBB038tPQCJ3nnKgAnooGZAn+YZ43e7nXCl5\nDPQzkpnutzamSes6BsPF34VoBWtX1K4NWxacaUbpy7K5cDHmbhOMNPDmFe60BXNH\nq95ZsMJupw0LA2qhlEu5gRaC1uD1NIYlINYvVcL9+fV2ExElN0B0VTsAJyIbEI2M\nHocaDShi/mlXnqEFqmgN4EElzdiiVdt/cCMN0FgcB3VQfCpH2kZEPkN4SPX5xmJd\nENRh00NaLf8cBdD9b8rQBPotTDv6ZnOtnpECAwEAAaNTMFEwDwYDVR0TBAgwBgEB\n/wIBADAdBgNVHQ4EFgQUTc4OuPP9aZ52gQNRzN2PN25E4v0wHwYDVR0jBBgwFoAU\nTc4OuPP9aZ52gQNRzN2PN25E4v0wDQYJKoZIhvcNAQELBQADggIBABUqsIzwQ7jB\nkLC9C1nS35aXN0Nq5ZqFUAI+yoo5M5aqMWYgCWpeQHGf09gFZj+C7jzanuxAHZ3R\nkYgh9WSGQo1DFIy6O4VUZw3evLQugYifzayU0XoTeGjpeG98SbfZ8ZeXHtjMQ3Op\nGF79wH386PnVwswB+faM74pu5KwnkFGq9EKnnKuLxlsF7+KjI160K6jGtU2XnVyP\nl/3PH/I/3WdPWmlMcH3LUIMT2pb5cuhlNpWro/oGPX+XDk/h1oMfGbSHg6d0rOjQ\nYqazx2k8tWPACWPSap5+BRB8gk6b+BeZ6+mdjgihXhAbx7Oz+QsCZSKgQHJtM55v\nomasn//ACC6WA9dSxfB1LdER8Lq3WQxn+zqdbFdmOAbdnl0JRgc5ZduupoHpKCgP\nqnzlnBpvZf7gnLp2ziwfx2AKQzM8NZOVEWwcM7DCsS/CTkpJ7KVbv+E8VGpO38YZ\npzH1B0LYuQbIFtcQgUEl7tx6p3IktycHFZlZ2YJjhTwOMuWh9jDS34EfHw+phaVX\nWLlWgdAnM08SHku5MvDO1fVNEuJ/l8jCyqtjf+aztEeqlErsNV2ceMSv4x7JQ2ef\n6Lq9SH4aJuW+ZO1spx7JhLDfPu8LkokAvC0Y+F72NILRvpwUq6nOxGxTk0i8z5Pg\nit3eJIL31bwERqXNs
OTq12ROk7Q/EiLd\n-----END CERTIFICATE-----\n',''),rs=c(10407,1180069229,607201610,-1547644349,-1101478232,-1244291959,294276790))"
works and also polls indefinitely, i.e. does not stop until you stop the mirai cluster.
Thanks for the detailed output. From the above I also spotted that I hadn't switched the quoting behaviour for the TLS certificates.
I'm about to push a fix for this entire issue. There are two different behaviours required for ssh and srun respectively, and they will be governed by an additional argument in remote_config(), with the defaults working in each case.
I came back to this and I figured out an alternative way to run SLURM tasks would be to use
remote = remote_config(
command = "sbatch",
args = c("--mem 512", "-n 1", "--wrap", "."),
rscript = file.path(R.home("bin"), "Rscript"),
quote = TRUE
),
The benefit of using sbatch is that the job accounting information is much more accessible when compared to srun.
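To illustrate how the "." placeholder and the quote argument might interact, here is a minimal sketch. The helper build_remote_call() and the launch string are hypothetical and written only for this illustration; they are not mirai functions and may differ from what remote_config() does internally:

```r
# Hypothetical helper, not part of mirai: replaces the "." element of `args`
# with the launch command, shell-quoting it first when `quote = TRUE`.
build_remote_call <- function(command, args, launch, quote = FALSE) {
  stopifnot("." %in% args)
  if (quote) launch <- shQuote(launch, type = "sh")
  args[args == "."] <- launch
  paste(c(command, args), collapse = " ")
}

# Made-up launch command standing in for the real daemon launch string.
launch <- "Rscript -e 'mirai::daemon(\"tcp://host:5555\")'"

# srun takes the program and its arguments directly, so no extra quoting here.
srun_call <- build_remote_call("srun", c("--mem 512", "-n 1", "."), launch)

# sbatch --wrap expects the whole command as one string, hence quote = TRUE.
sbatch_call <- build_remote_call(
  "sbatch", c("--mem 512", "-n 1", "--wrap", "."), launch, quote = TRUE
)

cat(srun_call, sbatch_call, sep = "\n")
```

Under these assumptions the only difference between the two calls is whether the launch command is wrapped as a single shell word, which would match the quote = TRUE setting needed for sbatch above.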
Thanks Michael. Just to confirm my understanding, you're saying that using sbatch is preferable and this requires remote_config(quote = TRUE)? If that's the case then I'd like to make quote = TRUE the default as it seems it's also the case for most other commands.
Yes, sbatch feels very much preferable to me and requires remote_config(quote = TRUE).
I've realised it's too disruptive to change the default, and quote = FALSE is actually used in some other contexts, so I've just added a Slurm sbatch example to the remote_config() docs. Thanks for reporting back on this @michaelmayer2!
Raised at: https://github.com/HenrikBengtsson/future.mirai/issues/12#issuecomment-2177929087
39ce672 does not solve the issue, but is a step in the right direction with a more preferred shell quoting scheme.