shikokuchuo / mirai

mirai - Minimalist Async Evaluation Framework for R
https://shikokuchuo.net/mirai/
GNU General Public License v3.0

Make `remote_config()` work with SLURM `srun` command #119

Closed shikokuchuo closed 2 weeks ago

shikokuchuo commented 5 months ago

Raised at: https://github.com/HenrikBengtsson/future.mirai/issues/12#issuecomment-2177929087

39ce672 does not solve the issue, but it is a step in the right direction, with a preferable shell quoting scheme.

shikokuchuo commented 5 months ago

5a7a73a removes additional shell quoting around the Rscript -e '...' command, which might be what's causing the issue for srun.

I've tested with SSH and it works fine. So this may have been required at some earlier point in time, but is no longer the case.

@michaelmayer2 can you let me know if this works (with build 9002 from R-universe)? The difference from your examples is that I retain the use of system2 rather than system, which is why it works for SSH.

If this is not the issue, then it might be a special case for srun: I see it allows passing multiple commands separated by :, so it may be doing its own custom parsing rather than just following the standard execve practice. I'll have to think about handling SLURM differently if that's the case.
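To illustrate why the quoting around the expression after Rscript -e matters when the command line goes through a shell, here is a minimal Unix-only sketch in base R. It is not mirai's code: the expression cat(1+1) is a stand-in for the mirai::daemon(...) call, and the variable names are made up for the example.

```r
# Unix-only sketch: system2() builds one command line and hands it to a shell.
rscript <- file.path(R.home("bin"), "Rscript")
expr <- "cat(1+1)"  # stand-in for the mirai::daemon(...) call (assumption)

# Unquoted: the shell parses the parentheses itself and rejects the line,
# so Rscript never runs and a non-zero exit status is attached to the result.
bad <- suppressWarnings(
  system2(rscript, c("-e", expr), stdout = TRUE, stderr = FALSE)
)

# Quoted: the shell passes the expression to Rscript as a single argument.
good <- system2(rscript, c("-e", shQuote(expr, type = "sh")),
                stdout = TRUE, stderr = FALSE)

print(attr(bad, "status"))  # non-NULL when the shell rejected the command
print(good)
```

A launcher that executes its arguments directly, without a second shell pass, would need one layer of quoting fewer, which is the distinction being discussed here.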

michaelmayer2 commented 5 months ago

I can confirm that build 9002 works with SLURM / srun, which is great. As suspected, however, I get the old error when running ssh via

> sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3 
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8      
 [8] LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] parallel  stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] slurmR_0.5-4       furrr_0.3.1        future.mirai_0.2.1 future_1.33.2      mirai_1.1.0.9002  

loaded via a namespace (and not attached):
 [1] digest_0.6.35       codetools_0.2-19    magrittr_2.0.3      nanonext_1.1.0.9001 lifecycle_1.0.4     cli_3.6.2           parallelly_1.37.1   vctrs_0.6.5        
 [9] renv_1.0.7          compiler_4.3.2      purrr_1.0.2         globals_0.16.3      rstudioapi_0.16.0   tools_4.3.2         listenv_0.9.1       rlang_1.1.2        
> mirai::daemons(compute_cores,
+                url = host_url(tls = FALSE),
+                remote = ssh_config(
+                    remotes = "ssh://localhost",
+                    timeout = 1,
+                    rscript = file.path(R.home("bin"), "Rscript")
+                )
+ )
[1] 1
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
bash: -c: line 0: syntax error near unexpected token `('
bash: -c: line 0: `/opt/R/4.3.2/lib/R/bin/Rscript -e mirai::daemon("tcp://interactive-st-rstudio-1:44937",rs=c(10407,-1181744049,-321227196,959850613,-1193191758,-729057653,-1086456752))'
shikokuchuo commented 5 months ago

Thanks Michael, I'm not sure how I tested SSH as working earlier. I am getting the same result as you.

In build 9003, I have SSH now working by double shell-quoting that argument (the part after Rscript -e).

Hopefully you can confirm that this still works for srun as the whole expression remains unquoted.
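As a rough illustration of what double shell-quoting does to the argument after Rscript -e, the base R sketch below applies shQuote() once and then twice. The daemon URL is a placeholder, and this is only an approximation of the idea, not the code used inside mirai.

```r
# Placeholder expression; the host and port are made up for the example.
expr <- 'mirai::daemon("tcp://host:5555")'

# One layer: enough when a single shell parses the command line.
once <- shQuote(expr, type = "sh")

# Two layers: the local shell strips the outer layer, and a second shell
# (for example the one started by ssh on the remote side) strips the inner one.
twice <- shQuote(once, type = "sh")

cat(once, twice, sep = "\n")
```

Each shell that parses the line removes one layer, so the number of layers has to match the number of shells between R and Rscript.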

michaelmayer2 commented 5 months ago

Thanks for all your efforts, Charlie.

While the behaviour with 9003 is different, it still does not seem to work. I have set a browser() statement just after the cmds[] are created to see what is going on.

> daemons(compute_cores,
+                url = host_url(ws=TRUE, tls = TRUE),
+                remote = remote_config(
+                  command="srun",
+                  args=c("--mem 512", "-n 1", "."),
+                  rscript = paste0(Sys.getenv("R_HOME"),"/bin/Rscript")
+                ),
+                dispatcher=TRUE
+ )
Generating key + certificate [done]
Called from: launch_remote(url = envir[["urls"]], remote = remote, tls = envir[["tls"]], 
    ..., .compute = .compute)
Browse[1]> cmds
[1] "/opt/R/4.3.2/lib/R/bin/Rscript -e \"\\\"mirai::daemon(\\\\\\\"wss://interactive-st-rstudio-1:34773\\\\\\\",tls=c('-----BEGIN CERTIFICATE-----\nMIIFVDCCAzygAwIBAgIBATANBgkqhkiG9w0BAQsFADBDMSEwHwYDVQQDDBhpbnRl\ncmFjdGl2ZS1zdC1yc3R1ZGlvLTExETAPBgNVBAoMCE5hbm9uZXh0MQswCQYDVQQG\nEwJKUDAeFw0wMTAxMDEwMDAwMDBaFw0zMDEyMzEyMzU5NTlaMEMxITAfBgNVBAMM\nGGludGVyYWN0aXZlLXN0LXJzdHVkaW8tMTERMA8GA1UECgwITmFub25leHQxCzAJ\nBgNVBAYTAkpQMIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAlAKAsPEA\nmU685D40gGGS7gpkX/148fJyQXyP9iAnfFFBk0+oc4d77nEAsjsZNqyf0U0+BbXK\n37szO11mFXgzvwlTODh+U9KzIAJ+7Ki2rMM4blsAt8vVzXl9+sQb17hyBDntj5np\nxo7CHdwSdoHkA7SA8ZAt7avwI28uIWY0Ti7Hq9Wmc5bdLE2CkM9v4Qu2Bwj8lbGP\nUD7L17bcNQRofyxiJmT4kJKqNilPwhoAe7Y3O6VCRwDn0sH0aUAPuQR1EchGLnT1\nIYVRji5julw0xVAK3MR4qp2gmgMvLAVdCUVibEEdRevpiwotp5BbE5fTm+iC1P+v\ni472OXQ904Zqxda7umfbaxXdyC050ui+Gp88LRGDPamszXLRiQ8MaZ5/VNEIbXGR\ni091LsQSfGeQR1nDAyGcVF5ab8N0Fi11XVHTNdaRWIcE2+jfyRg13SpsVupuTTPQ\nuvkLL+Fe3FX2jbpm3ouzpF7OgtRpltacOV7GjA6F8aJJHKGc5Ita614AVL8JbuV0\na36H54ABNxXdMFOcLxrN9ZRbsQWt1R6tZ43LeIBe39iMnLM2hYRLXop9C1oagWMa\nWgt2Eg75DkgC/CuH605rctLCBWHWr1O7NozINjecA6WvfO22V4C+N3VwkYUxTeK1\nIO6/yaLGyd9XP00a+wKYmvGepmVB7sTWK4ECAwEAAaNTMFEwDwYDVR0TBAgwBgEB\n/wIBADAdBgNVHQ4EFgQUV80EGqIC+c8uZFiw3BzGDJKxaXEwHwYDVR0jBBgwFoAU\nV80EGqIC+c8uZFiw3BzGDJKxaXEwDQYJKoZIhvcNAQELBQADggIBADYOfnlMLYNv\nySANqxMwYVfXtB0OV3DWPgxTUJ0/uahM/potWMZfAm1IvNz5g4rto1LgTPxyDODg\nAF2naQLT9mG5G4Tgjc1oLEDluzr/eWKqTKJMU2Ms/eLomgexk6pseS1fNYVc1YLV\nyfyWVcmEi1oR8Y5yvgDsYLtX091aIU00ASNfUX0ZuTSJaMK9DFlVYKbAHa1i+sDG\nA2kOhVtpmxcWz38hKY94/HmJas5k7Iij+eNjU43FZTv4pZsNx0P4b045imniZs8G\nLludi2JQWChaxFNCOEsny9HANzBmnP9TbuL0h8/xeKcyVahxmS3DKB+QWyLo0Awx\nZoc5nUZI3rWBKI1VtnBkrXLmjWSrTx40hGhcc4gZSefnAO9j7OdKNui9xbTjNI3g\nZolXjAW7Ab+FaWK/1UXAhwRelOqO4qFxkc3LeapRcazihG/l5AXu6vzWRo+gQnmZ\nQzJO01c3BvYur0XREx3U66QBfkskfZJoNgDTMsXk2o3l1l5B1M0E7xGr+16DupDP\nAekUfEM9AtxVz56vhLgcPBmk6LRcjIovVrxq6y5juuUKYYy042U/W59ZRuXAILXY\nS72Yk5StZOp0YmdwYLFxxNJBaSYoCK0ml4gqBlufQPViI5TQmSsMNA9qQcfc8L+c\na5OG
i9XyMqgt3Wd/BoRcqTiOXrEu+RgJ\n-----END CERTIFICATE-----\n',''),rs=c(10407,-228603134,-1090082149,-107250400,1140708001,680843246,-434765929))\\\"\""
Browse[1]> c
[1] 1
[1] "mirai::daemon(\"wss://interactive-st-rstudio-1:34773\",tls=c('-----BEGIN CERTIFICATE-----\nMIIFVDCCAzygAwIBAgIBATANBgkqhkiG9w0BAQsFADBDMSEwHwYDVQQDDBhpbnRl\ncmFjdGl2ZS1zdC1yc3R1ZGlvLTExETAPBgNVBAoMCE5hbm9uZXh0MQswCQYDVQQG\nEwJKUDAeFw0wMTAxMDEwMDAwMDBaFw0zMDEyMzEyMzU5NTlaMEMxITAfBgNVBAMM\nGGludGVyYWN0aXZlLXN0LXJzdHVkaW8tMTERMA8GA1UECgwITmFub25leHQxCzAJ\nBgNVBAYTAkpQMIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAlAKAsPEA\nmU685D40gGGS7gpkX/148fJyQXyP9iAnfFFBk0+oc4d77nEAsjsZNqyf0U0+BbXK\n37szO11mFXgzvwlTODh+U9KzIAJ+7Ki2rMM4blsAt8vVzXl9+sQb17hyBDntj5np\nxo7CHdwSdoHkA7SA8ZAt7avwI28uIWY0Ti7Hq9Wmc5bdLE2CkM9v4Qu2Bwj8lbGP\nUD7L17bcNQRofyxiJmT4kJKqNilPwhoAe7Y3O6VCRwDn0sH0aUAPuQR1EchGLnT1\nIYVRji5julw0xVAK3MR4qp2gmgMvLAVdCUVibEEdRevpiwotp5BbE5fTm+iC1P+v\ni472OXQ904Zqxda7umfbaxXdyC050ui+Gp88LRGDPamszXLRiQ8MaZ5/VNEIbXGR\ni091LsQSfGeQR1nDAyGcVF5ab8N0Fi11XVHTNdaRWIcE2+jfyRg13SpsVupuTTPQ\nuvkLL+Fe3FX2jbpm3ouzpF7OgtRpltacOV7GjA6F8aJJHKGc5Ita614AVL8JbuV0\na36H54ABNxXdMFOcLxrN9ZRbsQWt1R6tZ43LeIBe39iMnLM2hYRLXop9C1oagWMa\nWgt2Eg75DkgC/CuH605rctLCBWHWr1O7NozINjecA6WvfO22V4C+N3VwkYUxTeK1\nIO6/yaLGyd9XP00a+wKYmvGepmVB7sTWK4ECAwEAAaNTMFEwDwYDVR0TBAgwBgEB\n/wIBADAdBgNVHQ4EFgQUV80EGqIC+c8uZFiw3BzGDJKxaXEwHwYDVR0jBBgwFoAU\nV80EGqIC+c8uZFiw3BzGDJKxaXEwDQYJKoZIhvcNAQELBQADggIBADYOfnlMLYNv\nySANqxMwYVfXtB0OV3DWPgxTUJ0/uahM/potWMZfAm1IvNz5g4rto1LgTPxyDODg\nAF2naQLT9mG5G4Tgjc1oLEDluzr/eWKqTKJMU2Ms/eLomgexk6pseS1fNYVc1YLV\nyfyWVcmEi1oR8Y5yvgDsYLtX091aIU00ASNfUX0ZuTSJaMK9DFlVYKbAHa1i+sDG\nA2kOhVtpmxcWz38hKY94/HmJas5k7Iij+eNjU43FZTv4pZsNx0P4b045imniZs8G\nLludi2JQWChaxFNCOEsny9HANzBmnP9TbuL0h8/xeKcyVahxmS3DKB+QWyLo0Awx\nZoc5nUZI3rWBKI1VtnBkrXLmjWSrTx40hGhcc4gZSefnAO9j7OdKNui9xbTjNI3g\nZolXjAW7Ab+FaWK/1UXAhwRelOqO4qFxkc3LeapRcazihG/l5AXu6vzWRo+gQnmZ\nQzJO01c3BvYur0XREx3U66QBfkskfZJoNgDTMsXk2o3l1l5B1M0E7xGr+16DupDP\nAekUfEM9AtxVz56vhLgcPBmk6LRcjIovVrxq6y5juuUKYYy042U/W59ZRuXAILXY\nS72Yk5StZOp0YmdwYLFxxNJBaSYoCK0ml4gqBlufQPViI5TQmSsMNA9qQcfc8L+c\na5OGi9XyMqgt3Wd/BoRcqTiOXrEu+RgJ\n-----END 
CERTIFICATE-----\n',''),rs=c(10407,-228603134,-1090082149,-107250400,1140708001,680843246,-434765929))"
> daemons()
$connections
[1] 1

$daemons
                                     i online instance assigned complete
wss://interactive-st-rstudio-1:34773 1      0        0        0        0

What I can see on the SLURM side is that the SLURM job has completed successfully with a runtime of zero seconds. If I take the output string and run it against srun, the mirai daemon comes online, i.e.

srun --mem 512 /opt/R/4.3.2/bin/Rscript -e "mirai::daemon(\"wss://interactive-st-rstudio-1:39613\",tls=c('-----BEGIN CERTIFICATE-----\nMIIFVDCCAzygAwIBAgIBATANBgkqhkiG9w0BAQsFADBDMSEwHwYDVQQDDBhpbnRl\ncmFjdGl2ZS1zdC1yc3R1ZGlvLTExETAPBgNVBAoMCE5hbm9uZXh0MQswCQYDVQQG\nEwJKUDAeFw0wMTAxMDEwMDAwMDBaFw0zMDEyMzEyMzU5NTlaMEMxITAfBgNVBAMM\nGGludGVyYWN0aXZlLXN0LXJzdHVkaW8tMTERMA8GA1UECgwITmFub25leHQxCzAJ\nBgNVBAYTAkpQMIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAkeXSYgHO\ntE7nNPpinWIfJ7u4uPR8SKraYYUkNjI0VFBwlVIz3cXFTiYZ5QDpar613zwKzjWh\nFj10oA9xP/c2pOikE8ODelOBAscvE6ldBv1FbDeMDH+nJbnSMPEyVXCQc84eGYZo\nx81d+t8meHgMQV1iaufp1fH92Xj00iL/PPp3dO37XJ+SPk2eTVp/f9kvo24ckNir\nPy0CQcNjqz5qrmSRDFulB35/gDHW+asJ/YhDDFyhMxLfXnBrtRjpIygme3YL4WP6\ntrq3GfQK7AyAkOE6b3p/cQ8A/vbvZ2cYQH1tox9/ILNIiN+0KoB/qirLT4sg0oAt\ni8s1Kf8Nb0Kpe/4xnHelpxEf5sKmu61YS87XTyZKMYyXusGwoowdmj5ro73l/Z9g\nZn+tiT6PXPtSgUwf6fJugBF9mVTKjBB038tPQCJ3nnKgAnooGZAn+YZ43e7nXCl5\nDPQzkpnutzamSes6BsPF34VoBWtX1K4NWxacaUbpy7K5cDHmbhOMNPDmFe60BXNH\nq95ZsMJupw0LA2qhlEu5gRaC1uD1NIYlINYvVcL9+fV2ExElN0B0VTsAJyIbEI2M\nHocaDShi/mlXnqEFqmgN4EElzdiiVdt/cCMN0FgcB3VQfCpH2kZEPkN4SPX5xmJd\nENRh00NaLf8cBdD9b8rQBPotTDv6ZnOtnpECAwEAAaNTMFEwDwYDVR0TBAgwBgEB\n/wIBADAdBgNVHQ4EFgQUTc4OuPP9aZ52gQNRzN2PN25E4v0wHwYDVR0jBBgwFoAU\nTc4OuPP9aZ52gQNRzN2PN25E4v0wDQYJKoZIhvcNAQELBQADggIBABUqsIzwQ7jB\nkLC9C1nS35aXN0Nq5ZqFUAI+yoo5M5aqMWYgCWpeQHGf09gFZj+C7jzanuxAHZ3R\nkYgh9WSGQo1DFIy6O4VUZw3evLQugYifzayU0XoTeGjpeG98SbfZ8ZeXHtjMQ3Op\nGF79wH386PnVwswB+faM74pu5KwnkFGq9EKnnKuLxlsF7+KjI160K6jGtU2XnVyP\nl/3PH/I/3WdPWmlMcH3LUIMT2pb5cuhlNpWro/oGPX+XDk/h1oMfGbSHg6d0rOjQ\nYqazx2k8tWPACWPSap5+BRB8gk6b+BeZ6+mdjgihXhAbx7Oz+QsCZSKgQHJtM55v\nomasn//ACC6WA9dSxfB1LdER8Lq3WQxn+zqdbFdmOAbdnl0JRgc5ZduupoHpKCgP\nqnzlnBpvZf7gnLp2ziwfx2AKQzM8NZOVEWwcM7DCsS/CTkpJ7KVbv+E8VGpO38YZ\npzH1B0LYuQbIFtcQgUEl7tx6p3IktycHFZlZ2YJjhTwOMuWh9jDS34EfHw+phaVX\nWLlWgdAnM08SHku5MvDO1fVNEuJ/l8jCyqtjf+aztEeqlErsNV2ceMSv4x7JQ2ef\n6Lq9SH4aJuW+ZO1spx7JhLDfPu8LkokAvC0Y+F72NILRvpwUq6nOxGxTk0i8z5Pg\nit3eJIL31bwERqXNs
OTq12ROk7Q/EiLd\n-----END CERTIFICATE-----\n',''),rs=c(10407,1180069229,607201610,-1547644349,-1101478232,-1244291959,294276790))"

works and also polls indefinitely, i.e. does not stop until you stop the mirai cluster.

shikokuchuo commented 5 months ago

Thanks for the detailed output. From the above I also spotted that I hadn't switched the quoting behaviour for the TLS certificates.

I'm about to push a fix for this entire issue. There are two different behaviours required for ssh and srun respectively, and they will be governed by an additional argument in remote_config(), with the defaults working in each case.

michaelmayer2 commented 2 weeks ago

I came back to this and figured out that an alternative way to run SLURM tasks would be to use

remote = remote_config(
  command = "sbatch",
  args = c("--mem 512", "-n 1", "--wrap", "."),
  rscript = file.path(R.home("bin"), "Rscript"),
  quote = TRUE
),

The benefit of using sbatch is that the job accounting information is much more accessible when compared to srun.
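A minimal base R sketch of why quote = TRUE matters for sbatch --wrap but not for srun: the "." element in args stands for the daemon launch command, and --wrap expects that whole command as one argument. The helper compose_launch() and the daemon URL are hypothetical and only approximate the idea; this is not mirai's implementation.

```r
# Hypothetical helper, not part of mirai: shows the "." placeholder idea.
compose_launch <- function(command, args, rscript, expr, quote = FALSE) {
  # The launch command itself: Rscript -e '<expression>'
  launch <- paste(rscript, "-e", shQuote(expr, type = "sh"))
  # With quote = TRUE the whole launch command becomes a single shell word.
  if (quote) launch <- shQuote(launch, type = "sh")
  # Substitute the "." placeholder with the launch command.
  args[args == "."] <- launch
  paste(c(command, args), collapse = " ")
}

expr <- 'mirai::daemon("tcp://host:5555")'  # placeholder URL

srun_cmd <- compose_launch("srun", c("--mem 512", "-n 1", "."),
                           "Rscript", expr)
sbatch_cmd <- compose_launch("sbatch", c("--mem 512", "-n 1", "--wrap", "."),
                             "Rscript", expr, quote = TRUE)

cat(srun_cmd, sbatch_cmd, sep = "\n")
```

In the srun line, Rscript, -e and the expression arrive as separate arguments, whereas in the sbatch line everything after --wrap arrives as one string for the batch script to run.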

shikokuchuo commented 2 weeks ago

Thanks Michael. Just to confirm my understanding, you're saying that using sbatch is preferable and this requires remote_config(quote = TRUE)? If that's the case then I'd like to make quote = TRUE the default as it seems it's also the case for most other commands.

michaelmayer2 commented 2 weeks ago

> Thanks Michael. Just to confirm my understanding, you're saying that using sbatch is preferable and this requires remote_config(quote = TRUE)? If that's the case then I'd like to make quote = TRUE the default as it seems it's also the case for most other commands.

Yes, sbatch feels very much preferable to me and requires remote_config(quote = TRUE).

shikokuchuo commented 2 weeks ago

I've realised it's too disruptive to change the default, and quote = FALSE is actually used in some other contexts, so I've just added a Slurm sbatch example to the remote_config() docs. Thanks for reporting back on this @michaelmayer2!