mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

slurm: Socket timed out on send/recv operation #201

Open jasonserviss opened 6 years ago

jasonserviss commented 6 years ago

Hello! I recently posted an issue on the future.batchtools repo and was advised that it might be better addressed here. The full contents are included below:

Background and details:

I am currently developing a package which includes a function that uses future_lapply. I have a script that performs some testing (i.e. loads packages and data, plans the future, and calls the function in my package that utilises future_lapply). I run the script on the log-in node of the remote HPC as follows:

Rscript --vanilla my_script.R &> logs.txt &

Within my_script.R the future is planned with the following:

future::plan(
  future.batchtools::batchtools_slurm,
  template = "/crex/proj/snic2018-8-151/private/batchtools.slurm.tmpl",
  resources = list(
    account = "snic2018-8-151", partition = "core", ntasks = 1L,
    time = "24:00:00", jobname = "testingPoissonSorted",
    modules = "R_packages/3.5.0", R = "R/3.5.0", log.file = file.path(currPath, "logs/slurm.txt")
  )
)

And my batchtools template includes:

#!/bin/bash

#SBATCH --job-name=<%= resources$jobname %>
#SBATCH --time=<%= resources$time %>
#SBATCH --ntasks=<%= resources$ntasks %>
#SBATCH --account=<%= resources$account %>
#SBATCH --partition=<%= resources$partition %>
#SBATCH --output=<%= resources$log.file %>

source /etc/profile
module add <%= paste(resources$modules, resources$R) %>
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'

Error and "expected" response:

While testing the package on a Slurm-based system I get the following error (reported in logs.txt):

Error in OSError("Listing of jobs failed", res) :
  Listing of jobs failed (exit code 1);
cmd: 'squeue --user=$USER --states=R,S,CG --noheader --format=%i -r'
output:
slurm_load_jobs error: Socket timed out on send/recv operation
Calls: spSwarm ... unique -> <Anonymous> -> listJobs -> OSError -> stopf
Execution halted

It seems that this arises when the scheduler is "busy" and, I am guessing, the communication between batchtools and Slurm times out. Since the jobs have already started, they continue to run until completion, but because of the apparent "disconnection" between batchtools and Slurm, the output from the run (saved downstream in the script, after the call to future_lapply) is never generated. This is somewhat painful for those of us working with a limited number of core hours per month, as the hours get "spent" without any "result" being generated.

After a bit of looking around it would seem that the sysadmins may be able to help prevent this (bug report) but, in my humble opinion, it would also be ideal if a) the user could do something to increase the time that future/batchtools waits for a response from the scheduler, or b) the jobs were automatically cancelled if batchtools loses contact with the scheduler (potentially problematic, since contact is lost).

For the sake of completeness the output from batchtools is included below:

WARNING: ignoring environment value of R_HOME
[bt]: This is batchtools v0.9.10
[bt]: Starting calculation of 1 jobs
[bt]: Setting working directory to '/crex/proj/snic2018-8-151/private/CIMseq/testingPoissonMouse'
Loading required package: sp.scRNAseq
[bt]: Memory measurement disabled
[bt]: Starting job [batchtools job.id=1]
[bt]: Setting seed to 1
...

[bt]: Job terminated successfully [batchtools job.id=1]
[bt]: Calculation finished!

Disclaimer:

I wouldn't consider myself an experienced HPC/future/batchtools user, so this could potentially be "all my fault" and painfully obvious to more experienced users. If that is the case, hopefully submitting this issue will at least guide others of a similar experience level in the right direction in the future.

Kind Regards, Jason

mllg commented 6 years ago

Thanks for the detailed report. I'll include a configuration setting to deal with timeouts in the next release.

jasonserviss commented 6 years ago

Thanks for your quick response! That's great news! I will get in touch with the system admins in the meantime and see if they can help in any way until the next release.

jasonserviss commented 6 years ago

The system admins were quick to respond! I will include a quick summary of their response here just in case it might be helpful to you:

1) "We are running with a MessageTimeout of 100" 2) "I ran the squeue command and it was slow but finished without a timeout." 3) "We have however struggled with timeouts similar to yours before, and we know a bit about why it occurs. The Slurm master becomes overloaded with RPCs, mostly because of users polling the master with questions (squeue, scontrol, sacct etc) and very small, short jobs seems to increase it further. We do believe this can be configured or optimized away, and we will revisit this question by the end of the summer when everyone is back" 4) "It would not surprise me if your job works fine on some occasions and not at all on others."

nick-youngblut commented 6 years ago

It appears that I'm running into a similar problem. I'm running future_lapply() with resources set as:

# cluster resources
resources = list(h_rt = '24:00:00',
                 h_vmem = '24G',
                 threads = '1',
                 conda.env = "py3_physeq_eco",
                 conda.path = "/ebio/abt3_projects/software/miniconda3/bin")
plan(batchtools_sge, resources=resources)

I'm getting the following errors (sometimes):

Error in OSError("Listing of jobs failed", res): Listing of jobs failed (exit code 1);
cmd: 'qstat -u $USER -s p'
output:
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "ilm.eb.local": got send error
Traceback:

1. future_lapply(cooccur_res, FUN = function(x) {
 .     x %>% effect.sizes
 . }, future.packages = c("cooccur"))
2. values(fs)
3. values.list(fs)
4. resolve(y)
5. resolve.list(y)
6. resolved(obj)
7. resolved.BatchtoolsFuture(obj)
8. finished(x)
9. finished.BatchtoolsFuture(x)
10. status(future)
11. status.BatchtoolsFuture(future)
12. get_status(reg = reg, ids = jobid)
13. batchtools::getStatus(...)
14. getStatusTable(convertIds(reg, ids), reg = reg)
15. merge(filter(reg$status, ids), batch.ids, by = "batch.id", all.x = TRUE, 
  .     all.y = FALSE, sort = FALSE)
16. merge.data.table(filter(reg$status, ids), batch.ids, by = "batch.id", 
  .     all.x = TRUE, all.y = FALSE, sort = FALSE)
17. is.data.table(y)
18. getBatchIds(reg = reg)
19. chsetdiff(cf$listJobsQueued(reg), tab$batch.id)
20. cf$listJobsQueued(reg)
21. listJobs(reg, c("-u $USER", "-s p"))
22. OSError("Listing of jobs failed", res)
23. stopf("%s (exit code %i);\ncmd: '%s'\noutput:\n%s", msg, res$exit.code, 
  .     stri_flatten(c(res$sys.cmd, res$sys.args), collapse = " "), 
  .     stri_flatten(res$output, "\n"))

My job template is:

#!/bin/bash

## The name of the job, can be anything, simply used when displaying the list of running jobs
#$ -N <%= job.name %>

## Combining output/error messages into one file
#$ -j y

## Giving the name of the output log file
#$ -o <%= log.file %>

## One needs to tell the queue system to use the current directory as the working directory
## Or else the script may fail as it will execute in your top level home directory /home/username
#$ -cwd

## Use environment variables
#$ -V

## Number of threads
#$ -pe parallel <%= resources$threads %>

## Time
#$ -l h_rt=<%= resources$h_rt %>

## Memory
#$ -l h_vmem=<%= resources$h_vmem %>

export PATH=<%= resources$conda.path %>:$PATH
source activate <%= resources$conda.env %>

## Export value of DEBUGME environment var to slave
export DEBUGME=<%= Sys.getenv("DEBUGME") %>

<%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
<%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
<%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>

Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
exit 0

My default config is:

default.resources = list(h_rt = '00:59:00',
                         h_vmem = '4G',
                         threads = '1',
                         conda.env = "py3",
                         conda.path = "/ebio/abt3_projects/software/miniconda3/bin")
cluster.functions = makeClusterFunctionsSGE(template = "~/.batchtools.tmpl")
temp.dir = "/ebio/abt3_projects/temp_data/"

...and finally sessionInfo:

R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04 LTS

Matrix products: default
BLAS: /ebio/abt3_projects/software/miniconda3/envs/py3_physeq_eco/lib/R/lib/libRblas.so
LAPACK: /ebio/abt3_projects/software/miniconda3/envs/py3_physeq_eco/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2            future.apply_1.0.0      future.batchtools_0.7.1
 [4] future_1.8.1            cooccur_1.3             networkD3_0.4          
 [7] phyloseq_1.22.3         ggplot2_3.0.0           tidyr_0.7.1            
[10] dplyr_0.7.4            

loaded via a namespace (and not attached):
 [1] Biobase_2.38.0      jsonlite_1.5        splines_3.4.1      
 [4] foreach_1.4.3       assertthat_0.2.0    stats4_3.4.1       
 [7] base64url_1.3       progress_1.2.0      globals_0.12.1     
[10] backports_1.1.2     pillar_1.2.2        lattice_0.20-35    
[13] glue_1.2.0          uuid_0.1-2          digest_0.6.13      
[16] checkmate_1.8.5     XVector_0.18.0      colorspace_1.3-2   
[19] htmltools_0.3.6     Matrix_1.2-7.1      plyr_1.8.4         
[22] pkgconfig_2.0.1     listenv_0.7.0       zlibbioc_1.24.0    
[25] purrr_0.2.2         scales_0.5.0        brew_1.0-6         
[28] tibble_1.4.2        mgcv_1.8-17         gmp_0.5-13.2       
[31] IRanges_2.12.0      withr_2.1.1         repr_0.10          
[34] BiocGenerics_0.24.0 lazyeval_0.2.0      survival_2.40-1    
[37] magrittr_1.5        crayon_1.3.4        evaluate_0.10.1    
[40] doParallel_1.0.10   nlme_3.1-137        MASS_7.3-45        
[43] vegan_2.4-3         prettyunits_1.0.2   tools_3.4.1        
[46] data.table_1.11.4   hms_0.3             stringr_1.2.0      
[49] S4Vectors_0.16.0    munsell_0.4.3       cluster_2.0.6      
[52] Biostrings_2.46.0   ade4_1.7-8          compiler_3.4.1     
[55] rlang_0.2.0         rhdf5_2.22.0        grid_3.4.1         
[58] pbdZMQ_0.3-1        iterators_1.0.8     IRkernel_0.8.11    
[61] biomformat_1.6.0    rappdirs_0.3.1      htmlwidgets_1.0    
[64] igraph_1.2.1        gtable_0.2.0        codetools_0.2-15   
[67] multtest_2.34.0     reshape2_1.4.2      R6_2.2.2           
[70] bindr_0.1.1         permute_0.9-4       ape_5.0            
[73] stringi_1.1.6       parallel_3.4.1      IRdisplay_0.4.4    
[76] Rcpp_0.12.14        batchtools_0.9.8 

nick-youngblut commented 6 years ago

One more bit of info: I seem to only get this error when I'm running future_lapply() in multiple Jupyter notebooks at the same time (e.g., 3 future_lapply() calls, each running 100 jobs). Since I'm only getting the error in this situation, it suggests that I'm overloading the system, at least with the current batchtools settings.

HenrikBengtsson commented 6 years ago

The future_lapply() polls each future on a regular basis to check whether it is resolved or not - that's the resolved(<future>) call - and this will keep repeating until the future is resolved and the results can be collected by value(<future>).
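
Very roughly, that polling loop looks like the sketch below (slow_task() here is a hypothetical placeholder for whatever the future evaluates; the real loop inside future_lapply()/resolve() is more involved):

library(future)

## Hypothetical long-running task standing in for one chunk of work
slow_task <- function() { Sys.sleep(5); 42 }

f <- future(slow_task())

## Poll until the future is resolved, sleeping between status checks;
## with a batchtools backend each resolved() call queries the scheduler.
while (!resolved(f)) {
  Sys.sleep(getOption("future.wait.interval", 0.2))
}

res <- value(f)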

You can control for how long and how frequently futures are polled using options such as future.wait.interval (see ?future::future.options).

In your case, by increasing future.wait.interval you'll hit the queue server less frequently.

NOTE: While writing this reply, I realized that the default for future.wait.interval is different in the future.batchtools package, which uses an interval of 1.0 seconds (not the 0.2 seconds used by future) (I'll document this: https://github.com/HenrikBengtsson/future.batchtools/issues/28). So, if you set options(future.wait.interval = 2.0) you'll lower the amount of querying to the queue server to 50% of what you're doing right now, and so on.
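
For example, a minimal sketch of that adjustment, assuming it is placed near the top of the script that creates the futures (my_script.R in the original post):

## Poll the scheduler half as often as the future.batchtools default of 1.0 s
options(future.wait.interval = 2.0)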

Another comment, when using plan(batchtools_sge, resources = resources) the number of workers the future framework sees is infinite (nbrOfWorkers() = +Inf). This means that if you do future_lapply(1:1000, ...), there will 1,000 jobs submitted to the queue. If you use plan(batchtools_sge, resources = resources, workers = 20), then nbrOfWorkers() = 20. This will cause future_lapply(1:1000, ...) to launch 20 futures (=jobs) each processing 50 elements. This approach would also lower the total number of queries that the queue server gets.
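
As a hedged sketch of that chunked setup, reusing the resources list quoted above (the input length of 1000 is only for illustration):

library(future)
library(future.apply)
library(future.batchtools)

resources <- list(h_rt = "24:00:00", h_vmem = "24G", threads = "1",
                  conda.env = "py3_physeq_eco",
                  conda.path = "/ebio/abt3_projects/software/miniconda3/bin")

## Cap the number of concurrent futures: future_lapply() then chunks the input
## into one chunk per worker instead of submitting one job per element.
plan(batchtools_sge, resources = resources, workers = 20)
nbrOfWorkers()                    # 20 instead of +Inf

y <- future_lapply(1:1000, sqrt)  # 20 jobs, each processing ~50 elements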

nick-youngblut commented 6 years ago

Thanks for the quick reply! I'll give future.wait.interval a try.

mllg commented 6 years ago

It looks like the scheduler is simply overburdened in your setups. As @HenrikBengtsson already suggested, you should make use of chunking if you have many jobs (by setting the nbrOfWorkers option for future).

On the batchtools side it is unclear how to deal with this situation, as I need a list of running jobs for different operations. For some operations it would make sense to fall back to a list of "all" jobs; for others this may potentially lead to disaster. I've implemented the workaround of returning all candidate jobs which are potentially queued or running in the squeue_timeout branch, to experiment with it.

nick-youngblut commented 6 years ago

Thanks for implementing squeue_timeout! So far, using options(future.wait.interval = 2.0) works for running 200-300 jobs in parallel. I'll try playing around with squeue_timeout.

HenrikBengtsson commented 6 years ago

@mllg, not sure if you suggested that in your comment, and not sure if it makes a difference for all schedulers, but do you think the overall load on the scheduler would go down if one queried for multiple jobs at a time (e.g. a single "qstat" call covering several jobs) and recorded/cached the result internally, thereby lowering the total number of "qstat" calls performed?

I know it's a very vague question that's hard to answer, but I wanted to get the idea out there. I thought about this idea a couple of years ago and then forgot about it until this issue came up again.

mllg commented 6 years ago

@mllg, not sure if you suggested that in your comment, and not sure if it makes a difference for all schedulers, but do you think the overall load on the scheduler would go down if one queried for multiple jobs at a time (e.g. a single "qstat" call covering several jobs) and recorded/cached the result internally, thereby lowering the total number of "qstat" calls performed?

I don't use qstat, only one call to squeue --user=$USER [more filter options] to query all job states at once. To reduce the load, you can set a custom value in your configuration file to determine how often the scheduler is queried for running jobs:

sleep = 120 # once every 2 mins
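
For reference, a hedged sketch of how that could look in a user-level batchtools config file (~/.batchtools.conf.R), combining it with the SGE configuration quoted earlier in this thread; on Slurm, makeClusterFunctionsSlurm() would take the place of makeClusterFunctionsSGE():

cluster.functions = makeClusterFunctionsSGE(template = "~/.batchtools.tmpl")
temp.dir = "/ebio/abt3_projects/temp_data/"

## Query the scheduler for running/queued jobs at most once every 2 minutes
sleep = 120
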
jasonserviss commented 6 years ago

I will tentatively say that the future.wait.interval solution seems to be working for me as well. It is a bit hard to say with 100% certainty, since the system needs to be sufficiently busy for the problem to occur and I can't really know for sure whether it is. Despite this, I have been able to run several jobs without any problem now, so, fingers crossed, this would seem to be resolved.

jasonserviss commented 6 years ago

"As luck would have it", I ran into this today with options(future.wait.interval = 10.0) and workers = 100 with other settings as in the original post:

Error in OSError("Listing of jobs failed", res) :
  Listing of jobs failed (exit code 1);
cmd: 'squeue --user=$USER --states=PD --noheader --format=%i -r'
output:
slurm_load_jobs error: Socket timed out on send/recv operation
Calls: future_lapply ... chsetdiff -> <Anonymous> -> listJobs -> OSError -> stopf
Execution halted

I am willing to try to help diagnose the problem further if possible, so just let me know if there is anything else I can do.

damirpolat commented 4 years ago

I have the same "socket timed out on send/recv operation" error when using batchtools (without future.batchtools). I tried using the squeue_timeout branch and now get a different kind of error:

Error: Listing of jobs failed (exit code 1);
cmd: 'squeue --user=$USER --states=PD --noheader --format=%i -r'
output:
slurm_load_jobs error: Unable to contact slurm controller (connect failure)

Here's my slurm template:

#!/bin/bash
<%
# relative paths are not handled well by Slurm
log.file = fs::path_expand(log.file)
-%>

#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>
#SBATCH --time=<%= resources$walltime %>
#SBATCH --cpus-per-task=<%= resources$ncpus %>
#SBATCH --mem-per-cpu=<%= resources$memory %>
##SBATCH --mail-type=BEGIN,END,FAIL
##SBATCH --mail-user=email@university.edu
#SBATCH -A mallet
<%= if (!is.null(resources$partition)) sprintf(paste0("#SBATCH --partition='", resources$partition, "'")) %>

## Initialize work environment like
## source /etc/profile
## module add ...

## Export value of DEBUGME environment var to slave
export DEBUGME=<%= Sys.getenv("DEBUGME") %>

<%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
<%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
<%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>

## Run R:
## we merge R output with stdout from SLURM, which then gets logged via the --output option
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'

and my sessionInfo():

R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Storage

Matrix products: default
BLAS/LAPACK: /pfs/tsfs1/home/dpulatov/.conda/envs/parallelMap-batchtools-timeout/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8
[11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] batchtools_0.9.12      reshape2_1.4.4         data.table_1.12.8
 [4] llama_0.9.2            aslib_0.1              BBmisc_1.11
 [7] optparse_1.6.6         parallelMap_1.5.0.9000 dplyr_0.8.5
[10] plyr_1.8.6             mlr_2.17.1             ParamHelpers_1.14

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6      pillar_1.4.3      compiler_3.6.1    RWekajars_3.9.3-2
 [5] prettyunits_1.1.1 progress_1.2.2    tools_3.6.1       corrplot_0.84
 [9] digest_0.6.25     lifecycle_0.2.0   tibble_3.0.1      gtable_0.3.0
[13] checkmate_2.0.0   lattice_0.20-38   pkgconfig_2.0.3   rlang_0.4.5
[17] Matrix_1.2-17     fastmatch_1.1-0   yaml_2.2.1        parallel_3.6.1
[21] rJava_0.9-11      stringr_1.4.0     withr_2.2.0       rappdirs_0.3.1
[25] hms_0.5.3         vctrs_0.2.4       grid_3.6.1        tidyselect_1.0.0
[29] getopt_1.20.3     glue_1.4.0        R6_2.4.1          base64url_1.4
[33] survival_2.44-1.1 RWeka_0.4-42      ggplot2_3.3.0     purrr_0.3.4
[37] magrittr_1.5      backports_1.1.6   scales_1.1.0      ellipsis_0.3.0
[41] splines_3.6.1     assertthat_0.2.1  colorspace_1.4-1  brew_1.0-6
[45] stringi_1.4.6     munsell_0.5.0     crayon_1.3.4

My questions are:

(1) What is currently the best way of dealing with the "socket timed out on send/recv operation" error?

(2) What can I look into to resolve the "slurm_load_jobs error: Unable to contact slurm controller (connect failure)" issue when using the squeue_timeout branch?

I don't know if this belongs here because it's a different error, but I will open another issue if not.

belanasaikiran commented 1 year ago

Hi, I'm new to Slurm and have set up a cluster with one controller and two compute nodes.

However, when I try to run anything with srun, it says "Socket timed out on send/recv operation". Here's the log:

fpga@ubuntu2:~$ srun hostname
srun: error: slurm_receive_msgs: [[ubuntu4]:6818] failed: Socket timed out on send/recv operation
srun: error: Task launch for StepId=34.0 failed on node ubuntu4: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

Can you help me resolve this issue?

Thanks, Saikiran

belanasaikiran commented 1 year ago

I found that the munge keys were different between the controller and the compute nodes. The solution was to use the same key on all nodes and make sure the key's permissions are read-only for the munge user.