mschubert / clustermq

R package to send function calls as jobs on LSF, SGE, Slurm, PBS/Torque, or each via SSH
https://mschubert.github.io/clustermq/
Apache License 2.0

`Q` stalls due to Segmentation fault in job #80

Closed: nick-youngblut closed this issue 6 years ago

nick-youngblut commented 6 years ago

The Q function is stalling during the simple demo:

fx = function(x) x * 2
Q(fx, x=1:3, n_jobs=1)

The progress bar stays at 0%. The qsub job starts but then errors out with the following:

/var/spool/gridengine/execd/node514/job_scripts/1191974: line 8: ulimit: virtual memory: cannot modify limit: Operation not permitted
WARNING: ignoring environment value of R_HOME
/var/spool/gridengine/execd/node514/job_scripts/1191974: line 9: 63852 Segmentation fault      R --no-save --no-restore -e 'clustermq:::worker("tcp://rick.eb.local:7775")'

My job template file:

#!/bin/bash
#$ -N {{ job_name }}               # job name
#$ -l h_rt=0:50:0                  # job time
#$ -j y                            # combine stdout/error in one file
#$ -o {{ log_file | /dev/null }}   # output file
#$ -cwd                            # use pwd as work dir
#$ -V                              # use environment variable

#source activate {{ conda | base }}
ulimit -v $(( 1024 * {{ memory | 4096 }} ))
R --no-save --no-restore -e 'clustermq:::worker("{{ master }}”)'

SessionInfo:

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] clustermq_0.8.3

loaded via a namespace (and not attached):
[1] R6_2.2.2      tools_3.3.2   rzmq_0.9.3    infuser_0.2.8

I'm using R installed via conda. My conda env info is:

     active environment : r_install
    active env location : /ebio/abt3_projects/software/dev/miniconda3_dev/envs/r_install
            shell level : 1
       user config file : /ebio/abt3/nyoungblut/.condarc
 populated config files : /ebio/abt3_projects/software/dev/miniconda3_dev/.condarc
                          /ebio/abt3/nyoungblut/.condarc
          conda version : 4.5.0
    conda-build version : 3.8.0
         python version : 3.6.5.final.0
       base environment : /ebio/abt3_projects/software/dev/miniconda3_dev  (writable)
           channel URLs : https://conda.anaconda.org/bioconda/linux-64
                          https://conda.anaconda.org/bioconda/noarch
                          https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/free/linux-64
                          https://repo.anaconda.com/pkgs/free/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
                          https://repo.anaconda.com/pkgs/pro/linux-64
                          https://repo.anaconda.com/pkgs/pro/noarch
                          https://conda.anaconda.org/r/linux-64
                          https://conda.anaconda.org/r/noarch
                          https://conda.anaconda.org/qiime2/linux-64
                          https://conda.anaconda.org/qiime2/noarch
          package cache : /ebio/abt3_projects/software/dev/miniconda3_dev/pkgs
                          /ebio/abt3/nyoungblut/.conda/pkgs
       envs directories : /ebio/abt3_projects/software/dev/miniconda3_dev/envs
                          /ebio/abt3/nyoungblut/.conda/envs
               platform : linux-64
             user-agent : conda/4.5.0 requests/2.18.4 CPython/3.6.5 Linux/4.4.67 ubuntu/16.04 glibc/2.23
                UID:GID : 6354:350
             netrc file : None
           offline mode : False

I had to install clustermq in R with install.packages(), because there's no conda package for clustermq.

mschubert commented 6 years ago

Can you add the log file for the following call?

Q(fx, x=1:3, n_jobs=1, log_worker=TRUE)

In general, I don't see how clustermq can cause segfaults, because it does not contain compiled code. Are you sure rzmq was installed correctly and that the example in its README works? This is likely an issue between ZeroMQ/rzmq and conda.
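
A minimal self-contained check, loosely along the lines of the request/reply example in the rzmq README, might look like the sketch below (the functions are from the rzmq API; the port number is arbitrary). If this already segfaults, the problem is in rzmq/ZeroMQ rather than clustermq:

library(rzmq)

context <- init.context()

# reply socket bound locally
rep <- init.socket(context, "ZMQ_REP")
bind.socket(rep, "tcp://*:5555")

# request socket connecting to the same port
req <- init.socket(context, "ZMQ_REQ")
connect.socket(req, "tcp://localhost:5555")

send.socket(req, data = "ping")
print(receive.socket(rep))   # should print "ping"
send.socket(rep, data = "pong")
print(receive.socket(req))   # should print "pong"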

nick-youngblut commented 6 years ago

Sorry for not being clear in my original post. The log file is:

/var/spool/gridengine/execd/node514/job_scripts/1191974: line 8: ulimit: virtual memory: cannot modify limit: Operation not permitted
WARNING: ignoring environment value of R_HOME
/var/spool/gridengine/execd/node514/job_scripts/1191974: line 9: 63852 Segmentation fault      R --no-save --no-restore -e 'clustermq:::worker("tcp://rick.eb.local:7775")'

I haven't tried the rzmq example, but that could definitely be the problem.

mschubert commented 6 years ago

I meant the clustermq worker log file, not the job log file.

mschubert commented 6 years ago

@nick-youngblut Any news on the log file?

I'd expect to see at least the R startup messages, otherwise your R itself may be borked.

When I run this, I see in cmq6557.log:

R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
[...]

> clustermq:::worker("tcp://pg-node062:6557")
Master: tcp://pg-node062:6557
WORKER_UP to: tcp://pg-node062:6557
> DO_SETUP (0.000s wait)
[...]
nick-youngblut commented 6 years ago

Sorry about the slow reply. I've been working on getting batchtools to work and haven't had time to look further into this yet. I'm guessing the problem is rzmq, but I need to look into it more.

mschubert commented 6 years ago

Closing this due to inactivity. Please reopen if the log shows this is a clustermq problem.