radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694
Enable multithreaded execution for PSU use case on Cheyenne #88

Closed mturilli closed 5 years ago

andre-merzky commented 5 years ago

I pushed some changes to the fix/cheyenne branch (in the mpirun_dplace launch method) that address core pinning for multithreaded tasks. Can you give this a try, please?

Weiming-Hu commented 5 years ago

Sorry for my delay.

I have updated the package, but I got the following error when I tried to run entk-version.

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/__init__.py", line 4, in <module>
    from radical.entk.pipeline.pipeline import Pipeline
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/pipeline/pipeline.py", line 1, in <module>
    import radical.utils as ru
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/__init__.py", line 14, in <module>
    from .plugin_manager import PluginManager
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/plugin_manager.py", line 14, in <module>
    from .logger import Logger
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/logger.py", line 45, in <module>
    from   .misc    import get_env_ns       as ru_get_env_ns
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/misc.py", line 13, in <module>
    from .ru_regex import ReString
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/ru_regex.py", line 7, in <module>
    import regex
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/__init__.py", line 1, in <module>
    from .regex import *
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/regex.py", line 391, in <module>
    import _regex_core
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex_core.py", line 21, in <module>
    import _regex
ImportError: /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex.so: undefined symbol: _intel_fast_memcpy

I also tried deleting my old virtual environment and creating a new one, but the same error persists.

I tried to run radical-stack, but also ended up with the same error ImportError: /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex.so: undefined symbol: _intel_fast_memcpy.

andre-merzky commented 5 years ago

This is surprising. Could it be that the Python version used when creating the virtualenv differs from the one used when running the above? For example, a different python module loaded, or a different compiler module loaded?
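
One way to compare the two situations is to run the same small report once right after creating the virtualenv and once right before running the failing command, then diff the results. The sketch below is only an illustration: the function name interpreter_report is made up for this example, and it uses nothing beyond the standard library.

```python
import platform
import sys
import sysconfig


def interpreter_report():
    """Collect facts that identify which Python build is actually running."""
    return {
        "executable": sys.executable,            # path of the running interpreter
        "version": platform.python_version(),    # version string of the interpreter
        "compiler": platform.python_compiler(),  # compiler the interpreter was built with
        "prefix": sys.prefix,                    # virtualenv root when one is active
        "cc": sysconfig.get_config_var("CC"),    # compiler recorded at build time, may be None
    }


if __name__ == "__main__":
    for key, value in sorted(interpreter_report().items()):
        print("%-10s : %s" % (key, value))
```

If "executable", "compiler" or "cc" differ between the install session and the run session, the two sessions are not using the same Python build.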

Weiming-Hu commented 5 years ago

But I have purged my modules and recreated a new virtual environment.

This is the script I used to install packages.

Thank you

andre-merzky commented 5 years ago

Do you also do the module purge && module load python/2.7.15 when you use that virtualenv and run the code?

Weiming-Hu commented 5 years ago

No. After I activated the virtualenv, I didn't purge the modules and load python again. I just tried purging and loading again after activating the virtualenv, and then Python cannot find any packages, including the entk packages.

andre-merzky commented 5 years ago

This is also unexpected. I would guess that the installation possibly did not go into that VE. Can you check whether it ended up somewhere under $HOME/.local/lib/?
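
A rough way to answer that question from inside Python is to look at the file a module was imported from and compare it with the virtualenv prefix and the per-user site directory. This is a sketch under assumptions: classify_location is a hypothetical helper written for this example, and 'json' is only a stand-in for whichever package is under suspicion.

```python
import os
import site
import sys


def classify_location(module_file, venv_prefix, user_site):
    """Say whether a module file lives in the virtualenv, the user site, or elsewhere."""
    path = os.path.realpath(module_file)
    if user_site and path.startswith(os.path.realpath(user_site) + os.sep):
        return "user-site"
    if path.startswith(os.path.realpath(venv_prefix) + os.sep):
        return "virtualenv"
    return "other"


if __name__ == "__main__":
    import json  # stand-in module; replace with the package being investigated

    # Older virtualenv releases ship their own site.py, which may lack this helper.
    get_user_site = getattr(site, "getusersitepackages", None)
    user_site = get_user_site() if get_user_site else None
    print(classify_location(json.__file__, sys.prefix, user_site))
```

A result of "user-site" for a package that was supposedly installed into the VE would support the guess above.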

Either way though, I would recommend starting from scratch and including the entk installation and the verification in your deployment script:

# prepare VE
module purge && module load python/2.7.15
virtualenv venv
source venv/bin/activate

# Install entk and dependencies
pip install radical.entk pyyaml netcdf4

# replace RP version
git clone https://github.com/radical-cybertools/radical.pilot.git
cd radical.pilot
git checkout fix/cheyenne
pip uninstall -y radical.pilot
pip install .

# verify installation
python -V
radical-stack

Weiming-Hu commented 5 years ago

Looks like I don't have the folder $HOME/.local/lib/.

I started a new session and used the script to create a new virtualenv, but I still get the errors.

(venv) wuh20@cheyenne2:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node> python -V
Python 2.7.15
(venv) wuh20@cheyenne2:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node> radical-stack 
Traceback (most recent call last):
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/bin/radical-stack", line 3, in <module>
    import radical.utils as ru
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/__init__.py", line 14, in <module>
    from .plugin_manager import PluginManager
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/plugin_manager.py", line 14, in <module>
    from .logger import Logger
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/logger.py", line 45, in <module>
    from   .misc    import get_env_ns       as ru_get_env_ns
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/misc.py", line 13, in <module>
    from .ru_regex import ReString
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/ru_regex.py", line 7, in <module>
    import regex
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/__init__.py", line 1, in <module>
    from .regex import *
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/regex.py", line 391, in <module>
    import _regex_core
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex_core.py", line 21, in <module>
    import _regex
ImportError: /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex.so: undefined symbol: _intel_fast_memcpy

Again, after the installation, I still don't have the folder $HOME/.local/lib/.

Thank you.

andre-merzky commented 5 years ago

I'll be back on my computer in about 30 min. Can you send me a module list you get after a fresh login, and possibly also after the module load in the install script you use? I will try to reproduce this.

Weiming-Hu commented 5 years ago

Thank you very much.

wuh20@sapphire:~$ cheyenne 
Last login: Wed May  1 09:56:40 2019 from 128.118.54.223
******************************************************************************
*                 Welcome to Cheyenne - April 30, 2019
******************************************************************************
                 Today in the Daily Bulletin (dailyb.cisl.ucar.edu)

        - Reminder: Cheyenne compute nodes down May 6-11 during NWSC electrical repairs
        - Alternative HPC login nodes now available
    - Default Cheyenne and Casper libraries will be updated May 6
        - Tutorial for new Cheyenne and Casper users
        - Best practice: Use scratch space for temporary files

Quick Start:          www2.cisl.ucar.edu/resources/cheyenne/quick-start-cheyenne
User environment:     www2.cisl.ucar.edu/resources/cheyenne/user-environment
Key module commands:  module list, module avail, module spider, module help
CISL Help:            cislhelp@ucar.edu -- 303-497-2400
------------------------------------------------------------------------------------
Restoring modules from user's default, for system: "ch"
wuh20@cheyenne2:~> module list

Currently Loaded Modules:
  1) ncarenv/1.2   2) intel/17.0.1   3) ncarcompilers/0.4.1   4) mpt/2.19   5) netcdf/4.6.1

wuh20@cheyenne2:~> module purge && module load python/2.7.15                                  
wuh20@cheyenne2:~> module list

Currently Loaded Modules:
  1) python/2.7.15

andre-merzky commented 5 years ago

Thank you!

andre-merzky commented 5 years ago

There is something funny going on with your account, I think. This is what your procedure looks like for me (I shortened the unsuspicious output):

$ ssh cheyenne
Token_Response:
Last login: Mon May  7 03:25:26 2018 from 138.201.86.166
...
Resetting modules to system default
cheyenne4  amerzky  ~   $ module list

Currently Loaded Modules:
  1) ncarenv/1.2   2) intel/17.0.1   3) ncarcompilers/0.4.1   4) mpt/2.19   5) netcdf/4.6.1

cheyenne4  amerzky  ~   $ module purge && module load python/2.7.15
cheyenne4  amerzky  ~   $ module list

Currently Loaded Modules:
  1) python/2.7.15

cheyenne4  amerzky  ~   $ virtualenv ve > /dev/null
cheyenne4  amerzky  ~   $ source ve/bin/activate
(ve)  cheyenne4  amerzky  ~   $ pip install radical.entk > /dev/null
...
(ve)  cheyenne4  amerzky  ~   $ entk-version
0.7.16

(ve)  cheyenne4  amerzky  ~   $ radical-stack

  python               : 2.7.15
  pythonpath           :
  virtualenv           : /gpfs/u/home/amerzky/ve

  radical.entk         : 0.7.16
  radical.pilot        : 0.60.1
  radical.saga         : 0.60.0
  radical.utils        : 0.60.1

The installation of the RP branch also does not make a difference:

(ve)  cheyenne4  amerzky  ~   $ git clone https://github.com/radical-cybertools/radical.pilot.git
Cloning into 'radical.pilot'...
...
(ve)  cheyenne4  amerzky  ~   $ cd radical.pilot
(ve)  cheyenne4  amerzky  ~/radical.pilot  [devel] $ git checkout fix/cheyenne
Branch fix/cheyenne set up to track remote branch fix/cheyenne from origin.
Switched to a new branch 'fix/cheyenne'

(ve)  cheyenne4  amerzky  ~/radical.pilot  [fix/cheyenne] $ pip install . --upgrade
...
Successfully installed radical.pilot-0.60.1

(ve)  cheyenne4  amerzky  ~/radical.pilot  [fix/cheyenne] $ radical-stack

  python               : 2.7.15
  pythonpath           :
  virtualenv           : /gpfs/u/home/amerzky/ve

  radical.entk         : 0.7.16
  radical.pilot        : 0.60.1-v0.60.1-7-g25bcc08@fix-cheyenne
  radical.saga         : 0.60.0
  radical.utils        : 0.60.1

Can you please send the result of:

$ ldd /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex.so

Can you also send me the output of the following command, as shown below?

(ve)  cheyenne4  amerzky  $ python -v -c 'import regex' 2>&1 | grep -C 3 regex
Python 2.7.15 (default, Jan 11 2019, 15:22:07)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
import regex # directory /gpfs/u/home/amerzky/ve/lib/python2.7/site-packages/regex
# /gpfs/u/home/amerzky/ve/lib/python2.7/site-packages/regex/__init__.pyc matches /gpfs/u/home/amerzky/ve/lib/python2.7/site-packages/regex/__init__.py
import regex # precompiled from /gpfs/u/home/amerzky/ve/lib/python2.7/site-packages/regex/__init__.pyc
# /gpfs/u/home/amerzky/ve/lib/python2.7/site-packages/regex/regex.pyc matches /gpfs/u/home/amerzky/ve/lib/python2.7/site-packages/regex/regex.py
import regex.regex # precompiled from /gpfs/u/home/amerzky/ve/lib/python2.7/site-packages/regex/regex.pyc
# /gpfs/u/home/amerzky/ve/lib/python2.7/site-packages/regex/_regex_core.pyc matches /gpfs/u/home/amerzky/ve/lib/python2.7/site-packages/regex/_regex_core.py
import regex._regex_core # precompiled from /gpfs/u/home/amerzky/ve/lib/python2.7/site-packages/regex/_regex_core.pyc
# /glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/string.pyc matches /glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/string.py
import string # precompiled from /glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/string.pyc
# /gpfs/u/home/amerzky/ve/lib/python2.7/re.pyc matches /gpfs/u/home/amerzky/ve/lib/python2.7/re.py
--
dlopen("/gpfs/u/home/amerzky/ve/lib/python2.7/lib-dynload/_heapq.so", 2);
import _heapq # dynamically loaded from /gpfs/u/home/amerzky/ve/lib/python2.7/lib-dynload/_heapq.so
import thread # builtin
dlopen("/gpfs/u/home/amerzky/ve/lib/python2.7/site-packages/regex/_regex.so", 2);
import regex._regex # dynamically loaded from /gpfs/u/home/amerzky/ve/lib/python2.7/site-packages/regex/_regex.so
# /glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/threading.pyc matches /glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/threading.py
import threading # precompiled from /glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/threading.pyc
dlopen("/gpfs/u/home/amerzky/ve/lib/python2.7/lib-dynload/time.so", 2);
--
...

Weiming-Hu commented 5 years ago

Here is the output.

(venv) wuh20@cheyenne6:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node> ldd /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex.so
    linux-vdso.so.1 (0x00007fffedb05000)
    libm.so.6 => /glade/u/apps/ch/os/lib64/libm.so.6 (0x00007fffed554000)
    libdl.so.2 => /glade/u/apps/ch/os/lib64/libdl.so.2 (0x00007fffed350000)
    librt.so.1 => /glade/u/apps/ch/os/lib64/librt.so.1 (0x00007fffed147000)
    libpthread.so.0 => /glade/u/apps/ch/os/lib64/libpthread.so.0 (0x00007fffecf2a000)
    libc.so.6 => /glade/u/apps/ch/os/lib64/libc.so.6 (0x00007fffecb82000)
    /gpfs/u/home/wuh20/.linuxbrew/Cellar/glibc/2.23/lib64/ld-linux-x86-64.so.2 (0x0000555555554000)
(venv) wuh20@cheyenne6:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node> python -v -c 'import regex' 2>&1 | grep -C 3 regex
Python 2.7.15 (default, Jan 11 2019, 15:22:07) 
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
import regex # directory /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex
# /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/__init__.pyc matches /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/__init__.py
import regex # precompiled from /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/__init__.pyc
# /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/regex.pyc matches /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/regex.py
import regex.regex # precompiled from /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/regex.pyc
# /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex_core.pyc matches /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex_core.py
import regex._regex_core # precompiled from /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex_core.pyc
# /glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/string.pyc matches /glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/string.py
import string # precompiled from /glade/u/apps/ch/opt/python/2.7.15/gnu/7.3.0/lib/python2.7/string.pyc
# /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/re.pyc matches /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/re.py
--
dlopen("/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/lib-dynload/_heapq.so", 2);
import _heapq # dynamically loaded from /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/lib-dynload/_heapq.so
import thread # builtin
dlopen("/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex.so", 2);
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/__init__.py", line 1, in <module>
    from .regex import *
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/regex.py", line 391, in <module>
    import _regex_core
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex_core.py", line 21, in <module>
    import _regex
ImportError: /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex.so: undefined symbol: _intel_fast_memcpy
# clear __builtin__._
# clear sys.path
# clear sys.argv

andre-merzky commented 5 years ago

Any idea what this is:

/gpfs/u/home/wuh20/.linuxbrew/Cellar/glibc/2.23/lib64/ld-linux-x86-64.so.2 (0x0000555555554000)

This is likely the culprit. My output of the ldd is:

(ve)  cheyenne4  amerzky  ~   $ ldd ve/lib/python2.7/site-packages/regex/_regex.so
        linux-vdso.so.1 (0x00007fffedb05000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fffed63d000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fffed294000)
        /lib64/ld-linux-x86-64.so.2 (0x0000555555554000)

The glibc used in your case likely has not been compiled with the default intel compile chain.
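
The comparison of the two ldd listings can be done mechanically. The sketch below parses ldd-style text and reports every resolved path that falls outside a set of trusted directory prefixes. The helper names and the choice of trusted prefixes are assumptions made for this example; the sample lines are copied from the output posted above.

```python
SAMPLE = """
    linux-vdso.so.1 (0x00007fffedb05000)
    libc.so.6 => /glade/u/apps/ch/os/lib64/libc.so.6 (0x00007fffecb82000)
    /gpfs/u/home/wuh20/.linuxbrew/Cellar/glibc/2.23/lib64/ld-linux-x86-64.so.2 (0x0000555555554000)
"""


def parse_ldd(text):
    """Return (name, resolved_path) pairs from ldd output; path is None when absent."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Drop the trailing load address, e.g. "(0x00007fffed554000)".
        if line.endswith(")") and "(" in line:
            line = line[:line.rfind("(")].strip()
        if "=>" in line:
            name, target = [part.strip() for part in line.split("=>", 1)]
            # Unresolved entries ("not found") carry no usable path.
            entries.append((name, target if target.startswith("/") else None))
        elif line.startswith("/"):
            # The dynamic loader is printed as a bare absolute path.
            entries.append((line.rsplit("/", 1)[-1], line))
        else:
            # Virtual objects such as linux-vdso.so.1 have no file on disk.
            entries.append((line, None))
    return entries


def outside_prefixes(entries, trusted):
    """List resolved paths that do not start with any trusted prefix."""
    return [path for _, path in entries
            if path and not path.startswith(tuple(trusted))]


if __name__ == "__main__":
    # Which prefixes count as trusted is a site-specific assumption.
    for path in outside_prefixes(parse_ldd(SAMPLE), ("/lib64/", "/glade/u/apps/")):
        print(path)
```

Anything reported from a home directory, as with the linuxbrew loader here, deserves a closer look.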

If that glibc is needed by your workload, you could try:

module purge
module load gcc
module load python/2.7.15

and see if the deployment is more forgiving to that libc?

andre-merzky commented 5 years ago

libc -> ld-linux ...

Weiming-Hu commented 5 years ago

I installed linuxbrew a while ago. I guess it built some libraries and packages that mess with my environment. I have just removed it. The ldd result changed, but I still have the problem.

(venv) wuh20@cheyenne4:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/radical.pilot> ldd /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex.so
    linux-vdso.so.1 (0x00007fffedb05000)
    libm.so.6 => /glade/u/apps/ch/os/lib64/libm.so.6 (0x00007fffed554000)
    libdl.so.2 => /glade/u/apps/ch/os/lib64/libdl.so.2 (0x00007fffed350000)
    librt.so.1 => /glade/u/apps/ch/os/lib64/librt.so.1 (0x00007fffed148000)
    libpthread.so.0 => /glade/u/apps/ch/os/lib64/libpthread.so.0 (0x00007fffecf2a000)
    libc.so.6 => /glade/u/apps/ch/os/lib64/libc.so.6 (0x00007fffecb82000)
    /lib64/ld-linux-x86-64.so.2 (0x0000555555554000)
(venv) wuh20@cheyenne4:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/radical.pilot> entk-version 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/__init__.py", line 4, in <module>
    from radical.entk.pipeline.pipeline import Pipeline
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/entk/pipeline/pipeline.py", line 1, in <module>
    import radical.utils as ru
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/__init__.py", line 14, in <module>
    from .plugin_manager import PluginManager
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/plugin_manager.py", line 14, in <module>
    from .logger import Logger
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/logger.py", line 45, in <module>
    from   .misc    import get_env_ns       as ru_get_env_ns
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/misc.py", line 13, in <module>
    from .ru_regex import ReString
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/ru_regex.py", line 7, in <module>
    import regex
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/__init__.py", line 1, in <module>
    from .regex import *
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/regex.py", line 391, in <module>
    import _regex_core
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex_core.py", line 21, in <module>
    import _regex
ImportError: /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex.so: undefined symbol: _intel_fast_memcpy
(venv) wuh20@cheyenne4:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/radical.pilot> 

I plan to work with the Cheyenne sysadmin to fix my environment setup first. Hopefully that will address some of these issues.

andre-merzky commented 5 years ago

Could you send me the setting of $LD_LIBRARY_PATH just before you run the radical-stack command?
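
Since the failing symbol is _intel_fast_memcpy, the entries of interest are those that point into an Intel compiler installation. A minimal sketch for inspecting the variable, with helper names invented for this example:

```python
import os


def split_library_path(value):
    """Split an LD_LIBRARY_PATH-style string into its entries, keeping order.

    Empty entries are kept as "" because the dynamic loader treats a
    zero-length entry as the current working directory.
    """
    return value.split(":") if value else []


def entries_matching(value, needle):
    """Return the entries that contain `needle`, compared case-insensitively."""
    needle = needle.lower()
    return [entry for entry in split_library_path(value) if needle in entry.lower()]


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    for entry in entries_matching(current, "intel"):
        print(entry)
```

An empty result only says that no path entry mentions Intel; it does not rule out an extension module that was linked against Intel libraries at build time.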

Weiming-Hu commented 5 years ago

(venv) wuh20@cheyenne3:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node> echo $LD_LIBRARY_PATH
/glade/u/apps/ch/opt/mpt_fmods/2.19/intel/17.0.1:/glade/u/apps/ch/opt/mpt/2.19/opt/hpe/hpc/mpt/mpt-2.19/lib:/glade/u/apps/opt/intel/2017u1/compilers_and_libraries/linux/lib/intel64_lin:/ncar/opt/slurm/latest/lib::/glade/u/apps/ch/os/usr/lib64:/glade/u/apps/ch/os/usr/lib:/glade/u/apps/ch/os/lib64:/glade/u/apps/ch/os/lib

andre-merzky commented 5 years ago

:-P

(ve)  cheyenne4  amerzky  ~   $ echo $LD_LIBRARY_PATH
/usr/local/lib

(ve)  cheyenne4  amerzky  ~   $ radical-stack

  python               : 2.7.15
  pythonpath           :
  virtualenv           : /gpfs/u/home/amerzky/ve

  radical.entk         : 0.7.16
  radical.pilot        : 0.60.1-v0.60.1-7-g25bcc08@fix-cheyenne
  radical.saga         : 0.60.0
  radical.utils        : 0.60.1

Weiming-Hu commented 5 years ago

I have resolved the default module issue. My default file in .lmod.d has been mysteriously changed. I have reverted it to the correct default. This should take care of it.

wuh20@sapphire:~$ cheyenne 
Last login: Wed May  1 15:32:33 2019 from 128.118.54.223
Resetting modules to system default
wuh20@cheyenne4:~> module list

Currently Loaded Modules:
  1) ncarenv/1.2   2) intel/17.0.1   3) ncarcompilers/0.4.1   4) mpt/2.19   5) netcdf/4.6.1

But when I repeat the process of creating the virtual environment, I still get the error and nothing seems to change. My LD_LIBRARY_PATH is still different from yours. Even if I change it to yours, I get the error anyway.

(venv) wuh20@cheyenne4:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node> export LD_LIBRARY_PATH=/usr/local/lib
(venv) wuh20@cheyenne4:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node> echo $LD_LIBRARY_PATH
/usr/local/lib
(venv) wuh20@cheyenne4:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node> radical-stack
Traceback (most recent call last):
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/bin/radical-stack", line 3, in <module>
    import radical.utils as ru
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/__init__.py", line 14, in <module>
    from .plugin_manager import PluginManager
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/plugin_manager.py", line 14, in <module>
    from .logger import Logger
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/logger.py", line 45, in <module>
    from   .misc    import get_env_ns       as ru_get_env_ns
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/misc.py", line 13, in <module>
    from .ru_regex import ReString
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/radical/utils/ru_regex.py", line 7, in <module>
    import regex
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/__init__.py", line 1, in <module>
    from .regex import *
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/regex.py", line 391, in <module>
    import _regex_core
  File "/gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex_core.py", line 21, in <module>
    import _regex
ImportError: /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv/lib/python2.7/site-packages/regex/_regex.so: undefined symbol: _intel_fast_memcpy

I'm kind of running out of ideas here....

andre-merzky commented 5 years ago

This is fascinating... Would you mind posting the following files, if you have them: .bashrc, .profile, .login? Also, what is in your .lmod.d?

Weiming-Hu commented 5 years ago

Here it is.

wuh20@cheyenne4:~> cat .bash_profile 
# CAnEn
export PATH=/glade/u/home/wuh20/github/AnalogsEnsemble/output/bin:$PATH
export PATH=/glade/u/home/wuh20/packages/grib2/wgrib2:$PATH
export PATH=/glade/u/home/wuh20/github/AnalogsEnsemble/dependency/install/bin:$PATH

export LANG=en_US

# CMake
alias cmake=/glade/u/home/wuh20/packages/cmake-3.10.1/bin/cmake

export TMPDIR=/glade/scratch/wuh20
wuh20@cheyenne4:~> cat .profile 
# eval $(/glade/u/home/wuh20/.linuxbrew/bin/brew shellenv)
wuh20@cheyenne4:~> cat .login
cat: .login: No such file or directory
wuh20@cheyenne4:~> ls .lmod.d/
jasper.ch  R.ch

andre-merzky commented 5 years ago

Hmm, your jasper.ch still refers to the gnu compiler. What happens if you retry after 'mv .lmod.d .lmod.d.bak'?

Weiming-Hu commented 5 years ago

OK. I think I have fixed this by brute force. I moved all the ambiguous hidden files and folders in my home directory somewhere else, since I was almost sure something there was messing up my environment, and then tried again. It is working now. I guess the more important thing was removing the .local folder, where all the Python packages are. The virtual environment always reused the regex package from this local folder since it had been built before. With the folder removed, the virtual environment has to build it again, which resolved the linking issue, though this is mere conjecture on my part.
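
To test a conjecture like this without moving files around, one could search for compiled extension modules that mention the Intel symbol. The sketch below does a plain byte search, which is only a heuristic: a hit means the name occurs somewhere in the file, not that the symbol is necessarily unresolved. The function name and the directory scanned in the usage part are assumptions for illustration.

```python
import os


def files_mentioning_symbol(root, symbol, suffix=".so"):
    """Return sorted paths under `root` whose raw bytes contain `symbol`."""
    needle = symbol.encode("ascii")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(suffix):
                continue
            path = os.path.join(dirpath, filename)
            try:
                with open(path, "rb") as handle:
                    if needle in handle.read():
                        hits.append(path)
            except (IOError, OSError):
                # Unreadable files are skipped rather than aborting the scan.
                continue
    return sorted(hits)


if __name__ == "__main__":
    # The directory to scan is a guess; point it at wherever stale builds may live.
    for path in files_mentioning_symbol(os.path.expanduser("~/.local"), "_intel_fast_memcpy"):
        print(path)
```

Running it against both the suspect directory and the virtualenv would show which copy of _regex.so carries the Intel reference.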

(venv) wuh20@cheyenne3:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node> ldd venv/lib/python2.7/site-packages/regex/_regex.so 
       linux-vdso.so.1 (0x00007fffedb05000)
       libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fffed63d000)
       libc.so.6 => /lib64/libc.so.6 (0x00007fffed294000)
       /lib64/ld-linux-x86-64.so.2 (0x0000555555554000)
(venv) wuh20@cheyenne3:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node> entk-version 
0.7.16
(venv) wuh20@cheyenne3:~/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node> radical-stack 

  python               : 2.7.15
  pythonpath           : 
  virtualenv           : /gpfs/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/venv

  radical.entk         : 0.7.16
  radical.pilot        : 0.60.1-v0.60.1-7-g25bcc08@fix-cheyenne
  radical.saga         : 0.60.0
  radical.utils        : 0.60.1

Thank you so much for your help! I'm going to give this new release a try shortly.

andre-merzky commented 5 years ago

Oh, I am sorry we did not find this earlier! Can you remember where the regex lived under ~/.local, for the next time a user stumbles over this? You did not have a ~/.local/lib where I expected it, IIRC...

But yeah, the interference between ~/.local and pip / virtualenv is annoying; I have stumbled over that a couple of times. Nowadays I do: rm -rf ~/.local; ln -s /dev/null ~/.local which I consider a very polite way to tell pip to fuck off... ;-)
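
A less drastic option, not discussed in this thread, is CPython's PYTHONNOUSERSITE environment variable (equivalent to the -s command line flag), which stops the interpreter from adding the per-user site-packages directory to sys.path. It only affects what Python imports; whether it would have helped with the build reuse described above is not known. The sketch below, with a helper name invented for this example, starts a child interpreter and reports whether the flag took effect.

```python
import os
import subprocess
import sys


def no_user_site_flag(extra_env=None):
    """Start a child interpreter and return its sys.flags.no_user_site value."""
    env = dict(os.environ)
    env.update(extra_env or {})
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.flags.no_user_site)"],
        env=env,
    )
    return int(out.decode("ascii").strip())


if __name__ == "__main__":
    print("inherited environment :", no_user_site_flag())
    print("PYTHONNOUSERSITE=1    :", no_user_site_flag({"PYTHONNOUSERSITE": "1"}))
```

Exporting the variable in the deployment script, before activating the virtualenv, would apply it to every Python process started from that shell.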

Weiming-Hu commented 5 years ago

The EnTK is running now. But it looks like the process hangs for some reason.

EnTK session: re.session.cheyenne2.wuh20.018031.0001
Creating AppManager                                                           ok
Validating and assigning resource manager                                     ok
Creating analog generation task task-anen-gen-00000
Adding task 1: task-anen-gen-00000
Creating analog generation task task-anen-gen-00001
Adding task 2: task-anen-gen-00001
Creating analog generation task task-anen-gen-00002
Adding task 3: task-anen-gen-00002
Creating analog generation task task-anen-gen-00003
Adding task 4: task-anen-gen-00003
Creating analog generation task task-anen-gen-00004
Adding task 5: task-anen-gen-00004
Creating analog generation task task-anen-gen-00005
Adding task 6: task-anen-gen-00005
Creating analog generation task task-anen-gen-00006
Adding task 7: task-anen-gen-00006
Creating analog generation task task-anen-gen-00007
Adding task 8: task-anen-gen-00007
Creating analog generation task task-anen-gen-00008
Adding task 9: task-anen-gen-00008
Creating analog generation task task-anen-gen-00009
Adding task 10: task-anen-gen-00009
Creating analog generation task task-anen-gen-00010
Adding task 11: task-anen-gen-00010
Creating analog generation task task-anen-gen-00011
Adding task 12: task-anen-gen-00011
Creating analog generation task task-anen-gen-00012
Adding task 13: task-anen-gen-00012
Creating analog generation task task-anen-gen-00013
Adding task 14: task-anen-gen-00013
Creating analog generation task task-anen-gen-00014
Adding task 15: task-anen-gen-00014
Creating analog generation task task-anen-gen-00015
Adding task 16: task-anen-gen-00015
Creating analog generation task task-anen-gen-00016
Adding task 17: task-anen-gen-00016
Creating analog generation task task-anen-gen-00017
Adding task 18: task-anen-gen-00017
Creating analog generation task task-anen-gen-00018
Adding task 19: task-anen-gen-00018
Creating analog generation task task-anen-gen-00019
Adding task 20: task-anen-gen-00019
Creating analog generation task task-anen-gen-00020
Adding task 21: task-anen-gen-00020
Creating analog generation task task-anen-gen-00021
Adding task 22: task-anen-gen-00021
Creating analog generation task task-anen-gen-00022
Adding task 23: task-anen-gen-00022
Creating analog generation task task-anen-gen-00023
Adding task 24: task-anen-gen-00023
Creating analog generation task task-anen-gen-00024
Adding task 25: task-anen-gen-00024
Creating analog generation task task-anen-gen-00025
Adding task 26: task-anen-gen-00025
Creating analog generation task task-anen-gen-00026
Adding task 27: task-anen-gen-00026
Creating analog generation task task-anen-gen-00027
Adding task 28: task-anen-gen-00027
Creating analog generation task task-anen-gen-00028
Adding task 29: task-anen-gen-00028
Creating analog generation task task-anen-gen-00029
Adding task 30: task-anen-gen-00029
Adding stage stage-anen-gen.
Setting up RabbitMQ system                                                    ok
                                                                              ok
create pilot manager                                                          ok
submit 1 pilot(s)
        [ncar.cheyenne:72]
                                                                              ok

I have waited at this point for a long time, but it does not proceed any further.

andre-merzky commented 5 years ago

Thanks for the feedback! @Weiming-Hu, can you share the script you are running, and if possible also give access to the client and the pilot sandbox on Cheyenne?

Weiming-Hu commented 5 years ago

Of course.

The script I'm running is /glade/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node/runme.py.

You should also have access to this folder /glade/u/home/wuh20/github/hpc-workflows/scripts/application_AnEn/year_2/multi-node to access some log files.

Pilot sandbox is generated at /glade/scratch/wuh20/radical.pilot.sandbox.

Thank you.

Weiming-Hu commented 5 years ago

Hi @andre-merzky, sorry for not following up on this earlier. The process hangs for me after I invoke EnTK. Maybe it would be more convenient for both of us to have a chat sometime?

andre-merzky commented 5 years ago

Hey @Weiming-Hu: yeah, sorry also from my end for not following up earlier. I am in the process of reproducing this problem. I had to recreate the pilot virtualenv on Cheyenne, and I hope to have some more info by the time of the call.

andre-merzky commented 5 years ago

Good news: the pilot now gets submitted and runs again.

Not so good news: the client sees a segfault in Python because we hit some stack limit. This may be one of the problems we see on other machines, where the default stack size for Python threads is very large. I will have to look into this to confirm. If this is the problem, we can mitigate it. If it is a new system limit we are hitting, it is less certain to be easily fixable.
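
As an illustration of the kind of mitigation available when the per-thread stack size is the culprit, the sketch below uses the standard-library call `threading.stack_size()` to request a smaller stack for newly created threads. Whether RADICAL applies this approach is not stated in this thread; the 512 KiB value and the helper name `run_with_small_stack` are assumptions chosen for the example.

```python
import threading

# 512 KiB per thread: an illustrative value, not a tuned recommendation.
STACK_BYTES = 512 * 1024


def run_with_small_stack(target, n_threads=4):
    """Run target(i) in n_threads threads created with a reduced stack size."""
    # The new size applies only to threads created after this call.
    previous = threading.stack_size(STACK_BYTES)
    results = []
    lock = threading.Lock()

    def worker(i):
        value = target(i)
        with lock:
            results.append(value)

    try:
        threads = [threading.Thread(target=worker, args=(i,))
                   for i in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        # Restore the earlier setting (0 means "platform default").
        threading.stack_size(previous)
    return sorted(results)


if __name__ == "__main__":
    print(run_with_small_stack(lambda i: i * i))
```

Note that `threading.stack_size()` can raise `ValueError` for sizes the platform rejects and `RuntimeError` where changing the size is unsupported, so real code would need to handle both.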

andre-merzky commented 5 years ago

According to Vivek, this requires a support ticket with NCAR to increase thread and process limits.

Weiming-Hu commented 5 years ago

Thank you. Are you suggesting that I should submit a ticket to the Cheyenne admins? I remember that I already asked them to increase my thread limit. Should I do it again?

andre-merzky commented 5 years ago

I did submit a ticket - but did not yet get a reply. If you got your limit raised already and it still doesn't work, you are likely hung up on something different, or the limit was reset for some reason. Either way though, I won't be able to reproduce the problem until support replies... :/
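
To check whether a raised limit is still in effect, one option is to print the current limits from the same login shell or job that starts the client. This is a minimal sketch using the standard-library `resource` module (Unix only). Which limits actually matter on Cheyenne is not stated in this thread, so the selection below is an assumption.

```python
import resource

# Limits that commonly affect thread-heavy Python clients (an assumed selection).
LIMITS = {
    "stack size (bytes)": resource.RLIMIT_STACK,
    "processes/threads per user": resource.RLIMIT_NPROC,
    "open files": resource.RLIMIT_NOFILE,
}


def fmt(value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def report_limits():
    """Return {label: (soft, hard)} for the limits listed above."""
    report = {}
    for label, which in LIMITS.items():
        soft, hard = resource.getrlimit(which)
        report[label] = (fmt(soft), fmt(hard))
    return report


if __name__ == "__main__":
    for label, (soft, hard) in report_limits().items():
        print("%-28s soft=%s hard=%s" % (label, soft, hard))
```

Comparing the output before and after support responds would show whether the limit was changed or reset.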

mturilli commented 5 years ago

Hey @Weiming-Hu, I don't think any action is needed from you. It is for @andre-merzky and myself to open that ticket asking for our thread limits to be increased.

Weiming-Hu commented 5 years ago

Thank you for the clarification.

mturilli commented 5 years ago

Closing because the workflow is no longer used.