hashdist / hashstack

Collection of software profiles for HashDist
https://hashdist.github.io/

Can't install any package with hashdist #972

Closed ghost closed 7 years ago

ghost commented 7 years ago

Hello, I want to install several packages with hashdist on a cluster that uses a package manager called Dotkit. Below is the error I get when hashdist tries to install the first package:

2016/11/11 14:05:23 - INFO: [package:run_job] running [u'/bin/bash', '_hashdist/build.sh']
2016/11/11 14:05:23 - INFO: [package:run_job] environment:
2016/11/11 14:05:23 - INFO: [package:run_job] {'ARTIFACT': u'/g/g92/miguel/.hashdist/bld/patchelf/qo37mjqewsga',
2016/11/11 14:05:23 - INFO: [package:run_job] 'BASH': u'/bin/bash',
2016/11/11 14:05:23 - INFO: [package:run_job] 'BUILD': u'/g/g92/miguel/.hashdist/tmp/patchelf-qo37mjqewsga-10',
2016/11/11 14:05:23 - INFO: [package:run_job] 'HASHDIST_CPU_COUNT': '1',
2016/11/11 14:05:23 - INFO: [package:run_job] 'HDIST_CONFIG': '{"gc_roots":"/g/g92/miguel/.hashdist/gcroots","build_stores":[{"dir":"/g/g92/miguel/.hashdist/bld"}],"source_caches":[{"dir":"/g/g92/miguel/.hashdist/src"}],"cache":"/g/g92/miguel/.hashdist/cache","build_temp":"/g/g92/miguel/.hashdist/tmp"}',
2016/11/11 14:05:23 - INFO: [package:run_job] 'HDIST_IMPORT': '',
2016/11/11 14:05:23 - INFO: [package:run_job] 'HDIST_IMPORT_PATHS': '',
2016/11/11 14:05:23 - INFO: [package:run_job] 'HDIST_VIRTUALS': '',
2016/11/11 14:05:23 - INFO: [package:run_job] 'PATH': u'/g/g92/miguel/pythonpackages/bin/:/g/g92/miguel/shawncplus-Vim-toCterm-0f47db8/:/g/g92/miguel/pythonpackages/bin/:/g/g92/miguel/Xvfb/bin/:/g/g92/miguel/jdk1.7.0_79/bin/:/usr/local/tools/openmpi-intel-1.8.4/bin:/usr/local/tools/python-2.7.7/bin:/usr/global/tools/clang/chaos_5_x86_64_ib/clang-3.7.0/bin:/usr/local/tools/boost-mpi-1.55.0/bin:/usr/local/tools/vtk-6.1.0/bin:/usr/local/tools/qt-4.8.3/bin:/usr/local/tools/imgtrack-1.0/bin:/usr/local/tools/sqlcipher-2.0.3-0/bin:/usr/local/tools/boost-nompi-1.55.0/bin:/usr/local/tools/ld-auto-rpath/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/global/tools/totalview/m/ansel/default/bin:/collab/usr/global/tools/git/chaos_5_x86_64_ib/git-2.0.0/bin',
2016/11/11 14:05:23 - INFO: [package:run_job] 'PWD': u'/g/g92/miguel/.hashdist/tmp/patchelf-qo37mjqewsga-10'}
2016/11/11 14:05:23 - INFO: [package:run_job]
2016/11/11 14:05:23 - INFO: [package:run_job] This command is part of Dotkit, which you may access
2016/11/11 14:05:23 - INFO: [package:run_job] after initializing via the following command:
2016/11/11 14:05:23 - INFO: [package:run_job]
2016/11/11 14:05:23 - INFO: [package:run_job] For csh/tcsh shells:
2016/11/11 14:05:23 - INFO: [package:run_job] source /usr/local/tools/dotkit/init.csh
2016/11/11 14:05:23 - INFO: [package:run_job]
2016/11/11 14:05:23 - INFO: [package:run_job] For sh/ksh/bash shells:
2016/11/11 14:05:23 - INFO: [package:run_job] . /usr/local/tools/dotkit/init.sh
2016/11/11 14:05:23 - INFO: [package:run_job]
2016/11/11 14:05:23 - INFO: [package:run_job] For zsh shells:
2016/11/11 14:05:23 - INFO: [package:run_job] . /usr/local/tools/dotkit/init.zsh
2016/11/11 14:05:23 - INFO: [package:run_job]
2016/11/11 14:05:23 - ERROR: [package:run_job] Command '[u'/bin/bash', '_hashdist/build.sh']' returned non-zero exit status 1
2016/11/11 14:05:23 - ERROR: [package:run_job] command failed (code=1); raising

As you can see, hashdist wants to run the script build.sh, which is as follows:

set -e
export HDIST_IN_BUILD=yes
. /usr/local/tools/dotkit/init.sh; use openmpi-intel-1.8.4; export CC=mpicc; export CXX=mpic++; export FC=gfortran; export F77=mpif77; export F90=mpif90; export CPP=cpp;
(
export CPPFLAGS=""
export LDFLAGS=""
./configure --prefix="${ARTIFACT}"
)
make -j ${HASHDIST_CPU_COUNT}
make install
rm -f ${ARTIFACT}/lib/*.la

but it cannot get past the first lines because Dotkit has not been initialized. I tried sourcing /usr/local/tools/dotkit/init.sh in the PROLOGUE, but as you can see it does not help. Any ideas on what I could try?

Thanks,
Miguel

johannesring commented 7 years ago

You are doing the right thing, but you need to figure out how to initialize Dotkit properly. One thing you can try is to source ~/.bashrc, ~/.profile, ~/.bash_profile, etc. in the PROLOGUE instead.

cekees commented 7 years ago

Yeah, getting that PROLOGUE section right on HPC platforms is tricky. You may want to try running with hit build --debug to drop into the shell of the build, manually initialize Dotkit there, and then run /bin/bash _hashdist/build.sh.

ghost commented 7 years ago

The script to initialize DotKit checks if the env variable $HOME exists, but it seems that it's not passed to the shell of the build.

if [ -n "$HOME" ]; then
  if [ ! -f $HOME/.nodotkit ]; then
    if [ -f /usr/local/tools/dotkit/dotkit/ksh/.dk_init ]; then
      export DK_ROOT=/usr/local/tools/dotkit/dotkit
      . $DK_ROOT/ksh/.dk_init
      #  unalter DK_NODE /usr/global/tools/dotkit
      #  alter   DK_NODE /usr/global/tools/dotkit  # prepends LC / DEG .dk files to default set of .dk files
      reuse -q lcinit
    fi
  fi
fi

I can directly load . $DK_ROOT/ksh/.dk_init (which doesn't make the installation work either yet), but I want to know how to pass all the other environment variables that might be required to load DotKit within the shell of the build. How can I do this?

cekees commented 7 years ago

That's what the prologue is for, but the idea is to decouple from the user's environment as much as possible. As @johannesring said, you can source the .*rc files in the prologue, or you can drop into the debug shell, set the variables as needed to debug the build, and then add just the minimal fixes to the prologue section.

ghost commented 7 years ago

HOME does not appear in the debug shell even when I define it in the PROLOGUE. It seems to me that the variables defined in the PROLOGUE are exported inside build.sh, but I need to have HOME set before I run build.sh.

cekees commented 7 years ago

That's right. The user environment variables are unset in the build shell. You can set HOME yourself in the debugging shell.
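To illustrate why HOME goes missing, here is a minimal sketch. It assumes the build shell behaves roughly like a process started with an emptied environment; `env -i` is only a stand-in for that behaviour, not hashdist's actual mechanism, and /tmp/example-home is a made-up path.

```shell
# `env -i` runs a command with an empty environment. This only imitates
# the scrubbed build shell described above (an assumption).
if env -i /usr/bin/env | grep -q '^HOME='; then
  home_in_clean_env=yes
else
  home_in_clean_env=no
fi
echo "HOME exported in clean environment: $home_in_clean_env"

# Passing the variable explicitly makes it visible again.
passed_home=$(env -i HOME=/tmp/example-home /usr/bin/env | grep '^HOME=')
echo "$passed_home"
```

Anything the build needs from the login environment has to be re-established explicitly in the same way.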

ghost commented 7 years ago

Yes, I can do that, but then whenever I run /bin/bash _hashdist/build.sh inside the build shell, everything is cleared out and I get the same error:

This command is part of Dotkit, which you may access
after initializing via the following command:

For csh/tcsh shells:
  source /usr/local/tools/dotkit/init.csh

For sh/ksh/bash shells:
  . /usr/local/tools/dotkit/init.sh

For zsh shells:
  . /usr/local/tools/dotkit/init.zsh

However, if I examine the content of _hashdist/build.sh and run each line in the build shell, I can compile and install the package successfully (after having defined HOME and sourced /usr/local/tools/dotkit/init.sh in the build shell). Is there any way to pass the build shell environment into /bin/bash _hashdist/build.sh?

cekees commented 7 years ago

In the debug shell you should be able to do HOME=xyz MYVAR=abc ... /bin/bash _hashdist/build.sh, or you can just edit _hashdist/build.sh directly. Once you have the set of export VAR= statements required to build properly, put those in the PROLOGUE step. Then exit 1 from the debug shell and try to build your stack again.
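The prefix-assignment form mentioned above can be tried in isolation. This is a small sketch in which a temporary script stands in for _hashdist/build.sh; the script contents and the values are hypothetical.

```shell
unset MYVAR

# Hypothetical stand-in for _hashdist/build.sh that only reports what it sees.
script=$(mktemp)
printf '%s\n' 'echo "HOME=$HOME MYVAR=${MYVAR:-unset}"' > "$script"

# Assignments placed before a command are exported to that command only.
with_vars=$(HOME=/tmp/example-home MYVAR=abc bash "$script")
echo "$with_vars"

# The calling shell itself is left untouched.
leaked=${MYVAR:-unset}
echo "MYVAR in calling shell: $leaked"
rm -f "$script"
```

This is convenient for experiments in the debug shell; once the needed variables are known, the matching export statements belong in the PROLOGUE.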

ghost commented 7 years ago

I think the best option is to edit _hashdist/build.sh directly, because /usr/local/tools/dotkit/init.sh is considerably long. I am going to look into prepending the lines

export HOME=...
. /usr/local/tools/dotkit/init.sh

at the beginning of each _hashdist/build.sh. This script is generated for each package, isn't it? Could you point me to where in hashdist it is generated? Thanks.

cekees commented 7 years ago

Anything you put in the prologue is going to get written to _hashdist/build.sh for every package.

ghost commented 7 years ago

OK, I wasn't putting the commands in the PROLOGUE in the right order. I had included . /usr/local/tools/dotkit/init.sh as the first command weeks ago and then removed it, but back then I didn't know I needed export HOME= before anything else. Now I know this, and it works if I pass the --debug flag, execute _hashdist/build.sh, and exit 0 for each package. Without this flag, though, I get the error:

[blas|ERROR] Command '[u'/bin/bash', '_hashdist/build.sh']' returned non-zero exit status 4
[blas|ERROR] command failed (code=4); raising

Can I install the packages in debug mode without any other consequences? Why do I get this error without debug? Thanks a lot for your help!

cekees commented 7 years ago

Can you post your profile yaml file? I'm not sure why your _hashdist/build.sh is failing outside of the debug shell.

ghost commented 7 years ago

# This profile file controls your <#> (HashDist) build environment.

# In the future, we'll provide better incorporation of
# automatic environment detection.  For now, have a look
# at the YAML files in the top-level directory and choose
# the most *specific* file that matches your environment.

extends:
- file: linux.yaml
parameters:
  PROLOGUE: |
      export HOME=/g/g92/miguel; . /usr/local/tools/dotkit/init.sh; use openmpi-intel-1.8.4; export CC=mpicc; export CXX=mpic++; export FC=gfortran; export F77=mpif77; export F90=mpif90; export CPP=cpp;
  HOST_MPICC: /usr/local/tools/openmpi-intel-1.8.4/bin/mpicc
  HOST_MPICXX: /usr/local/tools/openmpi-intel-1.8.4/bin/mpic++
  HOST_MPIF77: /usr/local/tools/openmpi-intel-1.8.4/bin/mpif77
  HOST_MPIF90: /usr/local/tools/openmpi-intel-1.8.4/bin/mpif90
  HOST_MPIEXEC: /usr/local/tools/openmpi-intel-1.8.4/bin/mpiexec
  HOST_CMAKE: /usr/local/bin/cmake
  HOST_PETSC_DIR: /g/g92/miguel/petsc-3.6.2/
  HOST_PETSC_ARCH: miguel-opt
  HOST_BOOST: /usr/local/tools/boost-mpi-1.55.0/lib 
  LD_LIBRARY_PATH: /usr/local/tools/ic-14.0.174/lib/:/usr/local/tools/openmpi-intel-1.8.4/lib/openmpi:/usr/local/tools/openmpi-intel-1.8.4/lib:/usr/local/tools/boost-mpi-1.55.0/lib:/usr/local/tools/vtk-6.1.0/lib/python2.6/site-packages/vtk:/usr/local/tools/vtk-6.1.0/lib:/usr/local/tools/qt-4.8.3/lib:/usr/local/tools/boost-nompi-1.49.0/lib:/usr/local/tools/sqlcipher-2.0.3-0/lib:/usr/local/tools/boost-nompi-1.55.0/lib
  PATH: /g/g92/miguel/pythonpackages/bin/:/g/g92/miguel/shawncplus-Vim-toCterm-0f47db8/:/g/g92/miguel/pythonpackages/bin/:/g/g92/miguel/Xvfb/bin/:/g/g92/miguel/jdk1.7.0_79/bin/:/usr/local/tools/openmpi-intel-1.8.4/bin:/usr/local/tools/python-2.7.7/bin:/usr/global/tools/clang/chaos_5_x86_64_ib/clang-3.7.0/bin:/usr/local/tools/boost-mpi-1.55.0/bin:/usr/local/tools/vtk-6.1.0/bin:/usr/local/tools/qt-4.8.3/bin:/usr/local/tools/imgtrack-1.0/bin:/usr/local/tools/sqlcipher-2.0.3-0/bin:/usr/local/tools/boost-nompi-1.55.0/bin:/usr/local/tools/ld-auto-rpath/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/global/tools/totalview/m/ansel/default/bin:/collab/usr/global/tools/git/chaos_5_x86_64_ib/git-2.0.0/bin

packages:
  launcher:
  cmake:
    use: host-cmake
  mpi:
    use: host-mpi
  blas:
    use: openblas
  hdf5:
  petsc:
    use: host-petsc
    version: '3.6.2'
  h5py:
  pyvtk:
  matplotlib:
  swig:
  scipy:
  cbcblock:

package_dirs:
- /g/g92/miguel/petsc-3.6.2
- pkgs
- base

cekees commented 7 years ago

Nothing is jumping out at me here. I suspect that when you run /bin/bash _hashdist/build.sh in the debug shell it is returning error code 4, so even though it "works" in the debug shell you may need to do some debugging to see why the return code isn't 0.

ghost commented 7 years ago

You're correct. I typed echo $? after /bin/bash _hashdist/build.sh and got 4. I found out that the problem is in one of the commands in /usr/local/tools/dotkit/init.sh. Before, I was sourcing /usr/local/tools/dotkit/init.sh outside of _hashdist/build.sh and had no warnings. Inside _hashdist/build.sh, the first line is set -e, and when /usr/local/tools/dotkit/init.sh is sourced, it aborts the build because of one of the commands within init.sh, specifically reuse -q lcinit, which is part of DotKit. Even running this command after set -e in my regular shell disconnects me from the cluster. The strange thing is that running reuse -q lcinit followed by echo $? returns 0. I will ask the admins what's going on. Is there any way to override the set -e in the PROLOGUE?
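The general mechanism can be reproduced without Dotkit. In this sketch a temporary file stands in for init.sh and `false` stands in for whichever command returns non-zero internally. What actually fails inside reuse -q lcinit was not established in this thread, and DK_READY is a made-up variable.

```shell
# Hypothetical stand-in for init.sh: one failing command, then real setup.
init=$(mktemp)
cat > "$init" <<'EOF'
false
export DK_READY=yes
EOF

# Run in a child bash so that an abort does not kill the current shell.
status_with_e=0
bash -c 'set -e; . "$1"; echo "build continues"' _ "$init" || status_with_e=$?

status_without_e=0
bash -c '. "$1"; echo "build continues"' _ "$init" > /dev/null || status_without_e=$?

echo "with set -e: $status_with_e, without: $status_without_e"
rm -f "$init"
```

With set -e active, the shell exits at the first failing command inside the sourced file, so "build continues" is never printed. This also explains why a logout can follow when the same thing is done in an interactive login shell.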

cekees commented 7 years ago

You may be able to run the dotkit/init.sh inside some logic that will trap the error. If you can get a response from the admins on why it's returning an error that would probably be easier and better.

ghost commented 7 years ago

I ended up adding set +e in my PROLOGUE before initializing dotkit and calling the commands that throw the error and then set -e at the end. This seems to work. Thanks for the help.
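A minimal sketch of that workaround follows, again with a temporary file in place of the real init.sh and `false` as a stand-in for the failing command. DK_READY is a made-up variable used only to show that the rest of the file still runs.

```shell
# Hypothetical stand-in for init.sh: one failing command, then real setup.
init=$(mktemp)
cat > "$init" <<'EOF'
false
export DK_READY=yes
EOF

result=$(bash -c '
set -e      # build.sh starts like this
set +e      # PROLOGUE: tolerate failures while initializing
. "$1"
set -e      # PROLOGUE: restore fail-fast for the actual build steps
echo "DK_READY=$DK_READY"
' _ "$init")
echo "$result"
rm -f "$init"
```

Re-enabling set -e at the end of the PROLOGUE matters: without it, a failing configure or make step would no longer stop the build.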