sdsc / spack

A flexible package manager that supports multiple versions, configurations, platforms, and compilers.
https://spack.io

SDSC: PKG - expanse/0.17.3/cpu/b - Missing NWCHEM (example application) #62

Closed nwolter closed 1 year ago

nwolter commented 1 year ago

Looks like this may already be ready for testing.

Versions:
nwchem/7.0.2/dvlc2dr
nwchem/7.0.2/mhgjuie

mkandes commented 1 year ago

Yes.

[mkandes@login02 ~]$ module spider nwchem/7.0.2

----------------------------------------------------------------------------
  nwchem/7.0.2:
----------------------------------------------------------------------------
     Versions:
        nwchem/7.0.2/dvlc2dr
        nwchem/7.0.2/mhgjuie

----------------------------------------------------------------------------
  For detailed information about a specific "nwchem/7.0.2" package (including how to load the modules) use the module's full name. Note that names that have a trailing (E) are extensions provided by other modules.
  For example:

     $ module spider nwchem/7.0.2/mhgjuie
----------------------------------------------------------------------------

[mkandes@login02 ~]$

nwolter commented 1 year ago

nwchem/7.0.2/dvlc2dr(gcc) looks good

nwolter commented 1 year ago

nwchem/7.0.2/mhgjuie (aocc) has errors. The first error is:

  Caching 1-el integrals
[90] Received an Error in Communication: (1) 90:ga_copy:ngai_put_common:check subscript failed:1441 not in (1:240) dim=1:
[76] Received an Error in Communication: (1) 76:ga_copy:ngai_put_common:check subscript failed:1201 not in (1:240) dim=1:
[92] Received an Error in Communication: (1) 92:ga_copy:ngai_put_common:check subscript failed:1441 not in (1:240) dim=1:
[96] Received an Error in Communication: (1) 96:ga_copy:ngai_put_common:check subscript failed:1441 not in (1:240) dim=1:
[116] Received an Error in Communication: (1) 116:ga_copy:ngai_put_common:check subscript failed:1681 not in (1:240) dim=1:

mkandes commented 1 year ago

@nwolter - Okay, let's revisit with the production instance then.

nwolter commented 1 year ago

Two example scripts:
1) gcc - example tested; example script located at /expanse/lustre/projects/use300/nickel/2023/nwchem/nwchem-2node.sb
2) aocc - example tested (failed)

.
.
  Caching 1-el integrals
_shm_attach: shm_open: Too many open files in system
[113] Received an Error in Communication: (-1) _shm_attach: shm_open
_shm_attach: shm_open: Too many open files in system
.
.

mkandes commented 1 year ago

@nwolter @mkandes - Check Jira for previous ticket asking Systems Group to change file limits on nodes to fix this problem.

mkandes commented 1 year ago

Open file limits on compute nodes are still the same:

[spack_cpu@exp-15-56 openmpi@4.1.3]$ cat /etc/systemd/system/slurmd.service.d/override.conf
##
## THIS FILE IS MANAGED BY bright-ansible
## See: /var/bright-ansible/files/etc/systemd/system/slurmd.service.d/override.conf
##
[Service]
LimitNOFILE=548000
[spack_cpu@exp-15-56 openmpi@4.1.3]$ ulimit -n
548000
[spack_cpu@exp-15-56 openmpi@4.1.3]$

This is the limit set in https://hpc-sdsc.atlassian.net/browse/EXP-83 to resolve this problem previously, so maybe we should retest and see if the failure above was due to conditions on the system at the time.
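One detail that may matter for the retest: `Too many open files in system` is the message Linux uses for ENFILE, the system-wide open-file limit, whereas exceeding the per-process limit controlled by `LimitNOFILE` and `ulimit -n` reports EMFILE (`Too many open files`). A minimal sketch for checking system-wide file handle usage on a node is below. It assumes a Linux `/proc/sys/fs/file-nr` with its usual three fields (allocated, unused, maximum); the function name `file_handle_usage` is hypothetical and not part of any existing tooling here.

```shell
# Report system-wide file handle usage from /proc/sys/fs/file-nr.
# The file holds three whitespace-separated fields:
#   allocated handles, unused handles, system-wide maximum (fs.file-max).
# An alternate path can be passed as $1 (useful for testing).
file_handle_usage() {
    fh_file="${1:-/proc/sys/fs/file-nr}"
    read -r fh_alloc fh_unused fh_max < "$fh_file" || return 1
    # Integer percentage of the system-wide maximum currently allocated.
    echo "allocated=${fh_alloc} max=${fh_max} used_pct=$(( fh_alloc * 100 / fh_max ))"
}
```

Running `file_handle_usage` on a compute node while the AOCC job is active would show whether the node is close to the system-wide maximum, independent of the per-process `LimitNOFILE` value shown above.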

mkandes commented 1 year ago

Both @mkandes and @nwolter confirm we're still seeing the following communication-related errors for the AOCC-based build of NWChem in expanse/0.17.3/cpu/b.

Summary of "ao basis" -> "ao basis" (cartesian)
 ------------------------------------------------------------------------------
       Tag                 Description            Shells   Functions and Types
 ---------------- ------------------------------  ------  ---------------------
 C                           6-31g*                  6       15   3s2p1d

  Caching 1-el integrals 
40:ga_copy:ngai_put_common:check subscript failed:451 not in (1:225) dim=1:Received an Error in Communication
69:ga_copy:ngai_put_common:check subscript failed:901 not in (1:225) dim=1:Received an Error in Communication
76:ga_copy:ngai_put_common:check subscript failed:901 not in (1:225) dim=1:Received an Error in Communication

As such, we've decided to remove these versions from production and report the issue back to the AMD Spack SIG for further review. We plan to revisit this if we can deploy a production-ready AMD build in the future.
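For the report, it may help to condense the Global Arrays errors into a short summary rather than pasting full job output. The sketch below counts each distinct `check subscript failed` message in an NWChem output file, assuming the messages have the same form as the excerpts in this thread; the function name `summarize_ga_errors` is hypothetical. In those excerpts, each failing index happens to be a whole multiple of the upper bound plus one (for example, 901 = 4 x 225 + 1), which might be worth mentioning alongside the counts.

```shell
# Summarize Global Arrays subscript failures found in a log file ($1).
# Prints one line per distinct "<index> not in (<lo>:<hi>) dim=<n>" message,
# prefixed by the number of times it occurred, most frequent first.
summarize_ga_errors() {
    grep -o 'check subscript failed:[0-9]* not in ([0-9]*:[0-9]*) dim=[0-9]*' "$1" \
        | sed 's/^check subscript failed://' \
        | sort | uniq -c | sort -rn
}
```

Usage would be along the lines of `summarize_ga_errors nwchem.out`, where `nwchem.out` stands in for whatever file the job's output was written to.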

mkandes commented 1 year ago

Removed.

[spack_cpu@exp-15-56 b]$ !6081
. /cm/shared/apps/spack/0.17.3/cpu/b/share/spack/setup-env.sh
[spack_cpu@exp-15-56 b]$ spack find -lvd nwchem
==> 3 installed packages
-- linux-rocky8-zen2 / aocc@3.2.0 -------------------------------
itamucg nwchem@7.0.2~mpipr~openmp
6sfatsa     amdblis@3.1+blas+cblas~ilp64+shared+static threads=none
kiytcz3         python@3.8.12+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3~ssl~tix~tkinter~ucs4+uuid+zlib patches=0d98e93189bc278fbc37a50ed7f183bd8aaf249a8e1670a465f0db6bb4f8cf87,4c2457325f2b608b1b6a2c63087df8c26e07db3e3d493caf36a56f0ecf6fb768,f2fd060afc4b4618fe8104c4c5d771f36dc55b1db5a4623785a4ea707ec72fb4
pvpmqs4             bzip2@1.0.8~debug~pic+shared
d4wiitw             expat@2.4.1+libbsd
zpca44r                 libbsd@0.11.3
tlqbjjl                     libmd@1.0.3
fwban2n             gdbm@1.19
zcvyw4d                 readline@8.1
vrkfn5j                     ncurses@6.2~symlinks+termlib abi=none
a5ype3y             gettext@0.21+bzip2+curses+git~libunistring+libxml2+tar+xz
wsl5g6s                 libiconv@1.16 libs=shared,static
6xxubr4                 libxml2@2.9.12~python
pc3ghll                     xz@5.2.5~pic libs=shared,static
4flcxn7                     zlib@1.2.11+optimize+pic+shared
zn3a5tk                 tar@1.34
34w374u             libffi@3.3 patches=26f26c6f29a7ce9bf370ad3ab2610f99365b4bdd7b82e7c31df41a3370d685c0
5h6inaj             sqlite@3.36.0+column_metadata+fts~functions~rtree
bj7ksgi             util-linux-uuid@2.36.2
t757st4     amdfftw@3.1~amd-app-opt~amd-fast-planner~amd-mpi-vader-limit~amd-top-n-planner~amd-trans~debug~mpi~openmp+shared~static~threads precision=double,float
qn3awdo         texinfo@6.5 patches=12f6edb0c6b270b8c8dba2ce17998c580db01182d871ee32b7b6e4129bd1d23a,1732115f651cff98989cb0215d8f64da5e0f7911ebf0c13b064920f088f2ffe1
cytiibn             perl@5.32.0+cpanm+shared+threads
cthafj2                 berkeley-db@18.1.40+cxx~docs+stl patches=b231fcc4d5cff05e5c3a4814f6a5af0e9a966428dc2176540d2c05aff41de522
f37pl2z     amdlibflame@3.1~debug~ilp64+lapack2flame+shared+static threads=none
7qg2ts5     amdscalapack@3.1~ilp64~ipo+pic+shared build_type=Release
xigazqd         openmpi@4.1.3~atomics~cuda~cxx~cxx_exceptions~gpfs~internal-hwloc~java+legacylaunchers+lustre~memchecker+pmi+pmix+romio~rsh~singularity+static+vt+wrapper-rpath cuda_arch=none fabrics=ucx schedulers=slurm
2ewlbuo             hwloc@2.6.0~cairo~cuda~gl~libudev+libxml2~netloc~nvml~opencl+pci~rocm+shared
whgqok2                 libpciaccess@0.16
tax5liq             libevent@2.1.8~openssl
soqjoas             lustre@2.15.2
g44vo3y             numactl@2.0.14 patches=4e1d78cbbb85de625bad28705e748856033eaafab92a66dffd383a3d7e00cc94,62fc8a8bf7665a60e8f4c93ebbd535647cebf74198f7afafec4c085a8825c006,ff37630df599cfabf0740518b91ec8daaf18e8f288b19adaae5364dc1f6b2296
y33fcsl             pmix@3.2.1~docs+pmi_backwards_compatibility~restful
3r2kfj2             slurm@21.08.8~gtk~hdf5~hwloc~mariadb~pmix+readline~restd sysconfdir=PREFIX/etc
wla3unl             ucx@1.10.1~assertions~cm+cma~cuda+dc~debug+dm~gdrcopy+ib-hw-tm~java~knem~logging+mlx5-dv+optimizations~parameter_checking+pic+rc~rocm+thread_multiple+ud~xpmem cuda_arch=none
wbadl55                 rdma-core@43.0~ipo build_type=RelWithDebInfo

hpjucmf nwchem@7.0.2+mpipr+openmp
6sfatsa     amdblis@3.1+blas+cblas~ilp64+shared+static threads=none
kiytcz3         python@3.8.12+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3~ssl~tix~tkinter~ucs4+uuid+zlib patches=0d98e93189bc278fbc37a50ed7f183bd8aaf249a8e1670a465f0db6bb4f8cf87,4c2457325f2b608b1b6a2c63087df8c26e07db3e3d493caf36a56f0ecf6fb768,f2fd060afc4b4618fe8104c4c5d771f36dc55b1db5a4623785a4ea707ec72fb4
pvpmqs4             bzip2@1.0.8~debug~pic+shared
d4wiitw             expat@2.4.1+libbsd
zpca44r                 libbsd@0.11.3
tlqbjjl                     libmd@1.0.3
fwban2n             gdbm@1.19
zcvyw4d                 readline@8.1
vrkfn5j                     ncurses@6.2~symlinks+termlib abi=none
a5ype3y             gettext@0.21+bzip2+curses+git~libunistring+libxml2+tar+xz
wsl5g6s                 libiconv@1.16 libs=shared,static
6xxubr4                 libxml2@2.9.12~python
pc3ghll                     xz@5.2.5~pic libs=shared,static
4flcxn7                     zlib@1.2.11+optimize+pic+shared
zn3a5tk                 tar@1.34
34w374u             libffi@3.3 patches=26f26c6f29a7ce9bf370ad3ab2610f99365b4bdd7b82e7c31df41a3370d685c0
5h6inaj             sqlite@3.36.0+column_metadata+fts~functions~rtree
bj7ksgi             util-linux-uuid@2.36.2
t757st4     amdfftw@3.1~amd-app-opt~amd-fast-planner~amd-mpi-vader-limit~amd-top-n-planner~amd-trans~debug~mpi~openmp+shared~static~threads precision=double,float
qn3awdo         texinfo@6.5 patches=12f6edb0c6b270b8c8dba2ce17998c580db01182d871ee32b7b6e4129bd1d23a,1732115f651cff98989cb0215d8f64da5e0f7911ebf0c13b064920f088f2ffe1
cytiibn             perl@5.32.0+cpanm+shared+threads
cthafj2                 berkeley-db@18.1.40+cxx~docs+stl patches=b231fcc4d5cff05e5c3a4814f6a5af0e9a966428dc2176540d2c05aff41de522
f37pl2z     amdlibflame@3.1~debug~ilp64+lapack2flame+shared+static threads=none
7qg2ts5     amdscalapack@3.1~ilp64~ipo+pic+shared build_type=Release
xigazqd         openmpi@4.1.3~atomics~cuda~cxx~cxx_exceptions~gpfs~internal-hwloc~java+legacylaunchers+lustre~memchecker+pmi+pmix+romio~rsh~singularity+static+vt+wrapper-rpath cuda_arch=none fabrics=ucx schedulers=slurm
2ewlbuo             hwloc@2.6.0~cairo~cuda~gl~libudev+libxml2~netloc~nvml~opencl+pci~rocm+shared
whgqok2                 libpciaccess@0.16
tax5liq             libevent@2.1.8~openssl
soqjoas             lustre@2.15.2
g44vo3y             numactl@2.0.14 patches=4e1d78cbbb85de625bad28705e748856033eaafab92a66dffd383a3d7e00cc94,62fc8a8bf7665a60e8f4c93ebbd535647cebf74198f7afafec4c085a8825c006,ff37630df599cfabf0740518b91ec8daaf18e8f288b19adaae5364dc1f6b2296
y33fcsl             pmix@3.2.1~docs+pmi_backwards_compatibility~restful
3r2kfj2             slurm@21.08.8~gtk~hdf5~hwloc~mariadb~pmix+readline~restd sysconfdir=PREFIX/etc
wla3unl             ucx@1.10.1~assertions~cm+cma~cuda+dc~debug+dm~gdrcopy+ib-hw-tm~java~knem~logging+mlx5-dv+optimizations~parameter_checking+pic+rc~rocm+thread_multiple+ud~xpmem cuda_arch=none
wbadl55                 rdma-core@43.0~ipo build_type=RelWithDebInfo

-- linux-rocky8-zen2 / gcc@10.2.0 -------------------------------
qogdxxi nwchem@7.0.2~mpipr~openmp
bmvqsbr     fftw@3.3.10+mpi~openmp~pfft_patches precision=double,float
oq3qvsv         openmpi@4.1.3~atomics~cuda~cxx~cxx_exceptions~gpfs~internal-hwloc~java+legacylaunchers+lustre~memchecker+pmi+pmix+romio~rsh~singularity+static+vt+wrapper-rpath cuda_arch=none fabrics=ucx schedulers=slurm
7rqkdv4             hwloc@2.6.0~cairo~cuda~gl~libudev+libxml2~netloc~nvml~opencl+pci~rocm+shared
ykynzrw                 libpciaccess@0.16
mgovjpj                 libxml2@2.9.12~python
zduoj2d                     libiconv@1.16 libs=shared,static
paz7hxz                     xz@5.2.5~pic libs=shared,static
ws4iari                     zlib@1.2.11+optimize+pic+shared
5lhvslt                 ncurses@6.2~symlinks+termlib abi=none
bimlmtn             libevent@2.1.8~openssl
fy2cjdg             lustre@2.15.2
ckhyr5e             numactl@2.0.14 patches=4e1d78cbbb85de625bad28705e748856033eaafab92a66dffd383a3d7e00cc94,62fc8a8bf7665a60e8f4c93ebbd535647cebf74198f7afafec4c085a8825c006,ff37630df599cfabf0740518b91ec8daaf18e8f288b19adaae5364dc1f6b2296
dpvrfip             pmix@3.2.1~docs+pmi_backwards_compatibility~restful
4kvl3fd             slurm@21.08.8~gtk~hdf5~hwloc~mariadb~pmix+readline~restd sysconfdir=PREFIX/etc
dnpjjuc             ucx@1.10.1~assertions~cm+cma~cuda+dc~debug+dm~gdrcopy+ib-hw-tm~java~knem~logging+mlx5-dv+optimizations~parameter_checking+pic+rc~rocm+thread_multiple+ud~xpmem cuda_arch=none
xjr3cuj                 rdma-core@43.0~ipo build_type=RelWithDebInfo
pywku55     netlib-scalapack@2.1.0~ipo+pic+shared build_type=Release patches=1c9ce5fee1451a08c2de3cc87f446aeda0b818ebbce4ad0d980ddf2f2a0b2dc4,f2baedde688ffe4c20943c334f580eb298e04d6f35c86b90a1f4e8cb7ae344a2
fgk2tlu         openblas@0.3.18~bignuma~consistent_fpcsr~ilp64+locking+pic+shared threads=none
7zdjza7     python@3.8.12+bz2+ctypes+dbm~debug+libxml2+lzma~nis+optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4+uuid+zlib patches=0d98e93189bc278fbc37a50ed7f183bd8aaf249a8e1670a465f0db6bb4f8cf87,4c2457325f2b608b1b6a2c63087df8c26e07db3e3d493caf36a56f0ecf6fb768,f2fd060afc4b4618fe8104c4c5d771f36dc55b1db5a4623785a4ea707ec72fb4
pulggjv         bzip2@1.0.8~debug~pic+shared
tawwsnw         expat@2.4.1+libbsd
wblxldx             libbsd@0.11.3
rgboqoh                 libmd@1.0.3
clf6bmr         gdbm@1.19
clxlnwz             readline@8.1
ey3k6dv         gettext@0.21+bzip2+curses+git~libunistring+libxml2+tar+xz
e2brhcb             tar@1.34
5oh6vxq         libffi@3.3 patches=26f26c6f29a7ce9bf370ad3ab2610f99365b4bdd7b82e7c31df41a3370d685c0
v3ycaao         openssl@1.1.1k~docs certs=system
fxmvvsx         sqlite@3.36.0+column_metadata+fts~functions~rtree
vncpkij         util-linux-uuid@2.36.2

[spack_cpu@exp-15-56 b]$ spack uninstall nwchem@7.0.2 % aocc@3.2.0
==> Error: nwchem@7.0.2%aocc@3.2.0 matches multiple packages:

    -- linux-rocky8-zen2 / aocc@3.2.0 -------------------------------
    itamucg nwchem@7.0.2  hpjucmf nwchem@7.0.2

==> Error: You can either:
    a) use a more specific spec, or
    b) specify the spec by its hash (e.g. `spack uninstall /hash`), or
    c) use `spack uninstall --all` to uninstall ALL matching specs.

[spack_cpu@exp-15-56 b]$ spack uninstall --all nwchem@7.0.2 % aocc@3.2.0
==> The following packages will be uninstalled:

    -- linux-rocky8-zen2 / aocc@3.2.0 -------------------------------
    itamucg nwchem@7.0.2  hpjucmf nwchem@7.0.2

==> Do you want to proceed? [y/N] y
==> Successfully uninstalled nwchem@7.0.2%aocc@3.2.0+mpipr+openmp arch=linux-rocky8-zen2/hpjucmf
==> Successfully uninstalled nwchem@7.0.2%aocc@3.2.0~mpipr~openmp arch=linux-rocky8-zen2/itamucg
[spack_cpu@exp-15-56 b]$
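As the error message above notes, an alternative to `--all` is to uninstall each build by its hash, which avoids removing anything that happens to match the same spec. A minimal sketch, assuming input in the `spack find -l` style shown above (a 7-character hash followed by `name@version`); the helper name `print_uninstall_by_hash` is hypothetical, and it only prints the commands for review rather than running them.

```shell
# Read `spack find -l`-style text on stdin and print one
# `spack uninstall /<hash>` command per installed spec of package $1.
# Nothing is uninstalled; the output is meant to be reviewed first.
print_uninstall_by_hash() {
    grep -o "[a-z0-9]\{7\} $1@[^ ]*" | while read -r hash _spec; do
        echo "spack uninstall /${hash}"
    done
}
```

For example, piping the output of `spack find -l nwchem@7.0.2 % aocc@3.2.0` into `print_uninstall_by_hash nwchem` would list one uninstall command for each of the two AOCC builds.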
mkandes commented 1 year ago

Refreshing module environment.

[spack_cpu@exp-15-56 b]$ !5927
spack module lmod refresh -y
==> Regenerating lmod module files
==> OpenFOAM bashrc env: /cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen2/aocc-3.2.0/openfoam-2106-jz42us227mirxrhqjvojlaut2giuh74j/etc/bashrc
[spack_cpu@exp-15-56 b]$

mkandes commented 1 year ago

Closing issue. https://github.com/sdsc/spack/commit/494c75a06253920a9e9c61f2bfc5f5ee6f0a1fae