neams-th-coe / cardinal

High-Fidelity Multiphysics
https://cardinal.cels.anl.gov/

Installing Cardinal on Sawtooth #819

Open delcmo opened 9 months ago

delcmo commented 9 months ago

Bug Description

I am trying to compile Cardinal on Sawtooth and get the following error message with make -j8:

Cardinal is using HDF5 from    /home/delcmarc/cardinal/contrib/moose/petsc/arch-moose
Cardinal is using MOOSE from   /home/delcmarc/cardinal/contrib/moose
Cardinal is using NekRS from   /home/delcmarc/cardinal/contrib/nekRS
Cardinal is using OpenMC from  /home/delcmarc/cardinal/contrib/openmc
Cardinal is compiled with the following MOOSE modules
  FLUID_PROPERTIES
  HEAT_TRANSFER
  NAVIER_STOKES
  REACTOR
  SOLID_PROPERTIES
  STOCHASTIC_TOOLS
  TENSOR_MECHANICS
  THERMAL_HYDRAULICS
Linking libpng: -lpng16 -lz 
Linking Library /home/delcmarc/cardinal/contrib/moose/framework/libmoose-opt.la...
Linking Library /home/delcmarc/cardinal/contrib/moose/modules/solid_properties/lib/libsolid_properties-opt.la...
libtool: warning: '/home/delcmarc/cardinal/contrib/moose/scripts/../libmesh/installed/lib/libmesh_opt.la' seems to be moved
libtool: warning: '/home/delcmarc/cardinal/contrib/moose/scripts/../libmesh/installed/lib/libnetcdf.la' seems to be moved
libtool: warning: '/home/delcmarc/cardinal/contrib/moose/scripts/../libmesh/installed/lib/libtimpi_opt.la' seems to be moved
/usr/bin/grep: /apps/local/mvapich2/2.3.3-gcc-9.2.0/lib/libmpicxx.la: No such file or directory
/usr/bin/sed: can't read /apps/local/mvapich2/2.3.3-gcc-9.2.0/lib/libmpicxx.la: No such file or directory
libtool:   error: '/apps/local/mvapich2/2.3.3-gcc-9.2.0/lib/libmpicxx.la' is not a valid libtool archive
make: *** [/home/delcmarc/cardinal/contrib/moose/framework/moose.mk:397: /home/delcmarc/cardinal/contrib/moose/framework/libmoose-opt.la] Error 1
make: *** Waiting for unfinished jobs....

I did follow the installation instructions on the Cardinal webpage and loaded all modules as instructed. PETSc and libMesh compiled fine as far as I can tell. I also checked that the paths of the files mentioned in the libtool warning messages are all valid.
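To be specific about what I checked: the .la files named in the libtool warnings all exist, and only the mvapich2 archive from the error is missing. This is the kind of check I mean (file names taken from the log above, with the scripts/.. part of the paths resolved):

ls -l /home/delcmarc/cardinal/contrib/moose/libmesh/installed/lib/libmesh_opt.la
ls -l /home/delcmarc/cardinal/contrib/moose/libmesh/installed/lib/libnetcdf.la
ls -l /home/delcmarc/cardinal/contrib/moose/libmesh/installed/lib/libtimpi_opt.la
ls -l /apps/local/mvapich2/2.3.3-gcc-9.2.0/lib/libmpicxx.la    # No such file or directory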

Steps to Reproduce

On Sawtooth using the installation instructions from here.

Impact

I need Cardinal installed on Sawtooth for a project with the NEAMS Workbench.

aprilnovak commented 9 months ago

Hi @delcmo - I have very recently built Cardinal on Sawtooth without issue, so I am confident we can get to the bottom of this :)

What modules are you using for MPI? The recommended modules on the Cardinal website you link are for OpenMPI, but it looks like the error you are seeing is from mvapich.
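A quick way to confirm which MPI your build environment is actually picking up is something like the following (the -show / --showme flags just print what the wrapper will invoke; use whichever form your wrapper accepts):

module list
which mpicc mpicxx mpif90
mpicxx -show      # MPICH/MVAPICH wrappers
mpicxx --showme   # OpenMPI wrappers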

delcmo commented 9 months ago

You are correct, and I made sure to use the same modules as listed here. The modules I load are:

module purge
module load use.moose
module load moose-tools
module load openmpi/4.1.5_ucx1.14.1
module load cmake/3.27.7-oneapi-2023.2.1-4uzb
module load git-lfs
export CC=mpicc
export CXX=mpicxx
export FC=mpif90
export ENABLE_NEK=true
export ENABLE_OPENMC=true
export ENABLE_DAGMC=false
export CARDINAL_DIR=$HOME/cardinal
export OPENMC_CROSS_SECTIONS=$HOME/cross_sections/endfb-vii.1-hdf5/cross_sections.xml
export NEKRS_HOME=$CARDINAL_DIR/install
export MOOSE_DIR=$CARDINAL_DIR/contrib/moose
export LIBMESH_DIR=$CARDINAL_DIR/contrib/moose/libmesh/installed
export PYTHONPATH=$CARDINAL_DIR/contrib/moose/python:$PYTHONPATH

I used to run module load mvapich2/2.3.3-gcc-9.2.0-xpjm to build Cardinal last year. I cleared my build and install directories, but there seem to be leftover libraries from my previous builds.
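In case it helps with the diagnosis, this is one way to look for generated libtool archives that still point at the old mvapich2 install (the directories are just guesses based on where my build writes its .la files):

grep -rl "mvapich2/2.3.3-gcc-9.2.0" contrib/moose/libmesh/installed/lib contrib/moose/petsc/arch-moose/lib 2>/dev/null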

Marco

aprilnovak commented 9 months ago

Yes, that's certainly possible - I'd try also cleaning out the MOOSE submodule just to be sure we get everything:

cd cardinal
rm -rf build/ install/
cd contrib/moose
git clean -xfd
cd ../../
make

delcmo commented 9 months ago

Ok, thanks for the suggestion. I was able to compile Cardinal and run the unit tests. Some tests are being skipped and some others fail with an error message:

runWorker Exception: Traceback (most recent call last):
  File "/home/delcmarc/cardinal/contrib/moose/python/TestHarness/schedulers/Scheduler.py", line 456, in runJob
    self.queueJobs(jobs, j_lock)
  File "/home/delcmarc/cardinal/contrib/moose/python/TestHarness/schedulers/Scheduler.py", line 257, in queueJobs
    self.status_pool.apply_async(self.jobStatus, (job, jobs, j_lock))
  File "/apps/moose/stack/moose-tools-2023.10.19/lib/python3.10/multiprocessing/pool.py", line 458, in apply_async
    self._check_running()
  File "/apps/moose/stack/moose-tools-2023.10.19/lib/python3.10/multiprocessing/pool.py", line 353, in _check_running
    raise ValueError("Pool not running")
ValueError: Pool not running

Is that expected? Some of the unit tests were skipped the first time I compiled Cardinal, but I cannot remember how many of them are supposed to fail.

Marco


aprilnovak commented 9 months ago

Would you please attach the whole console output?

./run_tests > out.txt

delcmo commented 9 months ago

I ran the unit tests on the login node and also through the queue. The output is attached.


aprilnovak commented 9 months ago

@delcmo I think the attachment did not go through properly - can you please attach it on github, instead of via an email reply? Or you can email it to me directly.

delcmo commented 9 months ago

Here it is.

unit_tests.txt

aprilnovak commented 9 months ago

Thanks - it looks like some tests are failing for MPI-related reasons (not normal - something is definitely wrong). Here's one failing case; it looks like they all fail in the same way.

    File     : /home/delcmarc/cardinal/contrib/nekRS/3rd_party/occa/src/occa/internal/utils/sys.cpp
    Line     : 937
    Function : dlopen
    Message  : Error loading binary [d810f609fc22f78e/binary] with dlopen: libmpi.so.12: cannot open shared object file: No such file or directory

Perhaps @loganharbour has an idea?

delcmo commented 9 months ago

Any idea why I get these odd error messages when running the unit tests?

loganharbour commented 9 months ago

Is there some old state in your install from the previous build? Or did you forget to load the relevant modules?

That error comes from MPI no longer being in LD_LIBRARY_PATH - i.e., not "loaded"
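For example, something along these lines in the shell (or job script) that launches run_tests will show whether an MPI library directory is on the loader path; which module provides it depends on your setup:

module list
echo $LD_LIBRARY_PATH | tr ':' '\n' | grep -i mpi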

aprilnovak commented 9 months ago

Thanks @loganharbour. In that case, @delcmo I'd suggest wiping out Cardinal (rm -rf cardinal) and rebuilding from scratch to make sure we don't have any old state.
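Roughly the following, written with the generic git submodule commands rather than any Cardinal helper scripts, so defer to the installation instructions wherever they differ (the PETSc/libMesh rebuild scripts are the standard MOOSE ones under contrib/moose/scripts):

cd $HOME
rm -rf cardinal
git clone https://github.com/neams-th-coe/cardinal.git
cd cardinal
git submodule update --init --recursive
./contrib/moose/scripts/update_and_rebuild_petsc.sh
./contrib/moose/scripts/update_and_rebuild_libmesh.sh
make -j8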

delcmo commented 9 months ago

@aprilnovak I followed your suggestions and was able to recompile Cardinal and run the unit tests. 5 of them failed:

test:nek_standalone/channel.test ...................................... [min_cpus=2,FINISHED] FAILED (TIMEOUT)
test:nek_stochastic/quiet_init.driver_multi_2 ......................... [min_cpus=2,FINISHED] FAILED (TIMEOUT)
utils/meshes/interassembly.specs ............................................................. FAILED (CODE 1)
utils/meshes/interassembly_w_structures.specs ................................................ FAILED (CODE 1)
utils/meshes/assembly.specs .................................................................. FAILED (CODE 1)
--------------------------------------------------------------------------------------------------------------
Ran 528 tests in 3254.8 seconds. Average test time 29.3 seconds, maximum test time 401.5 seconds.
523 passed, 89 skipped, 0 pending, 5 FAILED

which seems like more reasonable behavior.

aprilnovak commented 9 months ago

That looks better! Those are normal - we have a few tests (on the order of 5) which take a long time to run. Depending on the parallel settings you used to launch the test suite, those may time out. NekRS has a very slow JIT process the first time you run a test case.

If you re-run the test suite, you should (hopefully) see everything pass because NekRS will be able to use the JIT cache produced on the first test run, saving lots of time on each individual test.
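If it helps, you can also re-run just the cases that timed out rather than the whole suite; the TestHarness --re option filters tests by regular expression (the pattern below just matches the two timed-out specs from your output):

./run_tests --re 'nek_standalone/channel|nek_stochastic/quiet_init'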

delcmo commented 9 months ago

I re-ran it. Only 4 tests failed, and one of them was a TIMEOUT.

For the other three, I get a CODE 1 error because numpy is not found.

utils/meshes/assembly.specs: Working Directory: /home/delcmarc/cardinal/utils/meshes/assembly
utils/meshes/assembly.specs: Running command: python mesh.py
utils/meshes/assembly.specs: Traceback (most recent call last):
utils/meshes/assembly.specs:   File "mesh.py", line 5, in <module>
utils/meshes/assembly.specs:     import numpy as np
utils/meshes/assembly.specs: ModuleNotFoundError: No module named 'numpy'
utils/meshes/assembly.specs:
utils/meshes/assembly.specs: ################################################################################
utils/meshes/assembly.specs: Tester failed, reason: CODE 1
utils/meshes/assembly.specs:

I updated the PYTHONPATH with export PYTHONPATH=$CARDINAL_DIR/contrib/moose/python:$PYTHONPATH but it does not seem to help.

I was able to run the tests that are in the documentation https://cardinal.cels.anl.gov/hpc.html.

aprilnovak commented 9 months ago

Those tests run a Python script, so I would just try pip install numpy and then re-run.
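Something along these lines should work, assuming pip resolves to the same Python that the test harness uses to run mesh.py (the --user flag keeps the install in your home directory, which is usually what you want on a cluster; the last line just re-runs the mesh spec tests):

pip install --user numpy
python -c "import numpy; print(numpy.__version__)"
./run_tests --re 'utils/meshes'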