delcmo opened this issue 11 months ago
Hi @delcmo - I have very recently built Cardinal on Sawtooth without issue, so am confident we can get to the bottom of this :)
What modules are you using for MPI? The recommended modules on the Cardinal website you link are for OpenMPI, but it looks like the error you are seeing is from mvapich.
You are correct, and I made sure to use the same modules as listed there. The modules I load are:
module purge
module load use.moose
module load moose-tools
module load openmpi/4.1.5_ucx1.14.1
module load cmake/3.27.7-oneapi-2023.2.1-4uzb
module load git-lfs
export CC=mpicc
export CXX=mpicxx
export FC=mpif90
export ENABLE_NEK=true
export ENABLE_OPENMC=true
export ENABLE_DAGMC=false
export CARDINAL_DIR=$HOME/cardinal
export OPENMC_CROSS_SECTIONS=$HOME/cross_sections/endfb-vii.1-hdf5/cross_sections.xml
export NEKRS_HOME=$CARDINAL_DIR/install
export MOOSE_DIR=$CARDINAL_DIR/contrib/moose
export LIBMESH_DIR=$CARDINAL_DIR/contrib/moose/libmesh/installed
export PYTHONPATH=$CARDINAL_DIR/contrib/moose/python:$PYTHONPATH
I used to run module load mvapich2/2.3.3-gcc-9.2.0-xpjm to build Cardinal last year. I cleared my build and install directories, but there seem to be left-over libraries from my previous builds.
Marco
Yes, that's certainly possible - I'd try also cleaning out the MOOSE submodule just to be sure we get everything:
cd cardinal
rm -rf build/ install/
cd contrib/moose
git clean -xfd
cd ../../
make
Ok, thanks for the suggestion. I was able to compile Cardinal and run the unit tests. Some tests are being skipped and some others fail with an error message:
runWorker Exception: Traceback (most recent call last):
  File "/home/delcmarc/cardinal/contrib/moose/python/TestHarness/schedulers/Scheduler.py", line 456, in runJob
    self.queueJobs(jobs, j_lock)
  File "/home/delcmarc/cardinal/contrib/moose/python/TestHarness/schedulers/Scheduler.py", line 257, in queueJobs
    self.status_pool.apply_async(self.jobStatus, (job, jobs, j_lock))
  File "/apps/moose/stack/moose-tools-2023.10.19/lib/python3.10/multiprocessing/pool.py", line 458, in apply_async
    self._check_running()
  File "/apps/moose/stack/moose-tools-2023.10.19/lib/python3.10/multiprocessing/pool.py", line 353, in _check_running
    raise ValueError("Pool not running")
ValueError: Pool not running
Is that expected? (Some of the unit tests were skipped the first time I compiled Cardinal, but I cannot remember how many of them are supposed to fail.)
Marco
Would you please attach the whole console output?
./run_tests > out.txt
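If some of the failure output goes to stderr, a plain redirect may miss it; as a hedged suggestion, capturing both streams should get everything into the file:
./run_tests > out.txt 2>&1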
I ran the unit tests on the login node and also in the queue. The output is attached.
@delcmo I think the attachment did not go through properly - can you please attach it on GitHub instead of via an email reply? Or you can email it to me directly.
Here it is.
Thanks - it looks like some tests are failing for MPI-related reasons (not normal - something is definitely wrong). Here's one case which fails; it looks like they all fail in the same way.
File : /home/delcmarc/cardinal/contrib/nekRS/3rd_party/occa/src/occa/internal/utils/sys.cpp
Line : 937
Function : dlopen
Message : Error loading binary [d810f609fc22f78e/binary] with dlopen: libmpi.so.12: cannot open shared object file: No such file or directory
Perhaps @loganharbour has an idea?
Any idea why I get these odd error messages when running the unit tests?
Is there some old state in your install from the previous build? Or did you forget to load the relevant modules?
That error comes from MPI no longer being in LD_LIBRARY_PATH - i.e., not "loaded".
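A quick way to confirm that (a sketch, assuming a standard Linux shell; the module name is just the one listed earlier in this thread) is to check the loaded modules and whether libmpi is actually reachable on LD_LIBRARY_PATH before launching the tests:
module list   # confirm openmpi/4.1.5_ucx1.14.1 appears
for d in $(echo $LD_LIBRARY_PATH | tr ':' ' '); do ls $d/libmpi.so* 2>/dev/null; done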
Thanks @loganharbour. In that case, @delcmo I'd suggest wiping out Cardinal (rm -rf cardinal) and rebuilding from scratch to make sure we don't have any old state.
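For reference, a from-scratch rebuild would look roughly like the following (a sketch, assuming the module and environment setup listed earlier in the thread is already in place; the submodule-fetch step should be whatever the Cardinal install instructions prescribe for your build options):
cd $HOME
rm -rf cardinal
git clone https://github.com/neams-th-coe/cardinal.git
cd cardinal
./scripts/get-dependencies.sh   # or the submodule-fetch step from the Cardinal docs
make -j8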
@aprilnovak I followed your suggestions and was able to recompile Cardinal and run the unit tests. 5 of them failed:
test:nek_standalone/channel.test ...................................... [min_cpus=2,FINISHED] FAILED (TIMEOUT)
test:nek_stochastic/quiet_init.driver_multi_2 ......................... [min_cpus=2,FINISHED] FAILED (TIMEOUT)
utils/meshes/interassembly.specs ............................................................. FAILED (CODE 1)
utils/meshes/interassembly_w_structures.specs ................................................ FAILED (CODE 1)
utils/meshes/assembly.specs .................................................................. FAILED (CODE 1)
--------------------------------------------------------------------------------------------------------------
Ran 528 tests in 3254.8 seconds. Average test time 29.3 seconds, maximum test time 401.5 seconds.
523 passed, 89 skipped, 0 pending, 5 FAILED
which seems like more reasonable behavior.
That looks better! Those are normal - we have a few tests (on the order of 5) which take a long time to run. Depending on the parallel settings you used to launch the test suite, those may time out. NekRS has a very slow JIT process the first time you run a test case.
If you re-run the test suite, you should (hopefully) see everything pass because NekRS will be able to use the JIT cache produced on the first test run, saving lots of time on each individual test.
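If you would rather not wait for the full suite again, the MOOSE TestHarness can re-run a subset of tests by name (hedged; check ./run_tests --help for the exact filtering option in your MOOSE checkout), e.g.:
./run_tests --re nek_standalone/channel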
I re-ran it. Only 4 tests failed, and one of them was a TIMEOUT. For the other three, I get a CODE 1 error because numpy is not found.
utils/meshes/assembly.specs: Working Directory: /home/delcmarc/cardinal/utils/meshes/assembly
utils/meshes/assembly.specs: Running command: python mesh.py
utils/meshes/assembly.specs: Traceback (most recent call last):
utils/meshes/assembly.specs:   File "mesh.py", line 5, in <module>
utils/meshes/assembly.specs:     import numpy as np
utils/meshes/assembly.specs: ModuleNotFoundError: No module named 'numpy'
utils/meshes/assembly.specs:
utils/meshes/assembly.specs: ################################################################################
utils/meshes/assembly.specs: Tester failed, reason: CODE 1
utils/meshes/assembly.specs:
I updated the PYTHONPATH with export PYTHONPATH=$CARDINAL_DIR/contrib/moose/python:$PYTHONPATH but it does not seem to help.
I was able to run the tests that are in the documentation https://cardinal.cels.anl.gov/hpc.html.
I would just try the following: pip install numpy, and then re-run. Those tests are running a Python script.
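As a quick sanity check (a sketch, assuming the python that the TestHarness invokes is the first one on your PATH; the contrib/moose/python directory added to PYTHONPATH provides the TestHarness modules, not numpy, so that export would not fix this):
which python
python -c "import numpy; print(numpy.__version__)"
pip install --user numpy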
Bug Description
I am trying to compile Cardinal on Sawtooth and get the following error message with make -j8:
I did follow the installation instructions on the Cardinal webpage and loaded all modules as instructed. PETSc and libMesh compiled fine as far as I can tell. I also checked that the paths of the files mentioned in the libtool: warning messages are all valid.
Steps to Reproduce
On Sawtooth, using the installation instructions from here.
Impact
I need Cardinal installed on Sawtooth for a project with the NEAMS Workbench.