Closed: @jewatkins closed this issue 8 months ago.
Hi @jewatkins, thanks for checking this. The core and landice tests will be fixed as soon as PR https://github.com/trilinos/Trilinos/pull/12749 is merged. I fixed the sensitivity checking in Albany, and now an issue with Piro is exposed. Hopefully they'll pass tomorrow.
I'll fix the MPAS interface issue appearing on Blake test with a commit today.
@ikalash attaway and cee-compute004 (icc) build failures seem to be correlated to the cee/hpc project space consolidation.
I checked on cee-compute004, and I think it's OK. I was able to manually configure against the nightly-build Trilinos. Let's see what happens tomorrow.
Regarding attaway, it seems the issue is that the sems modules are not there anymore. Do you know anything about this, @jewatkins? I haven't been following the HPC announcements very closely. I can reach out to the sems team if this is a real problem.
@ikalash Yes, I looked a little more into it. I think the sems project space was moved/removed and sems needs to replace it. I have a ticket for sems-archive because this also broke modules that e3sm uses. I imagine the same fix will fix our module issue but if not, I'll ask about the regular sems modules.
Thanks @jewatkins . I will submit a separate ticket with the sems team now.
Missed a few places that needed fixes. I'll push a commit soon that should fix the remaining issues, including the PyAlbany ones.
Most systems look good now except cee-compute003 | Trilinos-Linux-3.10.0-1062.1.2.el7.x86_64-gcc-10.1.0-Debug-Serial Albany build is failing because it can't find trilinos.
@ikalash Any ideas on what's happening there?
> Most systems look good now except cee-compute003 | Trilinos-Linux-3.10.0-1062.1.2.el7.x86_64-gcc-10.1.0-Debug-Serial Albany build is failing because it can't find trilinos.
> @ikalash Any ideas on what's happening there?
Sorry for the delay in replying. I think cee-compute003 is just flaky. It looks like last night the Trilinos build ran, and the Albany build is still going.
Seems like there are still a lot of failures today. I noticed that the CALI problems are nan-ing out now (solver not converging: https://sems-cdash-son.sandia.gov/cdash/test/4108631). I know we don't care about CALI anymore, but it would be good to understand what's happening in case it's affecting other Albany problems.
There are a lot of failures. I don't think any build is completely clean today. @bartgol looks like you pushed yesterday, any ideas? https://sems-cdash-son.sandia.gov/cdash/index.php?project=Albany
It is possible that while fixing the last few MPAS-related errors I messed up something else... I should have re-run the whole suite, rather than pushing. I will double check (on Monday, at this point).
Other errors, however, look unrelated, like this one:

```
/scratch/albany/build/Albany64BitClang/src/Albany: error while loading shared libraries: libpanzer-expr-eval.so.15: cannot open shared object file: No such file or directory
```
I see the issue. Fix coming.
Some failures should be fixed by 972e86c9. However, I don't know what to do about the tests failing due to missing shared libs... We should log on to the test platforms and see if those Trilinos libs are indeed installed.
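To check whether the libs are a link-path issue or genuinely absent, one could run `ldd` on the Albany executable on the affected machine (using the path from the error above). The sketch below just demonstrates the check on a binary that exists on any Linux box; the Albany path in the comment is from the error message, not verified:

```shell
# On the test platform this would be something like:
#   ldd /scratch/albany/build/Albany64BitClang/src/Albany | grep "not found"
# Demonstrated here on /bin/ls, which should resolve all its libraries:
ldd /bin/ls | grep "not found" || echo "all shared libraries resolved"
```

grep exits nonzero when nothing is missing, so the fallback message prints on a healthy system; any "not found" lines point at the libraries to reinstall or add to `LD_LIBRARY_PATH`.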
Lots of tests seem to have been fixed. We still have a handful of failures related to the humboldt mesh. Not on all builds though. I suspect this is due to whether the offline mesh partitioning succeeded or not. I will verify that.
There are also the perf tests that keep failing, but they have been failing for as long as cdash has any record of them (since June last year). IIRC, the issue is how to make the target value in the regression tests machine-dependent, which we haven't found a solution for yet.
There are a lot of different test failures now (I don't see any lib failures). Regarding the perf tests, I only see the weaver perf tests failing with

```
Error: std::mesh::MetaData::declare_field_restriction FAILED for Field<double>["solution", #states: 1] FieldRestriction[ selector: "{UNIVERSAL}", dimension: 1, scalars per entity: 1 ] WITH INCOMPATIBLE REDECLARATION FieldRestriction[ selector: "{UNIVERSAL}", dimension: 2, scalars per entity: 2 ]
```

which seems new to me. I might be able to look into all this more next week to see what caused it.
I tried running one of my PyAlbany tuning cases (ant-4-20km) earlier today and ran into the same error.
Yes, that's due to the solution field being in the input file. Since we moved to STK "simple fields" (the STK people seemed to "encourage" this transition), we can no longer re-declare a field with a different layout. In this case, the input exo file has a field "solution" with 1 component, but the problem tries to re-declare it with 2 components.
In normal Albany tests, I fixed this by manually eliminating the "solution" field from the input mesh files. The following script can be used for this. Running `./the_script input.exo solution` should remove "solution" from the file "input.exo".
```bash
#!/bin/bash
if [ "$#" -ne 2 ]; then
  echo "Illegal number of parameters: $#. Usage:"
  echo "  remove_var_from_exofile <filename> <varname>"
  exit 1
fi
fname=$1
vname=$2

# The awk-force is strong in us...
nvars=$(ncdump -h "$fname" | grep 'num_nod_var =' | awk '{ print $3 }')
all_names=$(ncdump -v name_nod_var "$fname" | awk '/name_nod_var =/,EOF { print $0 }' | tail -n +2 | head -n "$nvars" | sed 's/[",\ ;]//g')
ivar=$(echo "$all_names" | awk -v pattern="$vname" '$0 ~ pattern { print FNR }')

# Shift nodal vars: copy i+1 onto i
for i in $(seq "$ivar" $((nvars-1))); do
  ncap2 -s "vals_nod_var${i}=vals_nod_var$((i+1))*1" -O "$fname" "$fname"
done

# Remove last nodal var
ncks -x -v "vals_nod_var${nvars}" -O "$fname" "$fname"

# Shift var names: copy name i onto name i-1 (name_nod_var rows are 0-based)
for i in $(seq "$ivar" $((nvars-1))); do
  prev=$((i-1))
  cmd="ncap2 -O -s 'name_nod_var($prev,:)=name_nod_var($i,:)' $fname $fname"
  eval $cmd
done

# Slice num_nod_var dimension
ncks -d num_nod_var,0,$((nvars-2)) -O "$fname" "$fname"

# Prune history
ncatted -h -a history,global,d,, -O "$fname" "$fname"
```
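As an aside, the index-lookup step in the script above (the awk `FNR` trick) can be checked in isolation. This is a self-contained sketch using a hard-coded list of nodal variable names in place of real ncdump output (the names here are made up):

```shell
# Stand-in for the ncdump-derived list of nodal variable names,
# one per line; in the real script this comes from the exo file.
all_names=$'ice_thickness\nsolution\nsurface_height'
vname="solution"

# FNR is the line number of the matching name, i.e. the 1-based
# index of the nodal variable to remove.
ivar=$(echo "$all_names" | awk -v pattern="$vname" '$0 ~ pattern { print FNR }')
echo "$ivar"   # "solution" is the second name, so this prints 2
```

Note that `$0 ~ pattern` is a regex substring match: a file containing both "solution" and, say, "solution_dot" would match twice and produce a multi-line `ivar`, which could make the script silently misbehave.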
Note: the switch to "simple fields" in STK was done in #1010.
Okay, I can try this, and that should fix some of the performance tests. @bartgol were you going to look at the other two failures?

```
ERROR: Could not open file '/scratch/albany/build/Albany64Bit/tests/landIce/AsciiMeshes/Humboldt/humboldt_2d.exo.4.0', error = No such file or directory.
```

```
Error! Piro::TempusSolver: time-integrator did not make it to final time specified in Input File. Final time in input file is 0.5, whereas actual final time is 0. If you'd like to suppress this exception, run with 'Abort on Failure' set to 'false' in Tempus sublist.
```
> ERROR: Could not open file '/scratch/albany/build/Albany64Bit/tests/landIce/AsciiMeshes/Humboldt/humboldt_2d.exo.4.0', error = No such file or directory.

I think that failure is b/c we don't have the partitioned file in the repo anymore. The cmake logic is supposed to run the seacas util to partition the mesh before running the test. I thought that if seacas IOP is not enabled, the test would not run anyway. I will double check if there's something amiss in the CMake logic.
> Error! Piro::TempusSolver: time-integrator did not make it to final time specified in Input File. Final time in input file is 0.5, whereas actual final time is 0. If you'd like to suppress this exception, run with 'Abort on Failure' set to 'false' in Tempus sublist.

This is hard to diagnose from the error msg. I will look at cdash and see if there's any previous sign of what could be going wrong.
Ok, I see that in the builds `ALBANY_PARALLEL_EXODUS=ON`, so we should load a serial mesh and distribute online. That's what we do for StokesFO tests, but it appears that the humboldt tests don't. Velocity tests input files have `Use Serial Mesh: ${USE_SERIAL_MESH}`, where `USE_SERIAL_MESH` expands to `true` if `ALBANY_PARALLEL_EXODUS=ON`, while the humboldt ones have `Use Serial Mesh: false`. @mperego I think you introduced the `USE_SERIAL_MESH` logic: can we use that also for the humboldt tests?
I don't think I introduced the logic, but sure, it makes sense to use it for the Humboldt tests as well. I'm working on converting some of the Epetra tests to Tpetra, and I can add this change as well.
I suppose we should just set all tests to `Use Serial Mesh: true` if the decomposed files are no longer there?
Well, that's what the `USE_SERIAL_MESH` cmake var does: if `ALBANY_PARALLEL_EXODUS=ON`, it's `true`, otherwise it's `false` (and a `FIXTURE_SETUP` test is run to "prepare" the mesh decomposition).
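For reference, a minimal sketch of what such a fixture pairing can look like in CTest. All target, tool, and file names here are hypothetical illustrations, not the actual Albany CMake code; the property names `FIXTURES_SETUP`/`FIXTURES_REQUIRED` are standard CTest:

```cmake
# Hypothetical sketch: run the mesh decomposition once as a fixture-setup
# test, and make the actual test require that fixture, so CTest orders
# (and gates) them correctly.
add_test(NAME humboldt_decomp_mesh
         COMMAND decomp -p 4 humboldt_2d.exo)
set_tests_properties(humboldt_decomp_mesh PROPERTIES
                     FIXTURES_SETUP humboldt_mesh)

add_test(NAME humboldt_velocity
         COMMAND Albany input_humboldt.yaml)
set_tests_properties(humboldt_velocity PROPERTIES
                     FIXTURES_REQUIRED humboldt_mesh)
```

With this arrangement, if the setup test fails (e.g. the seacas utility is unavailable), the dependent test is reported as not run rather than failing on a missing `.exo.4.0` file.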
> Yes, that's due to the solution field being in the input file. Since we moved to STK "simple fields" (the STK people seemed to "encourage" this transition), we can no longer re-declare a field with a different layout. In this case, the input exo file has a field "solution" with 1 component, but the problem tries to re-declare it with 2 components.
> In normal Albany tests, I fixed this by manually eliminating the "solution" field from the input mesh files. The following script can be used for this. Running `./the_script input.exo solution` should remove "solution" from the file "input.exo".
@bartgol I tried applying this to the meshes used by the weaver performance tests and it still seems to give the same error: https://sems-cdash-son.sandia.gov/cdash/test/4142794 Any ideas? The mesh file did seem to change. Let me see if it actually did remove "solution".
Hmm, "solution" is still there, so maybe the script didn't work? I didn't receive any errors from the script.
I used the script a while ago, so maybe I ended up hacking it somehow, and I gave you the wrong one? I'm not sure. Let me check.
There are still some issues with parallel mesh IO, but I'll wait until the no-epetra PR is merged to avoid conflicts.
@bartgol I think this is the last failing test related to your PR. demoPDEs_AdvDiff https://sems-cdash-son.sandia.gov/cdash/test/4154994
Passed Feb. 21, Failed Feb. 26. It's a debug build if that helps at all.
I will post new issues for the other failing tests.
Uhm, this is not as obvious as the other ones. I will debug to see what the deal is.
On my laptop the test passes, so this is not something blatantly wrong. It will take me more time to figure this out.
Btw, something is amiss in the Debug-NoWarn build:

```
"-DCMAKE_CXX_FLAGS:STRING='-O3 -Wall -Wno-clobbered -Wno-vla -Wno-pragmas -Wno-unknown-pragmas -Wno-unused-local-typedefs -Wno-literal-suffix -Wno-deprecated-declarations -Wno-misleading-indentation -Wno-int-in-bool-context -Wno-maybe-uninitialized -Wno-nonnull-compare -Wno-address -Wno-inline -Wno-return-type -Wno-mismatched-new-delete -Wno-catch-value -Wno-use-after-free -Werror'"
```

I would not expect to see `-O3` in a debug build...
> Btw, something is amiss in the Debug-NoWarn build:
> "-DCMAKE_CXX_FLAGS:STRING='-O3 -Wall -Wno-clobbered -Wno-vla -Wno-pragmas -Wno-unknown-pragmas -Wno-unused-local-typedefs -Wno-literal-suffix -Wno-deprecated-declarations -Wno-misleading-indentation -Wno-int-in-bool-context -Wno-maybe-uninitialized -Wno-nonnull-compare -Wno-address -Wno-inline -Wno-return-type -Wno-mismatched-new-delete -Wno-catch-value -Wno-use-after-free -Werror'"
> I would not expect to see -O3 in a debug build...

@kliegeois originally created this build. Not sure if `-O3` was intentional, but I don't see any issue with testing an optimized debug build. We have other debug builds with no optimizations.
The release build on blake runs fine: https://sems-cdash-son.sandia.gov/cdash/test/4154535, so it's definitely weird that the debug build fails.
I ran on my workstation through valgrind, and I'm getting some legit errors. This brings back the idea (that I thought I shared a while ago) that we should find a way to run valgrind tests (on selected problems at least).
240d6fbab should fix the nightlies. I tried running on blake, and the AdvDiff test now passes.
> Btw, something is amiss in the Debug-NoWarn build:
> "-DCMAKE_CXX_FLAGS:STRING='-O3 -Wall -Wno-clobbered -Wno-vla -Wno-pragmas -Wno-unknown-pragmas -Wno-unused-local-typedefs -Wno-literal-suffix -Wno-deprecated-declarations -Wno-misleading-indentation -Wno-int-in-bool-context -Wno-maybe-uninitialized -Wno-nonnull-compare -Wno-address -Wno-inline -Wno-return-type -Wno-mismatched-new-delete -Wno-catch-value -Wno-use-after-free -Werror'"
> I would not expect to see -O3 in a debug build...
>
> @kliegeois originally created this build. Not sure if -O3 was intentional but I don't see any issue testing an optimized debug build. We have other debug builds with no optimizations.

That's interesting about the flags. It's because that is how they are set in the blake debug build for Trilinos: https://github.com/sandialabs/Albany/blob/master/doc/dashboards/blake.sandia.gov/do-cmake-trilinos-gcc-debug. I am not sure of the history of this. I suggest removing the `-O3`. Are there any objections to this?
I also vote to remove `-O3`. While it is ok to optimize a debug build, I find it more useful to have a full release build (NDEBUG and opt flags ON) and a full debug build (no NDEBUG, no opt flags). This gives a wider span of scenarios when debugging failures.
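If the flags live in the nightly scripts as a single string, dropping the flag can be as simple as a sed substitution. This is an illustrative sketch on an abbreviated flags string, not the actual blake script:

```shell
# Abbreviated stand-in for the CXX flags string quoted above.
flags="-O3 -Wall -Wno-vla -Werror"

# Strip '-O3' along with the space that follows it.
flags=$(echo "$flags" | sed 's/-O3 *//')
echo "$flags"   # prints: -Wall -Wno-vla -Werror
```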
I have removed the '-O3' flag in the debug build on blake.
> I have removed the '-O3' flag in the debug build on blake.
@ikalash You might need to remove it from the albany config too.
I think this is all fixed now, with https://github.com/sandialabs/Albany/commit/240d6fbab2f8b8b6c89af36f6773221d4472c1fc
@mperego There are a lot of failures in the nightlies. Could these be related to some of your recent changes?
Failing tests:
- corePDEs_SteadyHeatConstrainedOpt2D_Dirichlet_Mixed_Params_Epetra | Failed | 33s 590ms | Completed (Failed) | Unstable | Broken
- corePDEs_SteadyHeatConstrainedOpt2D_Dirichlet_Mixed_Params_Tpetra | Failed | 33s 960ms | Completed (Failed) | Unstable | Broken
- corePDEs_SteadyHeatConstrainedOpt2D_Scalar_And_Dist_Param_Tpetra | Failed | 44s 320ms | Completed (Failed) | Unstable | Broken
- landIce_FO_GIS_AdjointSensitivity_Epetra | Failed | 24s 40ms | Completed (Failed) | Unstable | Broken
- landIce_FO_GIS_AdjointSensitivity_StiffeningBasalFriction_Epetra | Failed | 45s 480ms | Completed (Failed) | Unstable | Broken
- landIce_FO_GIS_AdjointSensitivity_StiffeningBasalFriction_Tpetra | Failed | 47s 890ms | Completed (Failed) | Unstable | Broken
- landIce_FO_GIS_AdjointSensitivity_Thickness_MoveSurfHeightAndBed | Failed | 29s 580ms | Completed (Failed) | Unstable | Broken
- landIce_FO_GIS_AdjointSensitivity_Tpetra | Failed | 34s 660ms | Completed (Failed) | Unstable | Broken
- landIce_FO_GIS_AdjointSensitivity_TwoParameters_Epetra | Failed | 27s 400ms | Completed (Failed) | Unstable | Broken
- landIce_FO_GIS_AdjointSensitivity_TwoParameters_Tpetra | Failed | 27s 640ms | Completed (Failed) | Unstable | Broken
Blake builds seem to be broken; one example:

```
Albany/src/landIce/interfaceWithMPAS/Interface.cpp:315:37: error: no matching function for call to 'Albany::SolverFactory::createSolver(Teuchos::RCP<const Teuchos::Comm >&, Teuchos::RCP&)'
```