Closed: @jewatkins closed this issue 8 months ago.
Hi @jewatkins, thanks for checking this. The core and landice tests will be fixed as soon as PR https://github.com/trilinos/Trilinos/pull/12749 is merged. I fixed the sensitivity checking in Albany, and now an issue with Piro is exposed. Hopefully they'll pass tomorrow.
I'll fix the MPAS interface issue appearing on Blake test with a commit today.
@ikalash attaway and cee-compute004 (icc) build failures seem to be correlated to the cee/hpc project space consolidation.
I checked on cee-compute004, and I think it's OK. I was able to manually configure against the nightly-build Trilinos. Let's see what happens tomorrow.
Regarding attaway, it seems the issue is that the sems modules are not there anymore. Do you know anything about this, @jewatkins? I haven't been following the HPC announcements very closely. I can reach out to the sems team if this is a real problem.
@ikalash Yes, I looked a little more into it. I think the sems project space was moved/removed and sems needs to replace it. I have a ticket for sems-archive because this also broke modules that e3sm uses. I imagine the same fix will fix our module issue but if not, I'll ask about the regular sems modules.
Thanks @jewatkins . I will submit a separate ticket with the sems team now.
Missed a few places that needed fixes. I'll push a commit soon that should fix the remaining issues, including the PyAlbany ones.
Most systems look good now except cee-compute003 | Trilinos-Linux-3.10.0-1062.1.2.el7.x86_64-gcc-10.1.0-Debug-Serial Albany build is failing because it can't find trilinos.
@ikalash Any ideas on what's happening there?
> Most systems look good now except cee-compute003 | Trilinos-Linux-3.10.0-1062.1.2.el7.x86_64-gcc-10.1.0-Debug-Serial Albany build is failing because it can't find trilinos.
> @ikalash Any ideas on what's happening there?
Sorry for the delay in replying. I think cee-compute003 is just flaky. It looks like last night the Trilinos build ran, and the Albany build is still going.
Seems like there are still a lot of failures today. I noticed that the CALI problems are nan-ing out now (solver not converging: https://sems-cdash-son.sandia.gov/cdash/test/4108631). I know we don't care about CALI anymore, but it would be good to understand what's happening in case it's affecting other Albany problems.
There are a lot of failures. I don't think any build is completely clean today. @bartgol looks like you pushed yesterday, any ideas? https://sems-cdash-son.sandia.gov/cdash/index.php?project=Albany
It is possible that while fixing the last few MPAS-related errors I messed up something else... I should have re-run the whole suite, rather than pushing. I will double check (on Monday, at this point).
Other errors, however, look unrelated, like this one:

```
/scratch/albany/build/Albany64BitClang/src/Albany: error while loading shared libraries: libpanzer-expr-eval.so.15: cannot open shared object file: No such file or directory
```
I see the issue. Fix coming.
Some failures should be fixed by 972e86c9. However, I don't know what to do about the tests failing due to missing shared libs... We should log on to the test platforms and see if those Trilinos libs are indeed installed.
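To check whether the libs are a link-path issue or genuinely absent, one could run `ldd` on the Albany executable on the affected machine (using the path from the error above). The sketch below just demonstrates the check on a binary that exists on any Linux box; the Albany path in the comment is from the error message, not verified:

```shell
# On the test platform this would be something like:
#   ldd /scratch/albany/build/Albany64BitClang/src/Albany | grep "not found"
# Demonstrated here on /bin/ls, which should resolve all its libraries:
ldd /bin/ls | grep "not found" || echo "all shared libraries resolved"
```

grep exits nonzero when nothing is missing, so the fallback message prints on a healthy system; any "not found" lines point at the libraries to reinstall or add to `LD_LIBRARY_PATH`.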
Lots of tests seem to have been fixed. We still have a handful of failures related to the humboldt mesh. Not on all builds though. I suspect this is due to whether the offline mesh partitioning succeeded or not. I will verify that.
There are also the perf tests that keep failing, but they have been failing for as long as cdash has any record of them (since June last year). IIRC, the issue is how to make the target value in the regression tests machine-dependent, which we haven't found a solution for yet.
There are a lot of different test failures now (I don't see any lib failures). Regarding the perf tests, I only see the weaver perf tests failing with

```
Error: std::mesh::MetaData::declare_field_restriction FAILED for Field<double>["solution", #states: 1] FieldRestriction[ selector: "{UNIVERSAL}", dimension: 1, scalars per entity: 1 ] WITH INCOMPATIBLE REDECLARATION FieldRestriction[ selector: "{UNIVERSAL}", dimension: 2, scalars per entity: 2 ]
```

which seems new to me. I might be able to look into all this more next week to see what caused it.
I tried running one of my PyAlbany tuning cases (ant-4-20km) earlier today and ran into the same error.
Yes, that's due to the solution field being in the input file. Since we moved to STK "simple fields" (the STK people seemed to "encourage" this transition), we can no longer re-declare a field with a different layout. In this case, the input exo file has a field "solution" with 1 component, but the problem tries to re-declare it with 2 components.
In normal Albany tests, I fixed this by manually eliminating the "solution" field from the input mesh files. The following script can be used for this. Running `./the_script input.exo solution` should remove "solution" from the file "input.exo".
```bash
#!/bin/bash
if [ "$#" -ne 2 ]; then
  echo "Illegal number of parameters: $#. Usage:"
  echo "  remove_var_from_exofile <filename> <varname>"
  exit 1
fi
fname=$1
vname=$2

# The awk-force is strong in us...
nvars=$(ncdump -h "$fname" | grep 'num_nod_var =' | awk '{ print $3 }')
all_names=$(ncdump -v name_nod_var "$fname" | awk '/name_nod_var =/,EOF { print $0 }' | tail -n +2 | head -n "$nvars" | sed 's/[",\ ;]//g')
ivar=$(echo "$all_names" | awk -v pattern="$vname" '$0 ~ pattern { print FNR }')

# Shift nodal vars: copy i+1 onto i
for i in $(seq "$ivar" $((nvars-1))); do
  ncap2 -s "vals_nod_var${i}=vals_nod_var$((i+1))*1" -O "$fname" "$fname"
done

# Remove last nodal var
ncks -x -v "vals_nod_var${nvars}" -O "$fname" "$fname"

# Shift var names: copy name i onto name i-1 (name_nod_var rows are 0-based)
for i in $(seq "$ivar" $((nvars-1))); do
  prev=$((i-1))
  cmd="ncap2 -O -s 'name_nod_var($prev,:)=name_nod_var($i,:)' $fname $fname"
  eval $cmd
done

# Slice num_nod_var dimension
ncks -d num_nod_var,0,$((nvars-2)) -O "$fname" "$fname"

# Prune history
ncatted -h -a history,global,d,, -O "$fname" "$fname"
```
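As an aside, the index-lookup step in the script above (the awk `FNR` trick) can be checked in isolation. This is a self-contained sketch using a hard-coded list of nodal variable names in place of real ncdump output (the names here are made up):

```shell
# Stand-in for the ncdump-derived list of nodal variable names,
# one per line; in the real script this comes from the exo file.
all_names=$'ice_thickness\nsolution\nsurface_height'
vname="solution"

# FNR is the line number of the matching name, i.e. the 1-based
# index of the nodal variable to remove.
ivar=$(echo "$all_names" | awk -v pattern="$vname" '$0 ~ pattern { print FNR }')
echo "$ivar"   # "solution" is the second name, so this prints 2
```

Note that `$0 ~ pattern` is a regex substring match: a file containing both "solution" and, say, "solution_dot" would match twice and produce a multi-line `ivar`, which could make the script silently misbehave.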
Note: the switch to "simple fields" in STK was done in #1010.
Okay, I can try this, and that should fix some of the performance tests. @bartgol were you going to look at the other two failures?

```
ERROR: Could not open file '/scratch/albany/build/Albany64Bit/tests/landIce/AsciiMeshes/Humboldt/humboldt_2d.exo.4.0', error = No such file or directory.
```

```
Error! Piro::TempusSolver: time-integrator did not make it to final time specified in Input File. Final time in input file is 0.5, whereas actual final time is 0. If you'd like to suppress this exception, run with 'Abort on Failure' set to 'false' in Tempus sublist.
```
> ERROR: Could not open file '/scratch/albany/build/Albany64Bit/tests/landIce/AsciiMeshes/Humboldt/humboldt_2d.exo.4.0', error = No such file or directory.

I think that failure is b/c we don't have the partitioned file in the repo anymore. The cmake logic is supposed to run the seacas util to partition the mesh before running the test. I thought that if seacas IOP is not enabled, the test would not run anyway. I will double check if there's something amiss in the CMake logic.
> Error! Piro::TempusSolver: time-integrator did not make it to final time specified in Input File. Final time in input file is 0.5, whereas actual final time is 0. If you'd like to suppress this exception, run with 'Abort on Failure' set to 'false' in Tempus sublist.

This is hard to diagnose from the error msg. I will look at cdash and see if there's any previous sign of what could be going wrong.
Ok, I see that in the builds `ALBANY_PARALLEL_EXODUS=ON`, so we should load a serial mesh and distribute online. That's what we do for StokesFO tests, but it appears that the humboldt tests don't. Velocity tests input files have `Use Serial Mesh: ${USE_SERIAL_MESH}`, where `USE_SERIAL_MESH` expands to `true` if `ALBANY_PARALLEL_EXODUS=ON`, while the humboldt ones have `Use Serial Mesh: false`. @mperego I think you introduced the `USE_SERIAL_MESH` logic: can we use that also for the humboldt tests?
I don't think I introduced the logic, but sure, it makes sense to use it for the Humboldt tests as well. I'm working on converting some of the Epetra tests to Tpetra, and I can add this change as well.
I suppose we should just set all tests to `Use Serial Mesh: true` if the decomposed files are no longer there?
Well, that's what the `USE_SERIAL_MESH` cmake var does: if `ALBANY_PARALLEL_EXODUS=ON`, it's `true`, otherwise it's `false` (and a `FIXTURE_SETUP` test is run to "prepare" the mesh decomposition).
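For reference, a minimal sketch of what such a fixture pairing can look like in CTest. All target, tool, and file names here are hypothetical illustrations, not the actual Albany CMake code; the property names `FIXTURES_SETUP`/`FIXTURES_REQUIRED` are standard CTest:

```cmake
# Hypothetical sketch: run the mesh decomposition once as a fixture-setup
# test, and make the actual test require that fixture, so CTest orders
# (and gates) them correctly.
add_test(NAME humboldt_decomp_mesh
         COMMAND decomp -p 4 humboldt_2d.exo)
set_tests_properties(humboldt_decomp_mesh PROPERTIES
                     FIXTURES_SETUP humboldt_mesh)

add_test(NAME humboldt_velocity
         COMMAND Albany input_humboldt.yaml)
set_tests_properties(humboldt_velocity PROPERTIES
                     FIXTURES_REQUIRED humboldt_mesh)
```

With this arrangement, if the setup test fails (e.g. the seacas utility is unavailable), the dependent test is reported as not run rather than failing on a missing `.exo.4.0` file.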
> Yes, that's due to the solution field being in the input file. Since we moved to STK "simple fields" (the STK people seemed to "encourage" this transition), we can no longer re-declare a field with a different layout. In this case, the input exo file has a field "solution" with 1 component, but the problem tries to re-declare it with 2 components.
> In normal Albany tests, I fixed this by manually eliminating the "solution" field from the input mesh files. The following script can be used for this. Running `./the_script input.exo solution` should remove "solution" from the file "input.exo".
@bartgol I tried applying this to the meshes used by the weaver performance tests and it still seems to give the same error: https://sems-cdash-son.sandia.gov/cdash/test/4142794 Any ideas? The mesh file did seem to change. Let me see if it actually did remove "solution".
Hmm, "solution" is still there, so maybe the script didn't work? I didn't receive any errors from the script.
I used the script a while ago, so maybe I ended up hacking it somehow, and I gave you the wrong one? I'm not sure. Let me check.
There are still some issues with parallel mesh IO, but I'll wait until the no-epetra PR is merged to avoid conflicts.
@bartgol I think this is the last failing test related to your PR. demoPDEs_AdvDiff https://sems-cdash-son.sandia.gov/cdash/test/4154994
Passed Feb. 21, Failed Feb. 26. It's a debug build if that helps at all.
I will post new issues for the other failing tests.
Uhm, this is not as obvious as the other ones. I will debug to see what the deal is.
On my laptop the test passes, so this is not something blatantly wrong. It will take me more time to figure this out.
Btw, something is amiss in the Debug-NoWarn build:

```
"-DCMAKE_CXX_FLAGS:STRING='-O3 -Wall -Wno-clobbered -Wno-vla -Wno-pragmas -Wno-unknown-pragmas -Wno-unused-local-typedefs -Wno-literal-suffix -Wno-deprecated-declarations -Wno-misleading-indentation -Wno-int-in-bool-context -Wno-maybe-uninitialized -Wno-nonnull-compare -Wno-address -Wno-inline -Wno-return-type -Wno-mismatched-new-delete -Wno-catch-value -Wno-use-after-free -Werror'"
```

I would not expect to see `-O3` in a debug build...
> Btw, something is amiss in the Debug-NoWarn build:
> "-DCMAKE_CXX_FLAGS:STRING='-O3 -Wall -Wno-clobbered -Wno-vla -Wno-pragmas -Wno-unknown-pragmas -Wno-unused-local-typedefs -Wno-literal-suffix -Wno-deprecated-declarations -Wno-misleading-indentation -Wno-int-in-bool-context -Wno-maybe-uninitialized -Wno-nonnull-compare -Wno-address -Wno-inline -Wno-return-type -Wno-mismatched-new-delete -Wno-catch-value -Wno-use-after-free -Werror'"
> I would not expect to see -O3 in a debug build...

@kliegeois originally created this build. Not sure if `-O3` was intentional, but I don't see any issue with testing an optimized debug build. We have other debug builds with no optimizations.
The release build on blake runs fine: https://sems-cdash-son.sandia.gov/cdash/test/4154535, so it's definitely weird that the debug build fails.
I ran on my workstation through valgrind, and I'm getting some legit errors. This brings back the idea (that I thought I shared a while ago) that we should find a way to run valgrind tests (on selected problems at least).
240d6fbab should fix the nightlies. I tried running on blake, and the AdvDiff test now passes.
> Btw, something is amiss in the Debug-NoWarn build:
> "-DCMAKE_CXX_FLAGS:STRING='-O3 -Wall -Wno-clobbered -Wno-vla -Wno-pragmas -Wno-unknown-pragmas -Wno-unused-local-typedefs -Wno-literal-suffix -Wno-deprecated-declarations -Wno-misleading-indentation -Wno-int-in-bool-context -Wno-maybe-uninitialized -Wno-nonnull-compare -Wno-address -Wno-inline -Wno-return-type -Wno-mismatched-new-delete -Wno-catch-value -Wno-use-after-free -Werror'"
> I would not expect to see -O3 in a debug build...
>
> @kliegeois originally created this build. Not sure if -O3 was intentional but I don't see any issue testing an optimized debug build. We have other debug builds with no optimizations.

That's interesting about the flags. It's because that is how they are set in the blake debug build for Trilinos: https://github.com/sandialabs/Albany/blob/master/doc/dashboards/blake.sandia.gov/do-cmake-trilinos-gcc-debug. I am not sure of the history of this. I suggest removing the `-O3`. Are there any objections to this?
I also vote to remove `-O3`. While it is ok to optimize a debug build, I find it more useful to have a full release build (NDEBUG and opt flags ON) and a full debug build (no NDEBUG, no opt flags). This gives a wider span of scenarios when debugging failures.
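If the flags live in the nightly scripts as a single string, dropping the flag can be as simple as a sed substitution. This is an illustrative sketch on an abbreviated flags string, not the actual blake script:

```shell
# Abbreviated stand-in for the CXX flags string quoted above.
flags="-O3 -Wall -Wno-vla -Werror"

# Strip '-O3' along with the space that follows it.
flags=$(echo "$flags" | sed 's/-O3 *//')
echo "$flags"   # prints: -Wall -Wno-vla -Werror
```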
I have removed the '-O3' flag in the debug build on blake.
> I have removed the '-O3' flag in the debug build on blake.
@ikalash You might need to remove it from the albany config too.
I think this is all fixed now, with https://github.com/sandialabs/Albany/commit/240d6fbab2f8b8b6c89af36f6773221d4472c1fc
@mperego There are a lot of failures in the nightlies. Could these be related to some of your recent changes?
Failing tests:
- corePDEs_SteadyHeatConstrainedOpt2D_Dirichlet_Mixed_Params_Epetra | Failed | 33s 590ms | Completed (Failed) | Unstable | Broken
- corePDEs_SteadyHeatConstrainedOpt2D_Dirichlet_Mixed_Params_Tpetra | Failed | 33s 960ms | Completed (Failed) | Unstable | Broken
- corePDEs_SteadyHeatConstrainedOpt2D_Scalar_And_Dist_Param_Tpetra | Failed | 44s 320ms | Completed (Failed) | Unstable | Broken
- landIce_FO_GIS_AdjointSensitivity_Epetra | Failed | 24s 40ms | Completed (Failed) | Unstable | Broken
- landIce_FO_GIS_AdjointSensitivity_StiffeningBasalFriction_Epetra | Failed | 45s 480ms | Completed (Failed) | Unstable | Broken
- landIce_FO_GIS_AdjointSensitivity_StiffeningBasalFriction_Tpetra | Failed | 47s 890ms | Completed (Failed) | Unstable | Broken
- landIce_FO_GIS_AdjointSensitivity_Thickness_MoveSurfHeightAndBed | Failed | 29s 580ms | Completed (Failed) | Unstable | Broken
- landIce_FO_GIS_AdjointSensitivity_Tpetra | Failed | 34s 660ms | Completed (Failed) | Unstable | Broken
- landIce_FO_GIS_AdjointSensitivity_TwoParameters_Epetra | Failed | 27s 400ms | Completed (Failed) | Unstable | Broken
- landIce_FO_GIS_AdjointSensitivity_TwoParameters_Tpetra | Failed | 27s 640ms | Completed (Failed) | Unstable | Broken
Blake builds seem to be broken; one example:

```
Albany/src/landIce/interfaceWithMPAS/Interface.cpp:315:37: error: no matching function for call to 'Albany::SolverFactory::createSolver(Teuchos::RCP<const Teuchos::Comm >&, Teuchos::RCP&)'
```