Closed ikalash closed 6 years ago
@ikalash Hi, there was a pull-request recently that brought in a stk update and that may have caused this IO issue. The stk team believes the issue has also affected another app, and a fix should be ready soon. We'll let you know when a new pull-request is ready. Is there any way for Trilinos pull-request testing to include some Albany tests?
@alanw0 : glad to hear you’re working on a fix. I think running Albany tests for a trillions PR is a great idea but I’m not sure about the logistics of getting this set up. We’d need help from a trillions testing / maintenance person. Perhaps @bartlettroscoe can comment on the feasibility of getting this set up.
@ikalash I'm guessing PR testing won't be able to include Albany tests if Albany isn't a trilinos package. The better solution in this case is for stk to have stronger unit testing which includes this IO scenario.
@alanw0 : you are probably right. I thought if you have PR testing for Drekar (also a stand-alone code depending on trillions) for example, then setting up the same thing for Albany should be straightforward.
CC: @trilinos/framework
@ikalash, I don't think we should add APP testing to Trilinos PR testing. For example, if an Albany test failed, how would a regular Trilinos developer fix Albany tests to allow the PR branch to get merged? And if this was a non-backward compatible change, this gets even more complex.
If you want to discuss a different model to better stabilize Albany / Trilinos development and intetgraiton then let me know and I can describe the option. Who is in charge of Albany / Trilinos integration efforts?
@bartlettroscoe : I agree. Although we try to keep a clean Albany dashboard, it does happen that people break Albany tests, which would be problematic here. There are a few of us who try to maintain stability of the Albany testing, most notably @ibaned (Dan Ibanez) and myself. Dan tends to follow Trilinos changes and their possible effect on Albany more than I do as he works on Kokkos.
Are there any updates on this issue, in particular an ETA on when it will be resolved? We still have ~60 failing tests in the Albany CDash. We'd really like to have a clean dashboard as soon as possible so that we can maintain code quality of Albany.
I’ve been out a couple days, I’ll check on this. I thought they had fixed it, I’ll check on the status of pushes/PRs. Sorry for the delay!
I'm guessing PR testing won't be able to include Albany tests if Albany isn't a trilinos package. The better solution in this case is for stk to have stronger unit testing which includes this IO scenario.
I agree with this assessment. In general, we are not planning to use customer tests for PR testing. There are too many cases where customer tests fail for reasons unrelated to Trilinos and we cannot clog the PR testing pipeline for that. We could reevaluate that for specific situations, but adding tests to Trilinos is a better option at the PR level.
I agree, but I think there was a general agreement that in some cases a commit could be reverted if it significantly impacts an application and if it would take too long to fix it. I think this decision ultimately belongs to the relevant product lead (@kddevin ). @ikalash how long do you think it would be reasonable to wait for a fix? Also a fix should come together with a Trilinos unit test that exposes that bug, so that we'll be able to catch that next time.
I concur with what @mperego said: I agree that determining whether or not to push a Trilinos PR based on Albany is not realistic; however in this case I am sure the problem is in Trilinos b/c all Albany tests pass if you use Albany version of the day and revert the problematic commit. The Albany tests have been failing for more than a week now, which is why I'd like to have this fixed as soon as possible. We will be doing some refactoring work in Albany and it is important that the dashboard is clean.
I checked with the STK guys working on this. They are doing pre-push testing now on the sierra side. Realistically the fix won't make it to Trilinos before Monday. I understand if you want a revert, since you've been broken for quite a while. From the point of view of the nalu application, which I work on, a revert would be painful since nalu has already adapted to some stk API changes. But again, if you want a revert I can't blame you.
If the PR fix is only a few days away, then I think it is ok for Albany to wait. We can hold off on merges/pushes for a bit, to avoid blending trilinos issues with potential in-house Albany ones.
Things change if the fix starts to slip a few weeks down the road, in which case we would have to come up with a plan B.
However, for the future, it may be helpful to identify a cut-off wait period: if there's a merge that brings in a substantial bug, and if a PR fix is predicted to be more than X days/weeks in the future, the Trilinos team may consider the possibility of a revert commit, to keep downstream customers builds clean. I'm not sure what X should be; maybe 1 or 2 weeks? How does that sound?
Thanks @alanw0. Do you guys have added a unit test in STK that exposes that bug?
@bartgol, We'll discuss this at Trilinos meetings, but I think it is easier to adopt a case by case solution rather than coming up with a cut-off wait period.
Everyone's being quite patient, we're sorry the fix has taken so long. In general the stk team would be fine with an immediate revert policy. In this case I was pleading for patience because of the nalu application having adapted to API changes that also came in with recent stk changes. Also, stk updates should be smaller and more frequent, rather than this large set of changes that came in with the first update in forever. Paul Wolfenbarger has done some great work setting up some scripts to pull stuff out of Sierra and into Trilinos so that updates should become almost automatic and certainly more frequent going forward.
In this case, the IO error appears to be the same as that coming from some Sierra tests that were broken at the same time. If it turns out that the Albany tests are not fixed by the coming stk fix, then the stk team will have to obtain albany and albany's tests to reproduce and fix. (And in that case, definitely revert while waiting.)
The stk team is generally very diligent about unit-testing, including test-driven-development most of the time. However some of the tests stay on the Sierra side and don't come out to Trilinos. I'm not sure where the newly-written IO tests are or will be.
@iklash, @alanw0, @khpierson
PR3416 should have fixed this, but I see these issues this morning. Is this testing done against master or develop?
We test Albany against the develop branch of Trilinos. It seems the PR should be in there already, yet the nightlies are still failing as you noted. Hmmm, it seems the PR did not fix the Albany issues.
@ikalash I'm looking at the Albany build instructions... Is this the right build command to use to reproduce the issues? % spack install --keep-stage albany ^trilinos@develop
I don't use spack to build Albany. I think @gahansen and @ibaned may know how to do that - I think they wrote those spack instructions. I can send you cmake configure instructions for Albany if you'd like.
ok. I've got that spack command chugging along right now so there's a chance it will work. Please do send your cmake instructions also, just in case.
That's great. I know spack is supposed to facilitate the build, but I've never tried it. What platform are you building on so that I send you the most relevant configure scripts?
I'm building on a Linux gcc platform. thanks!
You will find a sample Trilinos configure script for Linux gcc here: https://github.com/gahansen/Albany/blob/master/doc/nightlyTestHarness/do-cmake-trilinos-mpi-tpetra-no-scorec . For Albany, you should be able to see the problem if you just build demo PDEs:
cmake \
-D ALBANY_TRILINOS_DIR:PATH=${TRILINSTALLDIR} \
-D ENABLE_LCM:BOOL=OFF \
-D ENABLE_CONTACT:BOOL=OFF \
-D ENABLE_LCM_SPECULATIVE:BOOL=OFF \
-D ENABLE_HYDRIDE:BOOL=OFF \
-D ENABLE_SG:BOOL=OFF \
-D ENABLE_ENSEMBLE:BOOL=OFF \
-D ENABLE_LANDICE:BOOL=OFF \
-D ENABLE_TSUNAMI:BOOL=OFF \
-D ENABLE_AERAS:BOOL=OFF \
-D ENABLE_QCAD:BOOL=OFF \
-D ENABLE_MOR:BOOL=OFF \
-D ENABLE_ATO:BOOL=OFF \
-D ENABLE_ALBANY_EPETRA_EXE:BOOL=ON \
-D ENABLE_AMP:BOOL=OFF \
-D ENABLE_ASCR:BOOL=OFF \
-D ENABLE_CHECK_FPE:BOOL=OFF \
-D ENABLE_MPAS_INTERFACE:BOOL=OFF \
-D ENABLE_CISM_INTERFACE:BOOL=OFF \
-D ENABLE_DEMO_PDES:BOOL=ON \
..
Please let me know if you encounter any issues.
@ikalash I think I'm getting close to having a build. Can you point me to your dashboard? Or tell me a test which is showing the failure?
@ikalash I got my build working, and found that TransientHeat1D shows the IO error. We'll try to fix this immediately.
Great. Sorry for my non-responsiveness, I was away from my desk. The TransientHeat1D problem is a good one to look at, as it is failing currently and a pretty simple problem.
ok the stk team is actively working on this, they think they have a fix, and are working on a unit-test to reproduce. We'll have @prwolfe put in a PR as soon as we have a commit ready.
Thanks for the update @alanw0 ! I'm happy to test the fix once it's pushed.
@ikalash a PR is in progress which fixes the IO issue (the inconsistent number of nodeset variables). I notice that a couple of other tests are failing in my 'demo PDEs' build, they give this message: "The database type 'pamgen' is not supported.". Do you think that is due to my build not enabling everything, or is this something else potentially caused by stk changes? We don't think we've changed anything related to pamgen...
I just grepped through my output file from last night's nightlies and I don't see that error. Can you point me to which specific test had that error so I can double check?
It is SteadyHeat3DTest_A (and also the B, C, D versions of the test). For this 'demo PDEs' build (the cmake script you provided above), ctest runs 102 tests and these 4 failed.
@alanw0 : those are all passing in the dashboard for Albany, so the issue is likely due to your specific build. I can confirm once your PR gets pushed and the Albany nightlies have run with the changes.
@ikalash it looks like the stk PR made it into the develop branch. Fingers crossed for tonight's nightlies!
Great! I'll let you know tomorrow morning if things are back to normal with the Albany dashboard.
@alanw0 : it looks like your PR fixed the error reported above
p=2: *** Caught standard std::exception of type 'std::runtime_error' :
ERROR: Variable type counts are inconsistent. See processor 0 output for more details.
p=3: *** Caught standard std::exception of type 'std::runtime_error' :
ERROR: Variable type counts are inconsistent. See processor 0 output for more details.
p=0: *** Caught standard std::exception of type 'std::runtime_error' :
ERROR: Number of nodeset variables is not consistent on all processors.
Database: th1d_tpetra.exo
Processor 0 count = 3
Processor 1 count = 0
Processor 2 count = 0
However we will have a lot of failures in Albany, mostly for tests that use exodiff and other seacas utils (e.g., algebra) to compare output files with gold files. Did any of these utilities change recently? When I was looking into this a few weeks ago, removing the commit with the STK PR discussed here resulted in a clean dashboard (see here for discussion: https://github.com/gahansen/Albany/issues/358 ). I'd like to double check this however to try to understand what exactly is going on with the exodiff failures. Unfortunately our CDash is down right now. Hopefully it will come back up soon so that I can inspect the errors and compare to the successful output a few weeks ago to get some information on what is going on.
@ikalash we don't think those seacas utilities changed recently, but can't be sure. When you say 'exodiff failures', do you mean that exodiff is saying the files don't match (the results changed) or that exodiff is failing/crashing?
Exodiff runs, but the files are different. It looks like it's complaining that one of the files is missing some fields. Let me investigate a little bit to see if this is coming from a problem in Albany or Trilinos.
That test obviously didn't show up in my "demo PDEs" run. Can you give me the flags to enable so that I see that test? Then we can help chase this down.
@alanw0 : sorry about the delay I wanted to try a few things before writing you back.
You are right, the tests that have this issue weren't on in your build. You can turn them on by setting:
-D ENABLE_LCM:BOOL=ON \
in your Albany cmake script. One test that fails is this one: tests/small/LCM/RigidBody. Here is what seems to be happening:
In a build with Trilinos develop and Albany master from today, exodiff comparisons fail as follows:
1: FILE 1: /home/ikalash/nightlyCDash/build/IKTAlbany/tests/small/LCM/RigidBody/RigidRotation.exo
1: Title: IOSS Default Output Title
1: Dim = 3, Blocks = 1, Nodes = 64, Elements = 27, Nodesets = 4, Sidesets = 0
1: Vars: Global = 0, Nodal = 11, Element = 160, Nodeset = 0, Sideset = 0, Times = 13
1:
1: FILE 2: /home/ikalash/nightlyCDash/repos/Albany/tests/small/LCM/RigidBody/RigidRotation.gold.exo
1: Title: Sierra Output Default Title
1: Dim = 3, Blocks = 1, Nodes = 64, Elements = 27, Nodesets = 4, Sidesets = 0
1: Vars: Global = 0, Nodal = 11, Element = 160, Nodeset = 0, Sideset = 0, Times = 13
1:
1: COMMAND FILE: /home/ikalash/nightlyCDash/repos/Albany/tests/small/LCM/RigidBody/RigidRotation.exodiff
1:
1: exodiff: DIFFERENCE .. Specified element variable "Cauchy_Stress_01" is not in the first file.
1:
1: exodiff: DIFFERENCE .. Specified element variable "Cauchy_Stress_02" is not in the first file.
1:
The first file is the file generated by the run. It looks like element variables like the Cauchy Stresses are no longer being written to this file, whereas they used to be.
Now, it would seem that we changed something in Albany to cause this behavior. However, if I pull this Trilinos (that just happened to be the one I was using when initially diagnosing the problems; I am pretty confident if I used Trilinos of the day I'd get the same behavior) 9bf901f0732da8c6f3629911986bd97b0c48dd9f with the following revert:
Revert "Merge Pull Request #2930 from prwolfe/Trilinos/import_stk"
This reverts commit 1d835e3595226bebef56c28862e4e8425af9431e, reversing
changes made to 22c79ca9434ce101fc4c4200a90c5d9e42692ca5.
and use Albany master from today, the test passes (so it has all the relevant element variables).
It seems the PR #2930 somehow implicitly turned off writing element variables to the mesh - is that possible? Maybe they were written before by default and that default behavior has changed?
@ikalash ok thanks for the detailed info. I'm surprised that the stk PR turned off an writing element variable, but I can't rule it out. I'll use -DENABLE_LCM=ON and work on reproducing this issue asap.
@alanw0 : thanks. I can also see what happens with a Trilinos of the day with the revert of PR #2930 (though there might be conflicts with your most recent PR). Let me know if this would be useful.
@ikalash No need for you to do anything else right now. I'm doing the LCM=ON Albany build (it's taking a little effort because that required some extra things enabled in my trilinos config...). We hope to have this debugged (and fixed) today. I'll keep you posted.
Sounds great, thanks!
@ikalash We looked at the RigidBody test. The result file contains all of the fields, but their names have changed. This is due to some "improvements" in the way stk-io interprets field types. The gold file contains Cauchy_Stress_01 .. Cauchy_Stress_72 (i.e., a 1D vector field of length 72), while the new result file contains Cauchy_Stress_1_01 .. Cauchy_Stress_3_24. (i.e., a 2D field which is 3 vec-24 fields). This is influenced by how you declare the stk-field, and how you call stk::mesh::put_field.
It looks like Albany declares Cauchy_Stress as stk::mesh::Field
So there are a couple of options:
Sorry for this inconvenience. (And sorry for the implicit API/behavior change.) These changes came about because some users had new kinds of fancy field types, but exodus really only represents scalar fields (which is why displacements get stored as displ_x, displ_y, displ_z, etc.). So stk-io does some gymnastics to force fields into exodus, and those manipulations have changed a little to try to represent things a little more richly.
Let us know if you have questions or need more information. Once again, sorry for the ordeal!
@alanw0 : thanks for looking into this! Yup, this convention change would cause the failures we are seeing.
Regarding your question 3.: yes, exactly, each of the stress variables is the value of the variable at a Gauss point (which is why there are so many). I'm not sure why what you suggest was not done before - there may have been a reason (tagging the people who would know below).
Tagging @jwfoulk and @lxmota , the main developers of these tests - Jay, Alejandro, how would you prefer to proceed regarding fixing the failing LCM exodiff tests (see above)? I think changing the gold files is probably too much of a pain given that there are so many gold files. Option 2 seems like the quickest way to fix this. It might be best for one of you to do this since you know the stress field layouts better than me; or maybe we can look at it together. 3. seems like a better overall approach but may be more work - perhaps you have looked into this in the past and decided not to go this route for some reason?
We need to discuss this more thoroughly, since it will affect many tests and people.
@lxmota : agreed which is why I tagged you and Jay. Let me know if you want to meet up to discuss this, or you may be planning to bring it up in one of your regular group meetings.
@ikalash Let's discuss it first among Albany devs, so I can have a better understanding of the available options to fix this. I'm not sure that I can bring this up to the wider LCM group before the end of the FY, so we'll take it from there.
OK sounds good. I think we can probably go ahead and close this Trilinos issue and move the discussion to the relevant Albany issue (https://github.com/gahansen/Albany/issues/358), since @alanw0 has fixed the STK issues that were present and pointed to changes required on the Albany side. If you agree, I'll go ahead and close this issue.
@ikalash Sounds good to me. Thank you for tracking this down.
The following PR broke a number of Albany tests last week: https://github.com/trilinos/Trilinos/commit/1d835e3595226bebef56c28862e4e8425af9431e . I have verified that reverting the commit yields a clean dashboard. The issue was originally discussed here in Albany issues: https://github.com/gahansen/Albany/issues/358 - transferring it to a Trilinos issue now that I have verified the problem is with Trilinos. To see verbose output from the failing tests, one can click on any of the tests here: http://cdash.sandia.gov/CDash-2-3-0/viewTest.php?onlyfailed&buildid=75093 (they all pass with the PR reverted). Most of them seem to be failures when running exodiff for LCM problems, but a few seem to have to do with reading in an Exodus file to restart from, e.g. http://cdash.sandia.gov/CDash-2-3-0/testDetails.php?test=3880312&build=75103:
Can someone please look into this? We'd like to get a clean Albany dashboard again as soon as possible. Tagging Albany developers who are likely interested in this: @bartgol, @lxmota, @ibaned, @jwfoulk, @mperego.
@trilinos/stk, @trilinos/seacas