trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/
Other
1.19k stars 564 forks source link

STK PR last week broke a number of Albany tests #3377

Closed ikalash closed 6 years ago

ikalash commented 6 years ago

The following PR broke a number of Albany tests last week: https://github.com/trilinos/Trilinos/commit/1d835e3595226bebef56c28862e4e8425af9431e . I have verified that reverting the commit yields a clean dashboard. The issue was originally discussed here in Albany issues: https://github.com/gahansen/Albany/issues/358 - transferring it to a Trilinos issue now that I have verified the problem is with Trilinos. To see verbose output from the failing tests, one can click on any of the tests here: http://cdash.sandia.gov/CDash-2-3-0/viewTest.php?onlyfailed&buildid=75093 (they all pass with the PR reverted). Most of them seem to be failures when running exodiff for LCM problems, but a few seem to have to do with reading in an Exodus file to restart from, e.g. http://cdash.sandia.gov/CDash-2-3-0/testDetails.php?test=3880312&build=75103:

p=2: *** Caught standard std::exception of type 'std::runtime_error' :
 ERROR: Variable type counts are inconsistent. See processor 0 output for more details.
p=3: *** Caught standard std::exception of type 'std::runtime_error' :
 ERROR: Variable type counts are inconsistent. See processor 0 output for more details.
p=0: *** Caught standard std::exception of type 'std::runtime_error' :
 ERROR: Number of nodeset variables is not consistent on all processors.
        Database: th1d_tpetra.exo
    Processor 0 count = 3
    Processor 1 count = 0
    Processor 2 count = 0

Can someone please look into this? We'd like to get a clean Albany dashboard again as soon as possible. Tagging Albany developers who are likely interested in this: @bartgol, @lxmota, @ibaned, @jwfoulk, @mperego.

@trilinos/stk, @trilinos/seacas

alanw0 commented 6 years ago

@ikalash Hi, there was a pull-request recently that brought in a stk update and that may have caused this IO issue. The stk team believes the issue has also affected another app, and a fix should be ready soon. We'll let you know when a new pull-request is ready. Is there any way for Trilinos pull-request testing to include some Albany tests?

ikalash commented 6 years ago

@alanw0 : glad to hear you’re working on a fix. I think running Albany tests for a trillions PR is a great idea but I’m not sure about the logistics of getting this set up. We’d need help from a trillions testing / maintenance person. Perhaps @bartlettroscoe can comment on the feasibility of getting this set up.

alanw0 commented 6 years ago

@ikalash I'm guessing PR testing won't be able to include Albany tests if Albany isn't a trilinos package. The better solution in this case is for stk to have stronger unit testing which includes this IO scenario.

ikalash commented 6 years ago

@alanw0 : you are probably right. I thought if you have PR testing for Drekar (also a stand-alone code depending on trillions) for example, then setting up the same thing for Albany should be straightforward.

bartlettroscoe commented 6 years ago

CC: @trilinos/framework

@ikalash, I don't think we should add APP testing to Trilinos PR testing. For example, if an Albany test failed, how would a regular Trilinos developer fix Albany tests to allow the PR branch to get merged? And if this was a non-backward compatible change, this gets even more complex.

If you want to discuss a different model to better stabilize Albany / Trilinos development and intetgraiton then let me know and I can describe the option. Who is in charge of Albany / Trilinos integration efforts?

ikalash commented 6 years ago

@bartlettroscoe : I agree. Although we try to keep a clean Albany dashboard, it does happen that people break Albany tests, which would be problematic here. There are a few of us who try to maintain stability of the Albany testing, most notably @ibaned (Dan Ibanez) and myself. Dan tends to follow Trilinos changes and their possible effect on Albany more than I do as he works on Kokkos.

ikalash commented 6 years ago

Are there any updates on this issue, in particular an ETA on when it will be resolved? We still have ~60 failing tests in the Albany CDash. We'd really like to have a clean dashboard as soon as possible so that we can maintain code quality of Albany.

alanw0 commented 6 years ago

I’ve been out a couple days, I’ll check on this. I thought they had fixed it, I’ll check on the status of pushes/PRs. Sorry for the delay!

jwillenbring commented 6 years ago

I'm guessing PR testing won't be able to include Albany tests if Albany isn't a trilinos package. The better solution in this case is for stk to have stronger unit testing which includes this IO scenario.

I agree with this assessment. In general, we are not planning to use customer tests for PR testing. There are too many cases where customer tests fail for reasons unrelated to Trilinos and we cannot clog the PR testing pipeline for that. We could reevaluate that for specific situations, but adding tests to Trilinos is a better option at the PR level.

mperego commented 6 years ago

I agree, but I think there was a general agreement that in some cases a commit could be reverted if it significantly impacts an application and if it would take too long to fix it. I think this decision ultimately belongs to the relevant product lead (@kddevin ). @ikalash how long do you think it would be reasonable to wait for a fix? Also a fix should come together with a Trilinos unit test that exposes that bug, so that we'll be able to catch that next time.

ikalash commented 6 years ago

I concur with what @mperego said: I agree that determining whether or not to push a Trilinos PR based on Albany is not realistic; however in this case I am sure the problem is in Trilinos b/c all Albany tests pass if you use Albany version of the day and revert the problematic commit. The Albany tests have been failing for more than a week now, which is why I'd like to have this fixed as soon as possible. We will be doing some refactoring work in Albany and it is important that the dashboard is clean.

alanw0 commented 6 years ago

I checked with the STK guys working on this. They are doing pre-push testing now on the sierra side. Realistically the fix won't make it to Trilinos before Monday. I understand if you want a revert, since you've been broken for quite a while. From the point of view of the nalu application, which I work on, a revert would be painful since nalu has already adapted to some stk API changes. But again, if you want a revert I can't blame you.

bartgol commented 6 years ago

If the PR fix is only a few days away, then I think it is ok for Albany to wait. We can hold off on merges/pushes for a bit, to avoid blending trilinos issues with potential in-house Albany ones.

Things change if the fix starts to slip a few weeks down the road, in which case we would have to come up with a plan B.

However, for the future, it may be helpful to identify a cut-off wait period: if there's a merge that brings in a substantial bug, and if a PR fix is predicted to be more than X days/weeks in the future, the Trilinos team may consider the possibility of a revert commit, to keep downstream customers builds clean. I'm not sure what X should be; maybe 1 or 2 weeks? How does that sound?

mperego commented 6 years ago

Thanks @alanw0. Do you guys have added a unit test in STK that exposes that bug?

@bartgol, We'll discuss this at Trilinos meetings, but I think it is easier to adopt a case by case solution rather than coming up with a cut-off wait period.

alanw0 commented 6 years ago

Everyone's being quite patient, we're sorry the fix has taken so long. In general the stk team would be fine with an immediate revert policy. In this case I was pleading for patience because of the nalu application having adapted to API changes that also came in with recent stk changes. Also, stk updates should be smaller and more frequent, rather than this large set of changes that came in with the first update in forever. Paul Wolfenbarger has done some great work setting up some scripts to pull stuff out of Sierra and into Trilinos so that updates should become almost automatic and certainly more frequent going forward.

In this case, the IO error appears to be the same as that coming from some Sierra tests that were broken at the same time. If it turns out that the Albany tests are not fixed by the coming stk fix, then the stk team will have to obtain albany and albany's tests to reproduce and fix. (And in that case, definitely revert while waiting.)

The stk team is generally very diligent about unit-testing, including test-driven-development most of the time. However some of the tests stay on the Sierra side and don't come out to Trilinos. I'm not sure where the newly-written IO tests are or will be.

prwolfe commented 6 years ago

@iklash, @alanw0, @khpierson

PR3416 should have fixed this, but I see these issues this morning. Is this testing done against master or develop?

ikalash commented 6 years ago

We test Albany against the develop branch of Trilinos. It seems the PR should be in there already, yet the nightlies are still failing as you noted. Hmmm, it seems the PR did not fix the Albany issues.

alanw0 commented 6 years ago

@ikalash I'm looking at the Albany build instructions... Is this the right build command to use to reproduce the issues? % spack install --keep-stage albany ^trilinos@develop

ikalash commented 6 years ago

I don't use spack to build Albany. I think @gahansen and @ibaned may know how to do that - I think they wrote those spack instructions. I can send you cmake configure instructions for Albany if you'd like.

alanw0 commented 6 years ago

ok. I've got that spack command chugging along right now so there's a chance it will work. Please do send your cmake instructions also, just in case.

ikalash commented 6 years ago

That's great. I know spack is supposed to facilitate the build, but I've never tried it. What platform are you building on so that I send you the most relevant configure scripts?

alanw0 commented 6 years ago

I'm building on a Linux gcc platform. thanks!

ikalash commented 6 years ago

You will find a sample Trilinos configure script for Linux gcc here: https://github.com/gahansen/Albany/blob/master/doc/nightlyTestHarness/do-cmake-trilinos-mpi-tpetra-no-scorec . For Albany, you should be able to see the problem if you just build demo PDEs:

cmake \
    -D ALBANY_TRILINOS_DIR:PATH=${TRILINSTALLDIR} \
    -D ENABLE_LCM:BOOL=OFF \
    -D ENABLE_CONTACT:BOOL=OFF \
    -D ENABLE_LCM_SPECULATIVE:BOOL=OFF \
    -D ENABLE_HYDRIDE:BOOL=OFF \
    -D ENABLE_SG:BOOL=OFF \
    -D ENABLE_ENSEMBLE:BOOL=OFF \
    -D ENABLE_LANDICE:BOOL=OFF \
    -D ENABLE_TSUNAMI:BOOL=OFF \
    -D ENABLE_AERAS:BOOL=OFF \
    -D ENABLE_QCAD:BOOL=OFF \
    -D ENABLE_MOR:BOOL=OFF \
    -D ENABLE_ATO:BOOL=OFF \
    -D ENABLE_ALBANY_EPETRA_EXE:BOOL=ON \
    -D ENABLE_AMP:BOOL=OFF \
    -D ENABLE_ASCR:BOOL=OFF \
    -D ENABLE_CHECK_FPE:BOOL=OFF \
    -D ENABLE_MPAS_INTERFACE:BOOL=OFF \
    -D ENABLE_CISM_INTERFACE:BOOL=OFF \
    -D ENABLE_DEMO_PDES:BOOL=ON \
..

Please let me know if you encounter any issues.

alanw0 commented 6 years ago

@ikalash I think I'm getting close to having a build. Can you point me to your dashboard? Or tell me a test which is showing the failure?

alanw0 commented 6 years ago

@ikalash I got my build working, and found that TransientHeat1D shows the IO error. We'll try to fix this immediately.

ikalash commented 6 years ago

Great. Sorry for my non-responsiveness, I was away from my desk. The TransientHeat1D problem is a good one to look at, as it is failing currently and a pretty simple problem.

alanw0 commented 6 years ago

ok the stk team is actively working on this, they think they have a fix, and are working on a unit-test to reproduce. We'll have @prwolfe put in a PR as soon as we have a commit ready.

ikalash commented 6 years ago

Thanks for the update @alanw0 ! I'm happy to test the fix once it's pushed.

alanw0 commented 6 years ago

@ikalash a PR is in progress which fixes the IO issue (the inconsistent number of nodeset variables). I notice that a couple of other tests are failing in my 'demo PDEs' build, they give this message: "The database type 'pamgen' is not supported.". Do you think that is due to my build not enabling everything, or is this something else potentially caused by stk changes? We don't think we've changed anything related to pamgen...

ikalash commented 6 years ago

I just grepped through my output file from last night's nightlies and I don't see that error. Can you point me to which specific test had that error so I can double check?

alanw0 commented 6 years ago

It is SteadyHeat3DTest_A (and also the B, C, D versions of the test). For this 'demo PDEs' build (the cmake script you provided above), ctest runs 102 tests and these 4 failed.

ikalash commented 6 years ago

@alanw0 : those are all passing in the dashboard for Albany, so the issue is likely due to your specific build. I can confirm once your PR gets pushed and the Albany nightlies have run with the changes.

alanw0 commented 6 years ago

@ikalash it looks like the stk PR made it into the develop branch. Fingers crossed for tonight's nightlies!

ikalash commented 6 years ago

Great! I'll let you know tomorrow morning if things are back to normal with the Albany dashboard.

ikalash commented 6 years ago

@alanw0 : it looks like your PR fixed the error reported above

p=2: *** Caught standard std::exception of type 'std::runtime_error' :
 ERROR: Variable type counts are inconsistent. See processor 0 output for more details.
p=3: *** Caught standard std::exception of type 'std::runtime_error' :
 ERROR: Variable type counts are inconsistent. See processor 0 output for more details.
p=0: *** Caught standard std::exception of type 'std::runtime_error' :
 ERROR: Number of nodeset variables is not consistent on all processors.
        Database: th1d_tpetra.exo
    Processor 0 count = 3
    Processor 1 count = 0
    Processor 2 count = 0

However we will have a lot of failures in Albany, mostly for tests that use exodiff and other seacas utils (e.g., algebra) to compare output files with gold files. Did any of these utilities change recently? When I was looking into this a few weeks ago, removing the commit with the STK PR discussed here resulted in a clean dashboard (see here for discussion: https://github.com/gahansen/Albany/issues/358 ). I'd like to double check this however to try to understand what exactly is going on with the exodiff failures. Unfortunately our CDash is down right now. Hopefully it will come back up soon so that I can inspect the errors and compare to the successful output a few weeks ago to get some information on what is going on.

alanw0 commented 6 years ago

@ikalash we don't think those seacas utilities changed recently, but can't be sure. When you say 'exodiff failures', do you mean that exodiff is saying the files don't match (the results changed) or that exodiff is failing/crashing?

ikalash commented 6 years ago

Exodiff runs, but the files are different. It looks like it's complaining that one of the files is missing some fields. Let me investigate a little bit to see if this is coming from a problem in Albany or Trilinos.

alanw0 commented 6 years ago

That test obviously didn't show up in my "demo PDEs" run. Can you give me the flags to enable so that I see that test? Then we can help chase this down.

ikalash commented 6 years ago

@alanw0 : sorry about the delay I wanted to try a few things before writing you back.

You are right, the tests that have this issue weren't on in your build. You can turn them on by setting:

-D ENABLE_LCM:BOOL=ON \

in your Albany cmake script. One test that fails is this one: tests/small/LCM/RigidBody. Here is what seems to be happening:

In a build with Trilinos develop and Albany master from today, exodiff comparisons fail as follows:

1:   FILE 1: /home/ikalash/nightlyCDash/build/IKTAlbany/tests/small/LCM/RigidBody/RigidRotation.exo
1:    Title: IOSS Default Output Title
1:           Dim = 3, Blocks = 1, Nodes = 64, Elements = 27, Nodesets = 4, Sidesets = 0
1:           Vars: Global = 0, Nodal = 11, Element = 160, Nodeset = 0, Sideset = 0, Times = 13
1: 
1:   FILE 2: /home/ikalash/nightlyCDash/repos/Albany/tests/small/LCM/RigidBody/RigidRotation.gold.exo
1:    Title: Sierra Output Default Title
1:           Dim = 3, Blocks = 1, Nodes = 64, Elements = 27, Nodesets = 4, Sidesets = 0
1:           Vars: Global = 0, Nodal = 11, Element = 160, Nodeset = 0, Sideset = 0, Times = 13
1: 
1:   COMMAND FILE: /home/ikalash/nightlyCDash/repos/Albany/tests/small/LCM/RigidBody/RigidRotation.exodiff
1: 
1: exodiff: DIFFERENCE .. Specified element variable "Cauchy_Stress_01" is not in the first file.
1: 
1: exodiff: DIFFERENCE .. Specified element variable "Cauchy_Stress_02" is not in the first file.
1: 

The first file is the file generated by the run. It looks like element variables like the Cauchy Stresses are no longer being written to this file, whereas they used to be.

Now, it would seem that we changed something in Albany to cause this behavior. However, if I pull this Trilinos (that just happened to be the one I was using when initially diagnosing the problems; I am pretty confident if I used Trilinos of the day I'd get the same behavior) 9bf901f0732da8c6f3629911986bd97b0c48dd9f with the following revert:

 Revert "Merge Pull Request #2930 from prwolfe/Trilinos/import_stk"

    This reverts commit 1d835e3595226bebef56c28862e4e8425af9431e, reversing
    changes made to 22c79ca9434ce101fc4c4200a90c5d9e42692ca5.

and use Albany master from today, the test passes (so it has all the relevant element variables).

It seems the PR #2930 somehow implicitly turned off writing element variables to the mesh - is that possible? Maybe they were written before by default and that default behavior has changed?

alanw0 commented 6 years ago

@ikalash ok thanks for the detailed info. I'm surprised that the stk PR turned off an writing element variable, but I can't rule it out. I'll use -DENABLE_LCM=ON and work on reproducing this issue asap.

ikalash commented 6 years ago

@alanw0 : thanks. I can also see what happens with a Trilinos of the day with the revert of PR #2930 (though there might be conflicts with your most recent PR). Let me know if this would be useful.

alanw0 commented 6 years ago

@ikalash No need for you to do anything else right now. I'm doing the LCM=ON Albany build (it's taking a little effort because that required some extra things enabled in my trilinos config...). We hope to have this debugged (and fixed) today. I'll keep you posted.

ikalash commented 6 years ago

Sounds great, thanks!

alanw0 commented 6 years ago

@ikalash We looked at the RigidBody test. The result file contains all of the fields, but their names have changed. This is due to some "improvements" in the way stk-io interprets field types. The gold file contains Cauchy_Stress_01 .. Cauchy_Stress_72 (i.e., a 1D vector field of length 72), while the new result file contains Cauchy_Stress_1_01 .. Cauchy_Stress_3_24. (i.e., a 2D field which is 3 vec-24 fields). This is influenced by how you declare the stk-field, and how you call stk::mesh::put_field.

It looks like Albany declares Cauchy_Stress as stk::mesh::Field which expands out to stk::mesh::Field<Cartesian3D,Cartesian3D,Cartesian3D>. Cartesian3D is badly named. It basically just means vector. and in your case, Cauchy_Stress has the 3 vector-lengths as 3,3,8. Those three lengths are passed to stk::mesh::put_field. The order in which you pass them influences how stk-io will create the names Cauchy_Stress_3_24 etc. I think the current limit on number-of-dimensions for stk-io fields is 2, which is why the second dimension is 24 == 3x8.

So there are a couple of options:

  1. You can rebless the test (replace the gold file). The contents of the results file appear correct, subject to the renamed fields.
  2. You can change the way you declare the Cauchy_Stress field. If you really want it to be a 1D vec-72, you could change it to stk::mesh::Field<double,Cartesian3D> and call put_field(.., 72).
  3. Does the field actually represent tensors at gauss-points or something like that? You could declare the field as stk::mesh::Field<double, 8, FullTensor33> which would possibly give a better field layout in the results file but would probably also change the way the field is indexed.

Sorry for this inconvenience. (And sorry for the implicit API/behavior change.) These changes came about because some users had new kinds of fancy field types, but exodus really only represents scalar fields (which is why displacements get stored as displ_x, displ_y, displ_z, etc.). So stk-io does some gymnastics to force fields into exodus, and those manipulations have changed a little to try to represent things a little more richly.

Let us know if you have questions or need more information. Once again, sorry for the ordeal!

ikalash commented 6 years ago

@alanw0 : thanks for looking into this! Yup, this convention change would cause the failures we are seeing.

Regarding your question 3.: yes, exactly, each of the stress variables is the value of the variable at a Gauss point (which is why there are so many). I'm not sure why what you suggest was not done before - there may have been a reason (tagging the people who would know below).

Tagging @jwfoulk and @lxmota , the main developers of these tests - Jay, Alejandro, how would you prefer to proceed regarding fixing the failing LCM exodiff tests (see above)? I think changing the gold files is probably too much of a pain given that there are so many gold files. Option 2 seems like the quickest way to fix this. It might be best for one of you to do this since you know the stress field layouts better than me; or maybe we can look at it together. 3. seems like a better overall approach but may be more work - perhaps you have looked into this in the past and decided not to go this route for some reason?

lxmota commented 6 years ago

We need to discuss this more thoroughly, since it will affect many tests and people.

ikalash commented 6 years ago

@lxmota : agreed which is why I tagged you and Jay. Let me know if you want to meet up to discuss this, or you may be planning to bring it up in one of your regular group meetings.

lxmota commented 6 years ago

@ikalash Let's discuss it first among Albany devs, so I can have a better understanding of the available options to fix this. I'm not sure that I can bring this up to the wider LCM group before the end of the FY, so we'll take it from there.

ikalash commented 6 years ago

OK sounds good. I think we can probably go ahead and close this Trilinos issue and move the discussion to the relevant Albany issue (https://github.com/gahansen/Albany/issues/358), since @alanw0 has fixed the STK issues that were present and pointed to changes required on the Albany side. If you agree, I'll go ahead and close this issue.

lxmota commented 6 years ago

@ikalash Sounds good to me. Thank you for tracking this down.