sandialabs / Albany

Sandia National Laboratories' Albany multiphysics code

New LandIce failures in nightlies #597

Closed ikalash closed 4 years ago

ikalash commented 4 years ago

Looks like starting yesterday a bunch of the LandIce FO_GIS nightlies began failing with this error in the CEE gcc builds:

[GenericSTKMeshStruct] Processing field requirements...

  - Reading Node Scalar field 'basal_friction' from file '../AsciiMeshes/GisUnstructFiles/basal_friction_log.ascii' ... done!
  - Reading Node Scalar field 'ice_thickness' from file '../AsciiMeshes/GisUnstructFiles/thickness.ascii' ... done!
  - Reading Node Scalar field 'surface_height' from file '../AsciiMeshes/GisUnstructFiles/surface_height.ascii' ... done!
  - Reading Node Layered Scalar field 'temperature' from file '../AsciiMeshes/GisUnstructFiles/temperature.ascii' ...

p=0: *** Caught standard std::exception of type 'N7Teuchos10Exceptions21InvalidParameterValueE' :

 /projects/albany/nightlyAlbanyCDash/repos/Albany/src/disc/stk/Albany_GenericSTKMeshStruct.cpp:1524:

 Throw number = 1

 Throw test that evaluated to true: numLayers != normalizedLayersCoords.size()

 Error in GenericSTKMeshStruct: Number of layers in file ../AsciiMeshes/GisUnstructFiles/temperature.ascii (11) is different from the number expected (0). To fix this, please specify the correct layered data dimension when you register the state.

https://sems-cdash-son.sandia.gov/cdash/testDetails.php?test=197674&build=2218 https://sems-cdash-son.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=2218

Does the number of layers in the temperature file need to be specified? It's curious that this is only happening in the CEE gcc builds, not the others.

bartgol commented 4 years ago

The fact that it says the expected number of layers is 0 suggests the temperature file is ok, but something is wrong with the state registration.

In particular, the normalized layers coordinates are not correctly sized in the stk field container, which is odd.

Is the field `Number Of Layers` by any chance equal to 0? I checked all GIS input files, and they all have the correct number (11).

One thing to do (probably a good idea anyways) is to stick in a dimension check when we store the normalized layers coordinates in the stk field container. At line 213 in Albany_GenericSTKFieldContainer_Def.hpp, can you try putting in something like

TEUCHOS_TEST_FOR_EXCEPTION(dim.back() <= 0, std::logic_error, "Error! Invalid layer dimension for state " + st.name + ".\n");

In those failing tests, this should throw. It won't solve the problem, but it will make the error pop up sooner, and hopefully be easier to track. E.g., one can fire up gdb, inspect the st structure, and see if something is amiss.

Btw, I am not vpn-ed in now, so I can't see the cdash. What tests are failing?

ikalash commented 4 years ago

@gahansen : could I ask you to have a look at this issue? It only happens on CEE with the new gcc compiler you built. I checked some older versions of the code and the issue still shows up, which is somewhat puzzling. Maybe it's due to some of the settings in that build that we were tweaking (changing lib paths, for instance)?

gahansen commented 4 years ago

Glad to, Irina! I'll start on this tomorrow and post an update once I have some info.

ikalash commented 4 years ago

That's great, thanks a lot, Glen! Sounds like a plan!

gahansen commented 4 years ago

@ikalash, I was able to duplicate this error. I'm also thinking it is something with the build or a TPL version change. I'll do a process of elimination and figure out what is awry here. I agree, it is interesting that it only happens with the gcc build and not the two clang builds we have access to.

ikalash commented 4 years ago

Thanks Glen! Please keep me posted.

gahansen commented 4 years ago

Hi Irina, I have some results on this issue. It appears that gcc-9.1.0 is not generating the correct code for this operation that Luca describes above. I also tried the new gcc-10.1.0, and then fell back to gcc-7.0.2. The two newer compilers both show this failure, but the older one does not. I am going to look at that section of the code and try and figure out what is happening and how to avoid it.

ikalash commented 4 years ago

Thanks for looking into this @gahansen ! It's curious that the issue doesn't show up in some of the other nightlies - my camobap and mockba nightlies use gcc 9 for instance. Please keep me posted - I am interested in what you learn as you dig in more.

gahansen commented 4 years ago

That is interesting information - I might look at the install a little more then. What version of MPI do you use on camobap and mockba?

ikalash commented 4 years ago

@gahansen : camobap is down right now getting its hard drives replaced and its OS upgraded, so I cannot check. mockba uses gcc/9.2.0 and openmpi/1.10.1. I can tell you camobap was using gcc/9.3.0 before it went down. I guess one difference is you use 9.1.0, but I'd be surprised if that is the cause. This is why I thought the TPLs may be the problem.

ikalash commented 4 years ago

Also, the gcc failures started on 5/27, if you look at the CDash site. The last successful build was already using gcc/9.1.0, which makes this more puzzling: https://sems-cdash-son.sandia.gov/cdash/viewConfigure.php?buildid=2108 . So what happened on 5/27? I tried pulling Trilinos and Albany from before 5/27 and got the same error, so I don't think it's a code change, but it may be worth double checking.

gahansen commented 4 years ago

@ikalash - I just put in a commit that should resolve this issue. There is a lengthy commit message that sums up the issue and the commit, so I won't repeat it here other than to note that the GCC issue boiled down to inconsistencies between shared libraries in the GCC TPLs on CEE. I made some changes to my example builds that you will want to capture in the nightlies, but that, along with the cleanup I did in our environment on CEE, should resolve this issue.
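For anyone hitting a similar shared-library skew: a quick way to audit what a binary actually resolves at load time is `ldd` plus a grep for the usual suspects. These commands are illustrative, not taken from the commit; `ALBANY_BIN` is a placeholder for the Albany executable under suspicion, with `/bin/sh` as a runnable stand-in:

```shell
# Point BIN at the Albany executable under suspicion; /bin/sh is just a
# stand-in so the commands run anywhere.
BIN="${ALBANY_BIN:-/bin/sh}"

# Which shared libraries does the binary resolve at load time? Stale or
# mismatched TPL installs show up here as unexpected paths.
ldd "$BIN"

# Narrow to the usual suspects in a C++/netCDF stack (prints nothing for
# /bin/sh, which links none of these).
ldd "$BIN" | grep -E 'libstdc\+\+|libhdf5|libnetcdf|libpnetcdf' || true
```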

ikalash commented 4 years ago

Thanks @gahansen ! That's very interesting about the TPL inconsistencies. I'll close this issue tomorrow once the nightlies run clean.

mperego commented 4 years ago

@gahansen thanks for figuring this out. I think we should rely on sems modules though for "official" Albany Land Ice nightlies, and stick as much as possible to compilers/environment tested by Trilinos.

ikalash commented 4 years ago

@gahansen : I'm still seeing failures in the nightlies

https://sems-cdash-son.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=2757 https://sems-cdash-son.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=2758

Do any changes need to be made to the nightly test scripts? I was under the impression all the changes were to the TPLs.

ikalash commented 4 years ago

@mperego : we could switch to the sems modules - at least for gcc, it would be easy, as I have a gcc build using the sems modules on my machine camobap. I inherited the CEE builds from Glen, so Glen would be the best one to speak to why sems modules were not used in the first place. It could be that they weren't around at the time the nightlies were set up. I did not switch them initially for three main reasons: (1) at the time I inherited them, I wasn't that familiar with the nightly test scripts and how they work (now I am); (2) I didn't know if there was a reason they were using the modules that they were (originally not sems but sierra/sparc modules); and (3) I did not have the time to figure out the builds with the sems modules.

ikalash commented 4 years ago

I should add, I don't think it matters much whether sems modules are used, or not, as long as we have compilers that are tested by Trilinos. Of course if we choose to build all the TPLs, there may be issues on our side (as in this case).

Also, with regard to using newer compilers, I can tell you from recent experience with upgrading my Fedora workstation (and Alejandro's Fedora machines) to v32, which uses gcc 10 and clang 10: those compilers are much more stringent and caught a couple of real problems in Albany and CISM that we had to fix. They were fairly innocuous bugs, which is why they did not show up with the older compilers or in the runs we were doing, but they were bugs nonetheless. I am not recommending that we be ahead of Trilinos in the nightlies, just commenting.

gahansen commented 4 years ago

Wow, this is a crazy problem. I'd have no reservations about switching to the sems scripts, but after battling this challenge for a while I'm not sure we wouldn't see the problem there too on this combination of hardware and software. We did our own builds back in the days before SEMS existed; that is the main reason things are currently the way they are. Also, due to our unique use of STK, the breadth of our applications, and the LCM project's unique needs, it has historically been helpful to have our own builds. As the project has changed, it is not clear that this advantage is still at play - unless we get the same issue when we switch to the SEMS modules on these machines.

Clearly, I'm still short on understanding what is at play here. I'd suggest that I grab the current nightly scripts and continue debugging this issue. It is encouraging that I am not seeing this in my own personal build, so it is just a matter of time to figure out what is going on in the nightlies themselves. I think it would serve us well to fully figure this out in case it crops up again down the road. Our TPLs are a little ahead of what sems is using; I suspect we will see this crop up again when they upgrade their modules unless we get to the bottom of it.

Another point to consider - this might be a hint that we should examine the function (and delicacy) of what we are doing down in the GenericFieldContainer. I've not looked at this critically - I've focused on the builds because this issue appeared to crop up in an isolated area when we updated the builds there. This issue might be suggesting that I should take a closer look at that code and see if something might be amiss in there.

ikalash commented 4 years ago

@gahansen : so when you updated the TPLs, I presume you created your own build of Albany on top of them and tested it. Did the tests pass? If they did, there is something wrong - the nightlies should reproduce the same behavior as an analogous build done from scratch using the regular configure scripts.

I don't think the SEMS modules will have this problem, since I use them on my RHEL7 machine w/o any issues. I can give it a try.

I'm all for looking at GenericFieldContainer to try to understand what's going on, if you're willing to dig into STK (something most of us try to avoid in general, I think).

gahansen commented 4 years ago

@ikalash This sounds like a plan. I suspect the nightly is also picking up some older library from somewhere, like my builds were. I didn't isolate which one it was - my current theory is that there was some change in the pnetcdf, netcdf, hdf5 group that caused this - but I'm still not sure why it only shows up with newer gcc builds and not newer clang builds. It might even be some old RHEL6 system library that didn't get removed when those systems were upgraded. I also tried the newer sierra compilers; those worked fine. There is a lot of complexity in the interplay of this software, unfortunately, and I didn't want to spend too long figuring that all out. I'll take a look at the nightly environment now and find out what is causing this. I'll also look at the STK code and the code around this issue while I can still reproduce the failure.

mperego commented 4 years ago

I agree that there are some advantages in having our own libraries or using compilers that are different from those used by Trilinos for testing, and I understand the historical reasons why we are doing that. However, I think it is more sustainable, for the levels of funding we have, to rely as much as possible on sems modules and to use the compilers Trilinos tests with. Using sems modules would save us some time in sorting out TPL library issues, and it is also easier to reproduce errors because they are the same on several SNL machines. Using the same compilers used by Trilinos for testing would prevent us from having to fix Albany builds every time some change done in Trilinos is not compatible with our compilers. Of course there are instances where we have to use a different environment and compilers (e.g. on Cori or on some new GPU architecture we want to do research on).

As for the specific error in this issue, it makes sense to get to the bottom of it, but in general I think we should slowly start steering in the direction described above.

ikalash commented 4 years ago

@gahansen : can you please clarify one thing - are the tests failing in your build of Albany?

gahansen commented 4 years ago

I have two failing tests on the release build:

43: Thermal1D_with_Source
45: Thermal2D_with_Source
168: LANDICE_FO_MMS_perf
169: LANDICE_FO_MMS_perf_2

Tests 43 and 45 are failing with a seg fault right after creating the stepper and integrator in Tempus. 168 and 169 did not run, as I've not put cee-compute024 in the file.

On the debug build I see:

137: FO_GIS_Gis20km_Tpetra
168: LANDICE_FO_MMS_perf
169: LANDICE_FO_MMS_perf_2

137 fails due to a timeout (the debug build runs too slowly and hits the time limit). Again, the performance tests did not run.

ikalash commented 4 years ago

I switched the nightlies to use the sems gcc 9.2.0 compiler, and the issue went away. Closing.

mperego commented 4 years ago

> I switched the nightlies to use the sems gcc 9.2.0 compiler, and the issue went away. Closing.

Great! Irina, I forgot to mention this yesterday: can we modify one of the nightly builds so that we only build LandIce (i.e. with DemoPDEs off)? There was an error with Thermal tests being enabled even when DemoPDEs were disabled; it would have been caught by the nightly builds if we had disabled the Demo PDEs. Thanks!

ikalash commented 4 years ago

@mperego : there actually is already a build like this, namely the Cori build. However, Albany tests are turned off in that build, which I think is why that Thermal issue slipped through the cracks. I could make one of the other existing builds an ALI-only build that runs tests, sure. I think I will do that on my RHEL7 workstation, mockba.

mperego commented 4 years ago

> @mperego : there actually is already a build like this, namely the Cori build. However, Albany tests are turned off in that build, which I think is why that Thermal issue slipped through the cracks. I could make one of the other existing builds an ALI-only build that runs tests, sure. I think I will do that on my RHEL7 workstation, mockba.

Thanks! Yes, it was in fact a run-time error.