sandialabs / Albany

Sandia National Laboratories' Albany multiphysics code

Failing FO_Thermo_Wet_Bed on some platforms due to sensitivities #504

Closed · ikalash closed this 4 years ago

ikalash commented 5 years ago

The FO_Thermo_Wet_Bed test is failing in some of the nightlies:

http://cdash.sandia.gov/CDash-2-3-0/testDetails.php?test=4725617&build=86690
http://cdash.sandia.gov/CDash-2-3-0/testDetails.php?test=4726381&build=86696

Sensitivities (0,0):
    3.016072173124e+02    0.000000000000e+00    7.908825974575e+00    3.377727008814e+01
Sensitivity Test 0,0: 3.016072173124e+02 != 6.032144346249e+02 (rel 1.000000000000e-04 abs 1.000000000000e-08)

CheckTestResults: Number of Comparisons Attempted = 8

Main_Solve: MeanValue of final solution 8.355700637671e-01

Number of Failed Comparisons: 1

It looks like some of the sensitivities are diffing. @mperego, @bartgol , can you please have a look?
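For reference, the failing comparison above is presumably a combined relative/absolute tolerance check along these lines (a hypothetical sketch; `checkRelAbs` is not Albany's actual routine, just an illustration of the `(rel ... abs ...)` line in the log):

```c++
#include <cmath>
#include <cstdio>

// Sketch of the rel/abs comparison implied by the log line
// "3.016072173124e+02 != 6.032144346249e+02 (rel 1e-04 abs 1e-08)".
bool checkRelAbs(double computed, double expected, double relTol, double absTol) {
  // Pass if the difference is within absTol plus relTol scaled by the
  // expected value's magnitude.
  return std::abs(computed - expected) <= absTol + relTol * std::abs(expected);
}

int main() {
  // The failing sensitivity from the nightly log: the computed value is
  // roughly half the expected one, far outside a 1e-4 relative tolerance.
  if (!checkRelAbs(3.016072173124e+02, 6.032144346249e+02, 1e-4, 1e-8))
    std::printf("comparison fails, as in the nightly\n");
  return 0;
}
```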

mperego commented 5 years ago

@ikalash I'll look at that when I have some time. For some reason it works on my workstation. By the way, it seems that this run does not build an updated version of Albany: http://cdash.sandia.gov/CDash-2-3-0/viewTest.php?onlypassed&buildid=86709

ikalash commented 5 years ago

@mperego : thanks! I just fixed the skybridge build - thanks for calling it to my attention.

Yes, the failure does not occur everywhere. It happens in the AlbanyIntel and Albany64Bit (gcc) builds on CEE. It'll be interesting to see if it fails in the skybridge build tomorrow now that I fixed it.

ikalash commented 5 years ago

@mperego : do you know when you might have a chance to look at this? It is still occurring on some platforms, e.g. skybridge, blake, mayer.

mperego commented 5 years ago

Fixed by commit 3fb2ed12fd328ddaa59def1a115547b086e03396.

@ikalash It seems that the skybridge build does not update the repo correctly.

ikalash commented 5 years ago

Thanks for fixing it! Can you please open an issue for the skybridge repo-update problem? I'll look at it when I return. I feel like I've resolved this once before, so I'm not sure what's going on.

ikalash commented 5 years ago

It looks like this issue is still there on Mayer; it slipped through the cracks due to the other failures there: https://my.cdash.org/testDetails.php?test=49365108&build=1692039

Could you please have a look, @mperego? I don't think the problem is that the code isn't getting updated correctly, as was the case on skybridge.

ikalash commented 5 years ago

Have you had a chance to look at this @mperego ? It's only happening on mayer now - not sure why.

mperego commented 5 years ago

Sorry I missed that. I'll look into it.

ikalash commented 5 years ago

No worries - thanks!

bartgol commented 5 years ago

@mperego I looked at your commit that changed the sensitivities test values, and I saw that the old value is precisely the value computed on mayer, while the new one is the one computed on all other platforms. By democracy rules, the new value you set is the right one, but I wonder if mayer gives different results due to some configuration option (in either albany or trilinos)?

bartgol commented 5 years ago

Update: I ran the test manually on mayer, and it passes. I am now recompiling with the trilinos installation used in the nightlies. If it still passes, I'll move on to the albany options inspection.

bartgol commented 5 years ago

Ah, building against the trilinos installation used in the nightlies causes the test to fail. The question is: is it because of modules or trilinos options? I lean toward the latter. I compiled trilinos with OpenMP enabled, while the nightlies use Serial. I'm going to check if this is the culprit by compiling my trilinos with only Serial.

If that's the case though, we need to figure out where the race condition is...

bartgol commented 5 years ago

I can confirm that results differ depending on whether you enable OpenMP or Serial in trilinos.

Curiously enough though, the sensitivity test that fails for me is not the same one: in the nightlies we have a mismatch in the first sensitivity (6.032144346252e+02 != 3.016072359141e+02), while I get the first sensitivity correct (not bit-for-bit, but close enough to pass) and get the last one wrong (1.581765053419e+01 != 7.908789961152e+00). Again, the wrong value seems to be roughly twice as large as the correct one. I speculate we have an issue with locally replicated vector spaces in the responses (something related to Trilinos/trilinos#4663).

bartgol commented 5 years ago

I'm pretty confident this is related to replicated vector space issues, since the code works with 1 rank. Unfortunately I don't have a fix, because thyra+tpetra only offer one of these two options:

1) create a locally replicated vector space: this preserves the "global" dimension of the vs, but performs a reduction every time you do a mat-vec;

2) create a distributed vector space: this does not perform a reduction when you do a mat-vec, but the global dimension of the vs is wrong (it's basically dim*numRanks), which can cause issues in the actual mat-vec.
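For concreteness, here is a minimal sketch of the two options at the Tpetra map level (illustrative code based on my reading of the Tpetra API, not the actual Thyra/Albany wiring):

```c++
#include <Tpetra_Core.hpp>
#include <Tpetra_Map.hpp>
#include <Teuchos_RCP.hpp>
#include <Teuchos_OrdinalTraits.hpp>
#include <iostream>

int main(int argc, char** argv) {
  Tpetra::ScopeGuard tpetraScope(&argc, &argv);
  {
    auto comm = Tpetra::getDefaultComm();
    using map_t = Tpetra::Map<>;
    const Tpetra::global_size_t n = 4;  // dimension of the (small) response space

    // Option 1: locally replicated map. Every rank owns all n entries and the
    // global dimension is n, but keeping the replicated copies consistent
    // forces a reduction on every mat-vec.
    auto replicated = Teuchos::rcp(new map_t(n, 0, comm, Tpetra::LocallyReplicated));

    // Option 2: "distributed" map with n entries per rank. No reduction is
    // needed, but the reported global dimension is n * numRanks.
    const auto INV = Teuchos::OrdinalTraits<Tpetra::global_size_t>::invalid();
    auto distributed = Teuchos::rcp(new map_t(INV, n, 0, comm));

    if (comm->getRank() == 0) {
      std::cout << "replicated global dim:  " << replicated->getGlobalNumElements() << "\n"
                << "distributed global dim: " << distributed->getGlobalNumElements()
                << " (n * numRanks)\n";
    }
  }
  return 0;
}
```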

mperego commented 5 years ago

@bartgol Oddly enough, I remember that all I had to do to fix the sensitivity and get the regression-test value on most of the machines (apparently not on Mayer) was to tighten the tolerances. But yes, I agree there is something fishy here.

bartgol commented 5 years ago

Even more odd: the test passes in debug mode. I am at a loss. I don't think the replicated vector space is an issue (or else we'd see it elsewhere too).

bartgol commented 5 years ago

Ok, here's my analysis of the test failure:

tl;dr do not use sensitivities of solution max/min values in regression tests.

The test generates slightly different numbers in the solution when built with OpenMP or Serial in Trilinos. The numbers are very, very similar though. For instance, the solution max value of one of the equations is 1.35171933718234261 for an OpenMP build vs 1.35171933718234616 for a Serial build. That is, 14 decimal digits match, which means they are basically the same solution. How the differences are generated is not really our concern, since it probably happens in Trilinos (perhaps some default values are slightly different in the two cases, who knows). In any case, the numbers are "fine".
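As an aside, one plausible source of such last-digit differences (my speculation, not established in this thread) is that threaded reductions sum in a different order than serial loops, and floating-point addition is not associative:

```c++
#include <cstdio>

// Floating-point addition is not associative, so a reduction that sums in a
// different order (e.g. an OpenMP tree reduction vs a serial left-to-right
// loop) can legitimately differ in the last few digits.
int main() {
  const double a = 1.0e16, b = -1.0e16, c = 1.0;
  std::printf("(a + b) + c = %.17g\n", (a + b) + c);  // prints 1
  std::printf("a + (b + c) = %.17g\n", a + (b + c));  // prints 0
  return 0;
}
```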

Where's the problem? The problem comes when computing sensitivities of certain responses. In particular, SolutionMaxValue is a response that sets dgdx to 1 at dof j if x[j]==max(x), and 0 otherwise. This is a problem, since slight variations in x may generate "very different" response sensitivities. This is indeed what happens here: in one build we only get 1 dof where x==max(x), while in the other build we get 2 such dofs (globally). This changes the 2-norm of dgdx from 1 to sqrt(2). When dgdx is then combined with the adjoint to recover dgdp, we get two completely different values. It's worth noticing that the "extra dof" satisfying x==max(x) in the serial build is a "near hit" in the openmp build, meaning that in that build |x-max(x)| < 1e-14, which explains why a slight change in numbers can trigger the extra hit.
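Here is a minimal standalone sketch of that mechanism (hypothetical code, not Albany's actual SolutionMaxValue evaluator), using the max values quoted above:

```c++
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Sketch of a SolutionMaxValue-style response gradient:
// dgdx[j] = 1 if x[j] == max(x), else 0. A roundoff-level perturbation in a
// single entry changes how many dofs hit the max exactly, and hence ||dgdx||_2.
std::vector<double> maxValueGradient(const std::vector<double>& x) {
  const double xmax = *std::max_element(x.begin(), x.end());
  std::vector<double> dgdx(x.size(), 0.0);
  for (std::size_t j = 0; j < x.size(); ++j)
    if (x[j] == xmax) dgdx[j] = 1.0;  // exact comparison, as described above
  return dgdx;
}

double norm2(const std::vector<double>& v) {
  double s = 0.0;
  for (double e : v) s += e * e;
  return std::sqrt(s);
}

int main() {
  // "Serial" build: two dofs tie for the max exactly.
  std::vector<double> xSerial = {1.0, 1.35171933718234616, 1.35171933718234616};
  // "OpenMP" build: the second candidate is only a near hit (|x-max(x)| ~ 1e-15).
  std::vector<double> xOpenMP = {1.0, 1.35171933718234616, 1.35171933718234261};

  std::printf("||dgdx||_2 serial: %g\n", norm2(maxValueGradient(xSerial)));  // sqrt(2)
  std::printf("||dgdx||_2 openmp: %g\n", norm2(maxValueGradient(xOpenMP)));  // 1
  return 0;
}
```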

So, in short: it's OK to check the value of the max/min of the solution, but the sensitivity of that response can be drastically sensitive to roundoff differences (at least when computed via adjoints; it may be fine when computed as forward sensitivities).

ikalash commented 4 years ago

Seems this issue has gone away, so closing.