sandialabs / Albany

Sandia National Laboratories' Albany multiphysics code

corePDEs_SideSetLaplacian_3D failing on weaver after epetra removal #1030

Open jewatkins opened 8 months ago

jewatkins commented 8 months ago

https://sems-cdash-son.sandia.gov/cdash/test/4156291

It looks like weaver runs more tests after https://github.com/sandialabs/Albany/pull/1028 was merged (111 -> 125), so maybe this case never ran there before?

@mperego Is this a case that ran with Epetra and never ran with Tpetra? Were you planning to look into it? Anything obvious that shouldn't work on device? If not, @mcarlson801 can try to see what's going on.

mperego commented 8 months ago

@jewatkins That's correct: this test was running with Epetra and never ran with Tpetra. I briefly looked into it but could not figure out the issue. If @mcarlson801 is willing to look into it, that would be great.

ikalash commented 4 months ago

This test is failing now in the OpenMP build as well: https://sems-cdash-son.sandia.gov/cdash//test/5973004 . It looks like the comparisons are failing because the computed response value is 0. The Cuda build is failing in the same way: https://sems-cdash-son.sandia.gov/cdash//test/5969262

mcarlson801 commented 4 months ago

I did a little bit of debugging on this, and apparently the field (x_data) coming into GatherSolution and GatherSolution_Side is 0 everywhere. I'm not very familiar with how this problem is set up, so I'm not sure what might cause this. @bartgol It sounds like you might be familiar with this test; any ideas how this could happen?
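
For reference, a minimal sketch of the kind of check I ran (names are illustrative, not the actual hook point in Albany; `x` stands for whatever Tpetra solution vector GatherSolution receives):

```cpp
#include <Tpetra_Vector.hpp>
#include <iostream>

// Hypothetical debug hook: print the global 2-norm of the solution
// vector right before the gather, to confirm whether x really is
// 0 everywhere.
void printSolutionNorm (const Tpetra::Vector<>& x)
{
  // norm2() is a collective call returning the global 2-norm.
  std::cout << "||x||_2 = " << x.norm2 () << std::endl;
  // A norm of 0 past the first nonlinear iteration would confirm
  // that the solve never updates the solution.
}
```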

bartgol commented 4 months ago

That's odd. It should only happen at the first iteration, where the initial guess is 0. The NaN in the solver is likely preventing the solution from ever changing, so it stays 0. I'm guessing the problem is that some entry of the Jacobian that should be 1 is actually kept at 0. Maybe something is amiss with the diagonal terms. I can dig in quickly.
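
One quick way to test that hypothesis would be something like the sketch below (hypothetical diagnostic, not Albany code; `jac` stands for whatever Tpetra::CrsMatrix the problem assembles):

```cpp
#include <Tpetra_CrsMatrix.hpp>
#include <Tpetra_Vector.hpp>
#include <iostream>

// Copy the Jacobian diagonal into a vector and count zero entries.
// Rows fixed by a side-set/Dirichlet condition should carry 1 on the
// diagonal, so any zeros here would support the theory above.
void countZeroDiagEntries (const Tpetra::CrsMatrix<>& jac)
{
  Tpetra::Vector<> diag (jac.getRowMap ());
  jac.getLocalDiagCopy (diag);   // pull the diagonal into a vector

  auto data = diag.getLocalViewHost (Tpetra::Access::ReadOnly);
  int nzeros = 0;
  for (size_t i = 0; i < data.extent (0); ++i) {
    if (data(i,0) == 0.0) ++nzeros;
  }
  std::cout << "local zero diagonal entries: " << nzeros << std::endl;
}
```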

bartgol commented 4 months ago

It looks like this test passed in the OpenMP build last night. I just ran master on my workstation, and it works just fine. I don't see any relevant commit that went in yesterday, so I don't know what to make of this. As of now, CUDA is the only build that still shows the error.

The fact that CUDA consistently fails may suggest an issue with row/col GIDs for the side equation. Perhaps the same diagonal entry is set twice (once to 0 and once to 1), and depending on the order in which the two writes happen, it can end up with the right or wrong value. I don't have time for an in-depth debug today, and I leave for vacation on Friday, so feel free to disable the test until I get back if you'd like. When I'm back, I can debug some more.
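
To illustrate the suspected pattern, here is a minimal standalone Kokkos sketch (not Albany's assembly code): two device threads write different values to the same entry, and the surviving value is whichever write lands last, which is nondeterministic on CUDA but effectively fixed in a serial/host build. That would be consistent with CUDA failing while the same test passes elsewhere.

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main (int argc, char* argv[])
{
  Kokkos::initialize (argc, argv);
  {
    Kokkos::View<double*> diag ("diag", 1);
    // Both iterations write the same entry: thread 0 writes 0,
    // thread 1 writes 1. The final value is last-write-wins.
    Kokkos::parallel_for ("conflicting_writes", 2,
        KOKKOS_LAMBDA (const int i) {
      diag(0) = static_cast<double> (i);
    });
    auto h = Kokkos::create_mirror_view_and_copy (Kokkos::HostSpace (), diag);
    // May print 0 or 1 on a device backend; deterministically 1 in serial.
    std::printf ("diag(0) = %g\n", h(0));
  }
  Kokkos::finalize ();
  return 0;
}
```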

ikalash commented 4 months ago

@bartgol : thanks for looking into this. If you check the history of the test in the camobap OpenMP build, it seems it fails for a while, then passes for a while: https://sems-cdash-son.sandia.gov/cdash//test/5973004?graph=status . This suggests a heisenbug, which is disturbing. I would suspect the OpenMP issue is the same as the CUDA one, so if the CUDA one is fixed, hopefully things will be good for OpenMP as well.

I'm fine with either keeping or disabling the test. I will make it a point not to open any more duplicate issues about this test :).