lanl / singularity-eos

Performance portable equations of state and mixed cell closures
https://lanl.github.io/singularity-eos/
BSD 3-Clause "New" or "Revised" License

Fix Bug in PTE Solver and Several Code Quality Improvements #383

Closed: jhp-lanl closed this 3 months ago

jhp-lanl commented 4 months ago

PR Summary

This PR depends on https://github.com/lanl/singularity-eos/pull/382

This PR makes a few main changes:

PR Checklist

If preparing for a new release, please also check the following:

jhp-lanl commented 4 months ago

I'm still getting negative volume fractions even though I believe the code should be catching them. I'll add a unit test here shortly that demonstrates this and then work on whatever fix is needed to prevent it.

jhp-lanl commented 3 months ago

@Yurlungur and @dholladay00 this is ready for review now.

One comment: my test is aborting due to a double free or corruption (out), and I'm not quite sure why. I tried removing the PORTABLE_FREE statements, but that didn't change anything.

The test assertion passes, but this abort causes the test to fail. Any guidance on how to debug this?

jhp-lanl commented 3 months ago

I'm somewhat concerned that the PTE solver needs to run whether or not something goes wrong, so I worry about these PORTABLE_FAIL checks. I'd prefer warnings and returning a failure status.

Are you saying you'd prefer the PTE solver to fail at the point where an issue is reached rather than running and then returning a nonsense result?

Normally I'd agree with you, but in my experience debugging the PTE solver, I found the indirection and indexing difficult to navigate in the debugger, hence the print statements for when the PTE solver fails. I found it much easier to extract the bad state, reproduce it in a unit test, and then debug from there rather than debugging inside the host code.

That said, maybe there's a middle ground? I could add a compile-time option instead of just relying on debug mode. That way I could compile in release mode with the option enabled to find the bad state, and then compile in debug mode to actually find the point of failure in the PTE solve. What are your thoughts?
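
As a rough illustration of what such a compile-time option could look like (the macro name SINGULARITY_ENABLE_PTE_DIAGNOSTICS and the helper below are hypothetical, not what this PR actually adds), the idea is to compile the diagnostic prints in or out independently of the debug/release build type:

```cpp
#include <cstdio>

// Hypothetical option: define SINGULARITY_ENABLE_PTE_DIAGNOSTICS (e.g. via CMake)
// to keep the PTE failure prints even in an optimized release build.
#ifdef SINGULARITY_ENABLE_PTE_DIAGNOSTICS
#define PTE_DIAGNOSTIC_PRINT(...) std::printf(__VA_ARGS__)
#else
#define PTE_DIAGNOSTIC_PRINT(...) ((void)0)
#endif

// Illustrative check: report a bad state so it can be extracted into a unit test,
// but let the solver return a failure status instead of aborting.
inline bool CheckVolumeFraction(const double vfrac, const int mat) {
  if (vfrac <= 0.0) {
    PTE_DIAGNOSTIC_PRINT("PTE solver: non-positive volume fraction %g for material %d\n",
                         vfrac, mat);
    return false;
  }
  return true;
}
```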

Yurlungur commented 3 months ago

I'm somewhat concerned that the PTE solver needs to run whether or not something goes wrong, so I worry about these PORTABLE_FAIL checks. I'd prefer warnings and returning a failure status.

Are you saying you'd prefer the PTE solver to fail at the point where an issue is reached rather than running and then returning a nonsense result?

Actually the opposite. ;) I was hoping for the PTE solver to continue running even in debug mode, in case the failure state is something the host code needs to be robust to.

Normally I'd agree with you, but in my experience debugging the PTE solver, I found the indirection and indexing difficult to navigate in the debugger, hence the print statements for when the PTE solver fails. I found it much easier to extract the bad state, reproduce it in a unit test, and then debug from there rather than debugging inside the host code.

That makes complete sense. I have to admit I'm concerned by how difficult this was to navigate.

That said, maybe there's a middle ground? I could add a compile-time option instead of just relying on debug mode. That way I could compile in release mode with the option enabled to find the bad state, and then compile in debug mode to actually find the point of failure in the PTE solve. What are your thoughts?

That makes sense to me!

jhp-lanl commented 3 months ago

I'm somewhat concerned that the PTE solver needs to run whether or not something goes wrong, so I worry about these PORTABLE_FAIL checks. I'd prefer warnings and returning a failure status.

Are you saying you'd prefer the PTE solver to fail at the point where an issue is reached rather than running and then returning a nonsense result?

Actually the opposite. ;) I was hoping for the PTE solver to continue running even in debug mode, in case the failure state is something the host code needs to be robust to.

Ah, okay, I think we're on the same page then. I'll limit the assert failures to malformed inputs and rely on the checks in get_sg*. If we want, we can probably adapt those checks in the future to be an extra layer on the outputs of the PTE solver itself. I'll make an issue for that.

jhp-lanl commented 3 months ago

I added a Finalize() for each EOS as well as freeing all the memory, but I still get this failure:

double free or corruption (!prev)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
closure_unit_tests is a Catch2 v3.0.1 host application.
Run with -? for options

-------------------------------------------------------------------------------
Scenario: Density-Temperature PTE Solver
      Given: Equations of state
      Given: A difficult state for the density-temperature PTE solver
       Then: The PTE solver should converge
-------------------------------------------------------------------------------
/usr/projects/eap/users/jhp/XCAP/singularity-eos/github/singularity-eos/test/test_closure_pte.cpp:112
...............................................................................

/usr/projects/eap/users/jhp/XCAP/singularity-eos/github/singularity-eos/test/test_closure_pte.cpp:112: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  SIGABRT - Abort (abnormal termination) signal

===============================================================================
test cases: 1 | 1 failed
assertions: 2 | 1 passed | 1 failed

Aborted

This is true whether the test is run in isolation, as part of the suite, or as a standalone executable.

jhp-lanl commented 3 months ago

I think I figured out the issue. I forgot that PORTABLE_MALLOC has no concept of the type being allocated, so it needs the size in bytes rather than an element count.
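
For context, PORTABLE_MALLOC in ports-of-call behaves like malloc: it is untyped and takes a size in bytes, so allocating N elements means passing N * sizeof(T). A minimal sketch of the kind of fix this implies (the variable names here are illustrative, not the test's actual code):

```cpp
#include <ports-of-call/portability.hpp>

int main() {
  constexpr int NMAT = 3;

  // Wrong: this requests NMAT *bytes*, not NMAT doubles. The solver then writes
  // past the end of the allocation, and the heap corruption surfaces later as
  // "double free or corruption" when the memory is freed.
  // double *vfrac = static_cast<double *>(PORTABLE_MALLOC(NMAT));

  // Right: pass the size in bytes, just as with malloc.
  double *vfrac = static_cast<double *>(PORTABLE_MALLOC(NMAT * sizeof(double)));

  // ... hand vfrac to the PTE solver / device kernels ...

  PORTABLE_FREE(vfrac);
  return 0;
}
```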

jhp-lanl commented 3 months ago

@dholladay00 I modified the tests to address your comments. I also broke out some code into functions so I can start a collection of PTE states that fail to converge. I have my next state in there already, but I'll put it on the back burner for now and create an issue so we can get this merged in.

I'm worried that I might go down a rabbit hole of PTE issues if I keep running until nothing breaks. I'll probably scale back the ambitions of my EAP MR and provide simpler tests of the functionality.

dholladay00 commented 3 months ago

@jhp-lanl we have gone back and forth on the portability of the memory, but now that I look more closely, this test contains no device kernels; it is all host. Either we need to run the solve in a kernel (we could do a length-1 portable for, as sketched below) or make the memory host-only.
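
As a sketch of the length-1 kernel idea (assuming ports-of-call's portableFor and PORTABLE_LAMBDA; SolvePTE and its arguments below are placeholders for the real solver call, not this repository's API):

```cpp
#include <ports-of-call/portability.hpp>

// Placeholder for the device-callable PTE solve the test exercises.
PORTABLE_INLINE_FUNCTION bool SolvePTE(double * /*scratch*/) { return true; }

void RunSolveOnDevice(double *device_scratch, int *device_status) {
  // A length-1 loop: the body runs exactly once, but it runs on the device when a
  // device backend (e.g. Kokkos) is enabled and serially on the host otherwise, so
  // device-resident memory is touched in the right memory space.
  portableFor(
      "PTE solve", 0, 1, PORTABLE_LAMBDA(const int /*i*/) {
        device_status[0] = SolvePTE(device_scratch) ? 1 : 0;
      });
}
```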

jhp-lanl commented 3 months ago

The PTE test should be running on device now. I'll check the GitLab CI here in a bit to make sure it's running correctly (I might need somebody to run the Roci runners).

jhp-lanl commented 3 months ago

I think I got all the bugs worked out with running the test on device.

The test should be a good platform for adding tests that used to break the PTE solver (or still do while we're figuring out a solution). It can also serve as a bit of a template for using the new ports-of-call features to allocate on device without having to mess with lots of Kokkos views and #ifdef statements (see the sketch below).
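
A rough sketch of that pattern, assuming the ports-of-call primitives PORTABLE_MALLOC, PORTABLE_FREE, portableFor, and PORTABLE_LAMBDA; the point is that the same code path serves both host-only and device builds with no #ifdef blocks or explicit Kokkos views in the test itself:

```cpp
#include <ports-of-call/portability.hpp>

void SetUpScratch(const int nmat) {
  // One allocation call for every backend: device memory when a device backend is
  // enabled, ordinary host memory in a serial build.
  auto *vfrac = static_cast<double *>(PORTABLE_MALLOC(nmat * sizeof(double)));

  // Initialize in the memory space where the data actually lives, again without
  // any backend-specific branching.
  portableFor(
      "initialize volume fractions", 0, nmat,
      PORTABLE_LAMBDA(const int m) { vfrac[m] = 1.0 / nmat; });

  // ... run the PTE solver against vfrac, copy results back, check them ...

  PORTABLE_FREE(vfrac);
}
```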

dholladay00 commented 3 months ago

Great work, @jhp-lanl! This not only fixes a long-standing bug in the solver, but also uses the latest best practices for portability and adds improvements to the configure/build system. In addition, it provides a tool to explore problematic PTE inputs separate from the context of a host code!

Assuming this passes CI and pending review by those who understand the physics better than I do, this is ready! 👍

jhp-lanl commented 3 months ago

Assuming this passes CI and pending review by those who understand the physics better than I do, this is ready! 👍

I think I'm still having some issues with the CI on GitLab, unfortunately. I thought I had fixed these issues in my build when executing on device, but I'll keep working to reproduce them.

dholladay00 commented 3 months ago

Thanks for this, @jhp-lanl. Having unit tests here is a big improvement, as are the nasty bugs you caught. I still want @pdmullen and @jdolence's eyes on this, but I'm happy with it. I also ran this branch through riot and ran some strenuous test problems, which still look good.

@Yurlungur: Jeff (@jhp-lanl) and I discussed the timeline for this PR. Given the timeline for a release into EAP, we are planning on merging this no later than EOD Monday and beginning the release process immediately afterward. This gives some, but not much, time for review, but we really can't wait any longer than that. If fixes are needed beyond what is caught on Monday, they will need to go in after the release.

Yurlungur commented 3 months ago

Thanks for this, @jhp-lanl. Having unit tests here is a big improvement, as are the nasty bugs you caught. I still want @pdmullen and @jdolence's eyes on this, but I'm happy with it. I also ran this branch through riot and ran some strenuous test problems, which still look good.

@Yurlungur: Jeff (@jhp-lanl) and I discussed the timeline for this PR. Given the timeline for a release into EAP, we are planning on merging this no later than EOD Monday and beginning the release process immediately afterward. This gives some, but not much, time for review, but we really can't wait any longer than that. If fixes are needed beyond what is caught on Monday, they will need to go in after the release.

Yes, understood. I believe @pdmullen at least plans to review Monday. We shouldn't wait for @jdolence.

dholladay00 commented 3 months ago

Thanks for this, @jhp-lanl. Having unit tests here is a big improvement, as are the nasty bugs you caught. I still want @pdmullen and @jdolence's eyes on this, but I'm happy with it. I also ran this branch through riot and ran some strenuous test problems, which still look good.

@Yurlungur: Jeff (@jhp-lanl) and I discussed the timeline for this PR. Given the timeline for a release into EAP, we are planning on merging this no later than EOD Monday and beginning the release process immediately afterward. This gives some, but not much, time for review, but we really can't wait any longer than that. If fixes are needed beyond what is caught on Monday, they will need to go in after the release.

Yes, understood. I believe @pdmullen at least plans to review Monday. We shouldn't wait for @jdolence.

@Yurlungur I'd really like @jdolence's eyes on it at some point for a sanity check if at all possible, even if it's after the merge. Sometimes the original author has knowledge that is hard to obtain by reading alone.

jhp-lanl commented 3 months ago

Thanks for this, @jhp-lanl. Having unit tests here is a big improvement, as are the nasty bugs you caught. I still want @pdmullen and @jdolence's eyes on this, but I'm happy with it. I also ran this branch through riot and ran some strenuous test problems, which still look good.

@Yurlungur: Jeff (@jhp-lanl) and I discussed the timeline for this PR. Given the timeline for a release into EAP, we are planning on merging this no later than EOD Monday and beginning the release process immediately afterward. This gives some, but not much, time for review, but we really can't wait any longer than that. If fixes are needed beyond what is caught on Monday, they will need to go in after the release.

That said, I still need to fix the CI. I'll work on that this afternoon.

Yurlungur commented 3 months ago

@Yurlungur I'd really like @jdolence's eyes on it at some point for a sanity check if at all possible, even if it's after the merge. Sometimes the original author has knowledge that is hard to obtain by reading alone.

I agree. My concern is I don't know when I will be able to get him to look. We were visiting him in Ann Arbor this week and I asked him to look, though. So there's hope.

jhp-lanl commented 3 months ago

@Yurlungur I'd really like @jdolence's eyes on it at some point for a sanity check if at all possible, even if it's after the merge. Sometimes the original author has knowledge that is hard to obtain by reading alone.

I agree. My concern is I don't know when I will be able to get him to look. We were visiting him in Ann Arbor this week and I asked him to look, though. So there's hope.

I think his input will be valuable whenever we receive it.

jhp-lanl commented 3 months ago

That said, I still need to fix the CI. I'll work on that this afternoon.

CI fixed! An array was out of bounds, and I just bit the bullet and enabled proxy variables for all my terminal sessions rather than limiting them to interactive ones.

jhp-lanl commented 3 months ago

I think this is ready to go now.

Yurlungur commented 3 months ago

:+1: Merge when ready

jhp-lanl commented 3 months ago

Merging... @dholladay00, I think we're ready for a release.