lanl / singularity-eos

Performance portable equations of state and mixed cell closures
https://lanl.github.io/singularity-eos/
BSD 3-Clause "New" or "Revised" License

Fix Bug in PTE Solver and Several Code Quality Improvements #383

Closed: jhp-lanl closed this 3 months ago

jhp-lanl commented 4 months ago

PR Summary

This PR depends on https://github.com/lanl/singularity-eos/pull/382

This PR makes a few main changes:

PR Checklist

If preparing for a new release, please also check the following:

jhp-lanl commented 4 months ago

I'm still getting negative volume fractions even though I believe the code should be catching them. I'll add a unit test here shortly that demonstrates this and then work on whatever fix is needed to prevent it.

jhp-lanl commented 3 months ago

@Yurlungur and @dholladay00 this is ready for review now.

One comment: my test is aborting due to a double free or corruption (out), and I'm not quite sure why. I tried removing the PORTABLE_FREE statements, but that didn't change anything.

The test assertion passes, but this abort causes the test to fail. Any guidance on how to debug this?

jhp-lanl commented 3 months ago

I'm somewhat concerned that the PTE solver needs to run whether or not something goes wrong, so I worry about these PORTABLE_FAIL checks. I'd prefer warnings and returning a failure status.

Are you saying you'd prefer the PTE solver to fail at the point where an issue is reached rather than running and then returning a nonsense result?

Normally I'd agree with you, but in my experience debugging the PTE solver, I found the indirection and indexing difficult to navigate in the debugger, hence the print statements for when the PTE solver fails. I found it much easier to extract the bad state, reproduce it in a unit test, and then debug from there rather than debugging inside the host code.

That said, maybe there's a middle ground? I could add a compile-time option instead of just relying on debug mode. That way I could compile in release mode with the option enabled to find the bad state, and then compile in debug mode to actually find the point of failure in the PTE solve. What are your thoughts?
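
As a rough illustration of what such a compile-time option could look like (the macro name SINGULARITY_ENABLE_PTE_DIAGNOSTICS and the helper below are hypothetical, not what this PR actually adds), the idea is to compile the diagnostic prints in or out independently of the debug/release build type:

```cpp
#include <cstdio>

// Hypothetical option: define SINGULARITY_ENABLE_PTE_DIAGNOSTICS (e.g. via CMake)
// to keep the PTE failure prints even in an optimized release build.
#ifdef SINGULARITY_ENABLE_PTE_DIAGNOSTICS
#define PTE_DIAGNOSTIC_PRINT(...) std::printf(__VA_ARGS__)
#else
#define PTE_DIAGNOSTIC_PRINT(...) ((void)0)
#endif

// Illustrative check: report a bad state so it can be extracted into a unit test,
// but let the solver return a failure status instead of aborting.
inline bool CheckVolumeFraction(const double vfrac, const int mat) {
  if (vfrac <= 0.0) {
    PTE_DIAGNOSTIC_PRINT("PTE solver: non-positive volume fraction %g for material %d\n",
                         vfrac, mat);
    return false;
  }
  return true;
}
```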

Yurlungur commented 3 months ago

I'm somewhat concerned that the PTE solver needs to run whether or not something goes wrong, so I worry about these PORTABLE_FAIL checks. I'd prefer warnings and returning a failure status.

Are you saying you'd prefer the PTE solver to fail at the point where an issue is reached rather than running and then returning a nonsense result?

Actually the opposite. ;) I was hoping for the PTE solver to continue running even in debug mode, in case the failure state is something the host code needs to be robust to.

Normally I'd agree with you, but in my experience debugging the PTE solver, I found the indirection and indexing difficult to navigate in the debugger, hence the print statements for when the PTE solver fails. I found it much easier to extract the bad state, reproduce it in a unit test, and then debug from there rather than debugging inside the host code.

That makes complete sense. I have to admit I'm concerned by how difficult this was to navigate.

That said, maybe there's a middle ground? I could add a compile-time option instead of just relying on debug mode. That way I could compile in release mode with the option enabled to find the bad state, and then compile in debug mode to actually find the point of failure in the PTE solve. What are your thoughts?

That makes sense to me!

jhp-lanl commented 3 months ago

I'm somewhat concerned that the PTE solver needs to run whether or not something goes wrong, so I worry about these PORTABLE_FAIL checks. I'd prefer warnings and returning a failure status.

Are you saying you'd prefer the PTE solver to fail at the point where an issue is reached rather than running and then returning a nonsense result?

Actually the opposite. ;) I was hoping for the PTE solver to continue running even in debug mode, in case the failure state is something the host code needs to be robust to.

Ah, okay, I think we're on the same page then. I'll limit the assert failures to malformed inputs and rely on the checks in get_sg*. If we want, we can probably adapt those checks in the future to be an extra layer on the outputs of the PTE solver itself. I'll make an issue for that.

jhp-lanl commented 3 months ago

I added a Finalize() for each EOS as well as freeing all the memory, but I still get this failure:

double free or corruption (!prev)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
closure_unit_tests is a Catch2 v3.0.1 host application.
Run with -? for options

-------------------------------------------------------------------------------
Scenario: Density-Temperature PTE Solver
      Given: Equations of state
      Given: A difficult state for the density-temperature PTE solver
       Then: The PTE solver should converge
-------------------------------------------------------------------------------
/usr/projects/eap/users/jhp/XCAP/singularity-eos/github/singularity-eos/test/test_closure_pte.cpp:112
...............................................................................

/usr/projects/eap/users/jhp/XCAP/singularity-eos/github/singularity-eos/test/test_closure_pte.cpp:112: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  SIGABRT - Abort (abnormal termination) signal

===============================================================================
test cases: 1 | 1 failed
assertions: 2 | 1 passed | 1 failed

Aborted

This is true whether the test is run in isolation, as part of the suite, or as a standalone executable.

jhp-lanl commented 3 months ago

I think I figured out the issue. I forgot that PORTABLE_MALLOC has no concept of the type being allocated, so it needs the size in bytes rather than an element count.
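
For context, PORTABLE_MALLOC in ports-of-call behaves like malloc: it is untyped and takes a size in bytes, so allocating N elements means passing N * sizeof(T). A minimal sketch of the kind of fix this implies (the variable names here are illustrative, not the test's actual code):

```cpp
#include <ports-of-call/portability.hpp>

int main() {
  constexpr int NMAT = 3;

  // Wrong: this requests NMAT *bytes*, not NMAT doubles. The solver then writes
  // past the end of the allocation, and the heap corruption surfaces later as
  // "double free or corruption" when the memory is freed.
  // double *vfrac = static_cast<double *>(PORTABLE_MALLOC(NMAT));

  // Right: pass the size in bytes, just as with malloc.
  double *vfrac = static_cast<double *>(PORTABLE_MALLOC(NMAT * sizeof(double)));

  // ... hand vfrac to the PTE solver / device kernels ...

  PORTABLE_FREE(vfrac);
  return 0;
}
```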

jhp-lanl commented 3 months ago

@dholladay00 I modified the tests to address your comments. I also broke out some code into functions so I can start a collection of PTE states that fail to converge. I have my next state in there already, but I'll put it on the back burner for now and create an issue so we can get this merged in.

I'm worried that I might go down a rabbit hole of PTE issues if I keep running until nothing breaks. I'll probably scale back the ambitions of my EAP MR and provide simpler tests of the functionality.

dholladay00 commented 3 months ago

@jhp-lanl we have gone back and forth on the portability of the memory, but now that I look more closely, this test contains no device kernels; it is all host. Either we need to run the solve in a kernel (we could do a length-1 portable for, as sketched below) or make the memory host-only.
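
As a sketch of the length-1 kernel idea (assuming ports-of-call's portableFor and PORTABLE_LAMBDA; SolvePTE and its arguments below are placeholders for the real solver call, not this repository's API):

```cpp
#include <ports-of-call/portability.hpp>

// Placeholder for the device-callable PTE solve the test exercises.
PORTABLE_INLINE_FUNCTION bool SolvePTE(double * /*scratch*/) { return true; }

void RunSolveOnDevice(double *device_scratch, int *device_status) {
  // A length-1 loop: the body runs exactly once, but it runs on the device when a
  // device backend (e.g. Kokkos) is enabled and serially on the host otherwise, so
  // device-resident memory is touched in the right memory space.
  portableFor(
      "PTE solve", 0, 1, PORTABLE_LAMBDA(const int /*i*/) {
        device_status[0] = SolvePTE(device_scratch) ? 1 : 0;
      });
}
```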

jhp-lanl commented 3 months ago

The PTE test should be running on device now. I'll check the GitLab CI here in a bit to make sure it's running correctly (I might need somebody to run the Roci runners).

jhp-lanl commented 3 months ago

I think I got all the bugs worked out with running the test on device.

The test should be a good platform for adding tests that used to break the PTE solver (or still do while we're figuring out a solution). It can also serve as a bit of a template for using the new ports-of-call features to allocate on device without having to mess with lots of Kokkos views and #ifdef statements (see the sketch below).
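
A rough sketch of that pattern, assuming the ports-of-call primitives PORTABLE_MALLOC, PORTABLE_FREE, portableFor, and PORTABLE_LAMBDA; the point is that the same code path serves both host-only and device builds with no #ifdef blocks or explicit Kokkos views in the test itself:

```cpp
#include <ports-of-call/portability.hpp>

void SetUpScratch(const int nmat) {
  // One allocation call for every backend: device memory when a device backend is
  // enabled, ordinary host memory in a serial build.
  auto *vfrac = static_cast<double *>(PORTABLE_MALLOC(nmat * sizeof(double)));

  // Initialize in the memory space where the data actually lives, again without
  // any backend-specific branching.
  portableFor(
      "initialize volume fractions", 0, nmat,
      PORTABLE_LAMBDA(const int m) { vfrac[m] = 1.0 / nmat; });

  // ... run the PTE solver against vfrac, copy results back, check them ...

  PORTABLE_FREE(vfrac);
}
```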

dholladay00 commented 3 months ago

Great work, @jhp-lanl! This not only fixes a long-standing bug in the solver, but also uses the latest best practices for portability and adds improvements to the configure/build system. In addition, it provides a tool to explore problematic PTE inputs separate from the context of a host code!

Assuming this passes CI and pending review by those who understand the physics better than I do, this is ready! 👍

jhp-lanl commented 3 months ago

Assuming this passes CI and pending review by those who understand the physics better than I do, this is ready! 👍

I think I'm still having some issues with the CI on GitLab, unfortunately. I thought I had fixed these issues in my build when executing on device, but I'll keep working to reproduce them.

dholladay00 commented 3 months ago

Thanks for this, @jhp-lanl. Having unit tests here is a big improvement, as are the nasty bugs you caught. I still want @pdmullen and @jdolence's eyes on this, but I'm happy with it. I also ran this branch through riot and ran some strenuous test problems, which still look good.

@Yurlungur: Jeff (@jhp-lanl) and I discussed the timeline for this PR. Given the timeline for a release into EAP, we are planning on merging this no later than EOD Monday and beginning the release process immediately afterward. This gives some, but not much, time for review, but we really can't wait any longer than that. If fixes are needed beyond what is caught on Monday, they will need to go in after the release.

Yurlungur commented 3 months ago

Thanks for this, @jhp-lanl. Having unit tests here is a big improvement, as are the nasty bugs you caught. I still want @pdmullen and @jdolence's eyes on this, but I'm happy with it. I also ran this branch through riot and ran some strenuous test problems, which still look good.

@Yurlungur: Jeff (@jhp-lanl) and I discussed the timeline for this PR. Given the timeline for a release into EAP, we are planning on merging this no later than EOD Monday and beginning the release process immediately afterward. This gives some, but not much, time for review, but we really can't wait any longer than that. If fixes are needed beyond what is caught on Monday, they will need to go in after the release.

Yes, understood. I believe @pdmullen at least plans to review Monday. We shouldn't wait for @jdolence.

dholladay00 commented 3 months ago

Thanks for this, @jhp-lanl. Having unit tests here is a big improvement, as are the nasty bugs you caught. I still want @pdmullen and @jdolence's eyes on this, but I'm happy with it. I also ran this branch through riot and ran some strenuous test problems, which still look good.

@Yurlungur: Jeff (@jhp-lanl) and I discussed the timeline for this PR. Given the timeline for a release into EAP, we are planning on merging this no later than EOD Monday and beginning the release process immediately afterward. This gives some, but not much, time for review, but we really can't wait any longer than that. If fixes are needed beyond what is caught on Monday, they will need to go in after the release.

Yes, understood. I believe @pdmullen at least plans to review Monday. We shouldn't wait for @jdolence.

@Yurlungur I'd really like @jdolence's eyes on it at some point for a sanity check if at all possible, even if it's after the merge. Sometimes the original author has knowledge that is hard to obtain by reading alone.

jhp-lanl commented 3 months ago

Thanks for this, @jhp-lanl. Having unit tests here is a big improvement, as are the nasty bugs you caught. I still want @pdmullen and @jdolence's eyes on this, but I'm happy with it. I also ran this branch through riot and ran some strenuous test problems, which still look good.

@Yurlungur: Jeff (@jhp-lanl) and I discussed the timeline for this PR. Given the timeline for a release into EAP, we are planning on merging this no later than EOD Monday and beginning the release process immediately afterward. This gives some, but not much, time for review, but we really can't wait any longer than that. If fixes are needed beyond what is caught on Monday, they will need to go in after the release.

That said, I still need to fix the CI. I'll work on that this afternoon.

Yurlungur commented 3 months ago

@Yurlungur I'd really like @jdolence's eyes on it at some point for a sanity check if at all possible, even if it's after the merge. Sometimes the original author has knowledge that is hard to obtain by reading alone.

I agree. My concern is I don't know when I will be able to get him to look. We were visiting him in Ann Arbor this week and I asked him to look, though. So there's hope.

jhp-lanl commented 3 months ago

@Yurlungur I'd really like @jdolence's eyes on it at some point for a sanity check if at all possible, even if it's after the merge. Sometimes the original author has knowledge that is hard to obtain by reading alone.

I agree. My concern is I don't know when I will be able to get him to look. We were visiting him in Ann Arbor this week and I asked him to look, though. So there's hope.

I think his input will be valuable whenever we receive it.

jhp-lanl commented 3 months ago

That said, I still need to fix the CI. I'll work on that this afternoon.

CI fixed! An array was out of bounds, and I just bit the bullet and enabled proxy variables for all my terminal sessions rather than limiting them to interactive ones.

jhp-lanl commented 3 months ago

I think this is ready to go now.

Yurlungur commented 3 months ago

:+1: Merge when ready

jhp-lanl commented 3 months ago

Merging... @dholladay00, I think we're ready for a release.