Closed jhp-lanl closed 3 months ago
I'm still getting negative volume fractions despite the fact that I believe the code should be catching this. I'll add a unit test here shortly that should demonstrate this and then work on whatever fix is needed to prevent this.
@Yurlungur and @dholladay00 this is ready for review now.
One comment: my test is aborting due to a double free or corruption (out)
and I'm not quite sure why. I tried removing the PORTABLE_FREE
statements, but that didn't change anything.
The test assertion passes, but this abort causes the test to fail. Any guidance on how to debug this?
I'm somewhat concerned that the PTE solver needs to run whether or not something goes wrong, so I worry about about these PORTABLE_FAIL checks. I'd prefer warnings and returning a failure status.
Are you saying you'd prefer the PTE solver to fail at the point where an issue is reached rather than running and then returning a nonsense result?
Normally I'd agree with you, but in my experience of debugging the PTE solver, I found the indirection and indexing to be difficult to navigate in the debugger, hence creating the print statements for when the PTE solver fails. I found it much easier to extract the bad state, reproduce it in a unit test, and then debug from there rather than debugging inside the host code.
That said, maybe there's a middle ground? I could add a compile time option instead of just relying on debug mode. That way I could compile in release mode with the option to find the bad state and then compile in debug to actually find the point of failure in the PTE solve. What are your thoughts?
I'm somewhat concerned that the PTE solver needs to run whether or not something goes wrong, so I worry about about these PORTABLE_FAIL checks. I'd prefer warnings and returning a failure status.
Are you saying you'd prefer the PTE solver to fail at the point where an issue is reached rather than running and then returning a nonsense result?
Actually the opposite. ;) I was hoping for the PTE solver to continue running even in debug mode. In case the failure state is something the host code needs to be robust for.
Normally I'd agree with you, but in my experience of debugging the PTE solver, I found the indirection and indexing to be difficult to navigate in the debugger, hence creating the print statements for when the PTE solver fails. I found it much easier to extract the bad state, reproduce it in a unit test, and then debug from there rather than debugging inside the host code.
That makes complete sense. I have to admit I'm concerned by how difficult this was to navigate.
That said, maybe there's a middle ground? I could add a compile time option instead of just relying on debug mode. That way I could compile in release mode with the option to find the bad state and then compile in debug to actually find the point of failure in the PTE solve. What are your thoughts?
That makes sense to me!
I'm somewhat concerned that the PTE solver needs to run whether or not something goes wrong, so I worry about about these PORTABLE_FAIL checks. I'd prefer warnings and returning a failure status.
Are you saying you'd prefer the PTE solver to fail at the point where an issue is reached rather than running and then returning a nonsense result?
Actually the opposite. ;) I was hoping for the PTE solver to continue running even in debug mode. In case the failure state is something the host code needs to be robust for.
Ah okay I think we're on the same page then. I'll limit the assert failures to malformed inputs and rely on the checks in get_sg*
. If we want, we can probably adapt those checks in the future to be an extra layer on the outputs of the PTE solver itself. I'll make an issue for that.
I added a Finalize()
for each EOS as well as freeing all the memory but I get this failure:
double free or corruption (!prev)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
closure_unit_tests is a Catch2 v3.0.1 host application.
Run with -? for options
-------------------------------------------------------------------------------
Scenario: Density-Temperature PTE Solver
Given: Equations of state
Given: A difficult state for the density-temperature PTE solver
Then: The PTE solver should converge
-------------------------------------------------------------------------------
/usr/projects/eap/users/jhp/XCAP/singularity-eos/github/singularity-eos/test/test_closure_pte.cpp:112
...............................................................................
/usr/projects/eap/users/jhp/XCAP/singularity-eos/github/singularity-eos/test/test_closure_pte.cpp:112: FAILED:
{Unknown expression after the reported line}
due to a fatal error condition:
SIGABRT - Abort (abnormal termination) signal
===============================================================================
test cases: 1 | 1 failed
assertions: 2 | 1 passed | 1 failed
Aborted
This is true whether the test is run in isolation or as part of the suite or as a standalone executable.
I think I figured out the issue. I forgot that PORTABLE_MALLOC
has no concept of the type being allocated and needs the size of the type.
@dholladay00 I modified the tests to address your comments. I also broke out some code into functions so I can start a collection of PTE states that fail to converge. I have my next state in there already, but I'll back burner it for now and create an issue so we can get this merged in.
I'm worried that I might go down a rabbit hole of PTE issues if I keep running until I don't get any issues. I'll probably scale back the ambitions of my EAP MR and provide simpler tests of the functionality.
@jhp-lanl we have gone back and forth on the portability of the memory, but now that I look more closely this test contains no device kernels, it is all host. Either we need to run the solve in a kernel (could do a length 1 portable for) or make the memory host only.
The PTE test should be running on device now. I'll check the gitlab CI here in a bit to make sure it's running correctly (might need somebody to run the Roci runners).
I think I got all the bugs worked out with running the test on device.
The test should be a good platform to add tests that used to break the PTE solver (or still do while we're figuring out a solution). And it can be a bit of a template for using the new ports-of-call
features for allocating on device without having to mess with lots of Kokkos views and #IFDEF
statements.
great work @jhp-lanl! This not only fixes a long-standing bug in the solver, but it uses the latest best practices for portability and adds improvements to the configure/build system. In addition, this provides a tool to explore problematic PTE inputs separate from the context of a host code!
Assuming this passes CI and pending review by those that understand the physics better than I, this is ready! 👍
Assuming this passes CI and pending review by those that understand the physics better than I, this is ready! 👍
I think I'm still having some issues with the CI on gitlab unfortunately. I thought I had fixed these issues on my build when executing on device, but I'll keep working to reproduce
Thanks for this, @jhp-lanl having unit tests here is a big improvement. As is the nasty bugs you caught. I still want @pdmullen and @jdolence 's eyes on this. But I'm happy with this. I also ran this branch through riot and ran some strenuous test problems, which still look good.
@Yurlungur: Jeff (@jhp-lanl) and I discussed the timeline for this PR. Given the timeline for a release into EAP, we are planning on merging this no later than EOD Monday and beginning the release process immediately following. This gives some, but not much time for review but we really can't wait any longer than that. If fixes are needed beyond what is caught on Monday, they will need to go in after the release.
Thanks for this, @jhp-lanl having unit tests here is a big improvement. As is the nasty bugs you caught. I still want @pdmullen and @jdolence 's eyes on this. But I'm happy with this. I also ran this branch through riot and ran some strenuous test problems, which still look good.
@Yurlungur: Jeff (@jhp-lanl) and I discussed the timeline for this PR. Given the timeline for a release into EAP, we are planning on merging this no later than EOD Monday and beginning the release process immediately following. This gives some, but not much time for review but we really can't wait any longer than that. If fixes are needed beyond what is caught on Monday, they will need to go in after the release.
Yes understood. I believe @pdmullen at least plans to review Monday. We shouldn't wait for @jdolence .
Thanks for this, @jhp-lanl having unit tests here is a big improvement. As is the nasty bugs you caught. I still want @pdmullen and @jdolence 's eyes on this. But I'm happy with this. I also ran this branch through riot and ran some strenuous test problems, which still look good.
@Yurlungur: Jeff (@jhp-lanl) and I discussed the timeline for this PR. Given the timeline for a release into EAP, we are planning on merging this no later than EOD Monday and beginning the release process immediately following. This gives some, but not much time for review but we really can't wait any longer than that. If fixes are needed beyond what is caught on Monday, they will need to go in after the release.
Yes understood. I believe @pdmullen at least plans to review Monday. We shouldn't wait for @jdolence .
@Yurlungur I'd really like @jdolence's eyes on it as some point for a sanity check if at all possible, even if it's after the merge. Sometimes the original author has knowledge that is hard to obtain by reading alone.
Thanks for this, @jhp-lanl having unit tests here is a big improvement. As is the nasty bugs you caught. I still want @pdmullen and @jdolence 's eyes on this. But I'm happy with this. I also ran this branch through riot and ran some strenuous test problems, which still look good.
@Yurlungur: Jeff (@jhp-lanl) and I discussed the timeline for this PR. Given the timeline for a release into EAP, we are planning on merging this no later than EOD Monday and beginning the release process immediately following. This gives some, but not much time for review but we really can't wait any longer than that. If fixes are needed beyond what is caught on Monday, they will need to go in after the release.
That said, I still need to fix the CI. I'll work on that this afternoon.
@Yurlungur I'd really like @jdolence's eyes on it as some point for a sanity check if at all possible, even if it's after the merge. Sometimes the original author has knowledge that is hard to obtain by reading alone.
I agree. My concern is I don't know when I will be able to get him to look. We were visiting him in Ann Arbor this week and I asked him to look, though. So there's hope.
@Yurlungur I'd really like @jdolence's eyes on it as some point for a sanity check if at all possible, even if it's after the merge. Sometimes the original author has knowledge that is hard to obtain by reading alone.
I agree. My concern is I don't know when I will be able to get him to look. We were visiting him in Ann Arbor this week and I asked him to look, though. So there's hope.
I think his input will be valuable whenever we receive it.
That said, I still need to fix the CI. I'll work on that this afternoon.
CI Fixed! Array was out of bounds, and I just bit the bullet and enabled proxy variables for all my terminal sessions rather than limiting them to interactive ones.
I think this is ready to go now
:+1: Merge when ready
Merging... @dholladay00 I think we're ready for a release
PR Summary
This PR is dependent on https://github.com/lanl/singularity-eos/pull/382
This PR makes a few main changes:
The previousActually, since the newly-shrunkScaleDx
algorithm was written assuming that only one guard (small volume fraction, small temperature, minimum density, etc.) would activate at once. I believe I'm running into a test case where multiple guards are activating so they end up overwriting each other.scale
is used in each successiveif
block, they have the effect of only decreasing scale. I'll add a comment to this effect.else
block.ScaleDx
function. First, there were missing parentheses. Second, the logic was incorrect when the volume fraction of a material has already exceeded the imposed maximum. This maximum occurs at the minimum pressure on the EOS, effectively guarding against the EOS entering unphysical (thermodynamically unstable) states. However, it's possible for this maximum to change as the temperature guess changes such that an old volume fraction is above this maximum. In that case, the logic of the function was increasing the step to get back into the physical regime, but possibly at the cost of violating other checks.TestUpdate()
function has been changed so that it's a bit more clear when it's also updating the state versus when it's supposed to cache the previous state.PR Checklist
make format
command after configuring withcmake
.If preparing for a new release, in addition please check the following: