hiddenSymmetries / simsopt

Simons Stellarator Optimizer Code
https://simsopt.readthedocs.io
MIT License
83 stars 43 forks source link

clear spec cache to avoid sporadic mismatch #431

Closed smiet closed 2 weeks ago

smiet commented 3 weeks ago

Edited for improved explanation

This should fix the sporadic failure in the CI #357

The issue: f90wrap keeps a Python-side cache of the wrapped Fortran arrays, and the lookup logic is roughly as follows:

handle = get_specific_array_handle(stuff)  # pointer to the array
if handle in self._arrays:  # dict mapping handles to arrays
    return self._arrays[handle]
else:
    array = get_array_in_complicated_f90wrapway(stuff)
    self._arrays[handle] = array
    return array

Now for the handle collision: if SPEC has run before, and an array has been accessed before (and therefore cached), and by happenstance an array in the second run is placed at the same memory location as before, a handle collision occurs: Python assumes the array is identical to the one it accessed before, including its shape and size.

When Python then accesses this array, it can actually be a different array that just happens to sit at the same location; or it can be the right array whose shape was never updated on the Python side, producing an out-of-bounds error when reading past the Python-defined bounds.

The handle is basically the pointer to the array, cast as an int. It is unlikely to be exactly the same EXCEPT when the code runs over and over again, and a conveniently shaped hole has just been freed in memory by a deallocation.
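The failure mode can be reproduced with a toy cache (plain Python, no f90wrap involved; `ToyArrayCache`, `lookup`, the fake addresses, and the array shapes are all made up for illustration):

```python
import numpy as np

class ToyArrayCache:
    """Mimics a handle-keyed array cache of the kind described above."""
    def __init__(self):
        self._arrays = {}  # handle (int address) -> cached array

    def lookup(self, handle, fetch):
        # If the handle was seen before, the cached array is returned
        # without checking that it still describes the same Fortran array.
        if handle in self._arrays:
            return self._arrays[handle]
        array = fetch()
        self._arrays[handle] = array
        return array

cache = ToyArrayCache()

# First run: an array of shape (3,) lives at (pretend) address 0x1000.
first = cache.lookup(0x1000, lambda: np.zeros(3))

# Second run: a *different* array of shape (5,) happens to be allocated
# at the same address. The stale cache entry is returned instead.
second = cache.lookup(0x1000, lambda: np.zeros(5))

assert second is first        # collision: the old array is returned
assert second.shape == (3,)   # Python still believes the old shape
```

Reading index 3 or 4 of what the caller thinks is a length-5 array then fails exactly like the sporadic CI error.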

This also explains why the fault is mostly seen in CI: the runners have much less memory, so a new allocation is more likely to land at an address used before. Memory placement is decided by the allocator at runtime, so the failure is not reproducible.

codecov[bot] commented 3 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.00%. Comparing base (9dd34c5) to head (7bc1e4c). Report is 6 commits behind head on master.

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##           master     #431     +/-   ##
=========================================
  Coverage   91.99%   92.00%
=========================================
  Files          75       75
  Lines       13499    13504       +5
=========================================
+ Hits        12419    12424       +5
  Misses       1080     1080
```

| [Flag](https://app.codecov.io/gh/hiddenSymmetries/simsopt/pull/431/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=hiddenSymmetries) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/hiddenSymmetries/simsopt/pull/431/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=hiddenSymmetries) | `92.00% <100.00%> (+<0.01%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=hiddenSymmetries#carryforward-flags-in-the-pull-request-comment) to find out more.


andrewgiuliani commented 3 weeks ago

I don't see how this is a race condition. We don't have multiple cores modifying the same area of memory here, do we?

Also, why isn't the cache self._arrays being cleared between python runs? I would have thought that it would be empty at the start of a new run.

If the Fortran arrays in the cache aren't being freed won't this lead to a memory leak?

smiet commented 3 weeks ago

@andrewgiuliani you are right, this is not a race condition, excuse my ignorance.

F90wrap works in a funny way (I don't completely understand it), but the cache is only cleared when the python kernel quits. It sets up links to shared FORTRAN memory that persist until the kernel exits.

Normally you only run a single simulation, and the kernel quits when the optimisation has finished; it is only in testing that many SPEC instances are created and destroyed.

For example in

from simsopt.mhd import Spec
myspec1 = Spec('firstfile.sp')
myspec2 = Spec('differentfile.sp')

print(myspec1.allglobal.ext)
# prints: differentfile

the Fortran arrays in myspec1 and myspec2 are the same.

This is a limitation of f90wrap and the above code should not be used. The best we can do is make sure that loading a new state runs correctly.
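The fix amounts to invalidating the cache whenever a new SPEC state is loaded, so a recycled address can never resolve to a stale entry. A minimal sketch of the idea (the `SpecLike` class, `load_state` method, and `_arrays` attribute are illustrative stand-ins, not the actual simsopt/f90wrap internals):

```python
class SpecLike:
    """Toy wrapper illustrating cache invalidation on re-initialization."""
    def __init__(self):
        self._arrays = {}  # handle -> cached array, as in the pseudocode above

    def load_state(self, filename):
        # Drop every cached handle -> array mapping before (re)running,
        # so nothing can collide with addresses from a previous run.
        self._arrays.clear()
        # ... run the wrapped Fortran code for `filename` here ...

spec = SpecLike()
spec._arrays[0x1000] = "stale entry from a previous run"
spec.load_state("firstfile.sp")
assert len(spec._arrays) == 0  # cache starts empty for the new run
```

Every array is then re-fetched (and re-shaped) on first access after the new run, at the cost of repopulating the cache.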

Yes this is a memory leak when many Spec objects are initialized in the same kernel session, but I don't see how to avoid it.

andrewgiuliani commented 3 weeks ago

When you clear the cache, is it possible to run a call on the FORTRAN side to deallocate the allocated arrays referenced in the cache?

If not, I would suggest that this anomalous behaviour should be mentioned in a docstring somewhere.

smiet commented 3 weeks ago

@andrewgiuliani This is not possible with the current implementation of f90wrap, and I am not sure it is desirable. f90wrap is meant to wrap existing programs, giving you access to their memory as they run, with the possibility of mutating existing arrays. Forcing deallocation on the FORTRAN side from the python side would alter the program it is wrapping.

Python keeps a dictionary of arrays that have been referenced, but if a new array has the same handle, the dictionary is not updated, and referencing it out of the python-defined bounds (not FORTRAN) results in the error we see.

It is certainly a bug in the implementation that a handle collision (that is the right term!) is likely to occur when arrays are deallocated and immediately reallocated on memory-constrained systems, but this touches core design choices in f90wrap, and is not something I feel comfortable fixing.

The current PR fixes the anomalous behavior that was occurring (the cache kept accumulating). The fix, though a bit draconian, prevents the collisions; and since a collision could occur for any accessed array, it is the right thing to do.

I will probably open an issue on the upstream f90wrap, but this will take a while, as the whole numpy 2.0 migration has left me very little time for science, and I have some QUASR configurations to calculate tangles in! ;)

edit: actually, this just overwrites the existing dictionary with an empty dictionary. I assume the Python garbage collector will delete it as there are no more references to it. Could someone with more python knowledge comment if this is the right thing to do (@mbkumar)?
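In CPython, rebinding to a fresh dict is indeed enough: once the old dict's reference count drops to zero it is freed immediately, and its entries are released with it, no explicit garbage-collector call needed (unless reference cycles are involved). A quick check with `weakref` (the `Holder` class is just a probe object, since plain dicts cannot be weak-referenced directly):

```python
import weakref

class Holder:
    """Probe object standing in for a cached array."""
    pass

cache = {}           # stands in for the old self._arrays
probe = Holder()
cache["handle"] = probe
ref = weakref.ref(probe)

cache = {}           # overwrite with an empty dict, as the PR does
del probe            # no references to the old entry remain

assert ref() is None  # the cached object has been reclaimed
```

Using `self._arrays.clear()` instead of rebinding would have the same effect on the entries, with the small difference that any other references to the same dict object would also see it emptied.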

andrewgiuliani commented 3 weeks ago

I would have thought that a deallocate all function could be written in on the Fortran side and called from python.

In any case I think that this is great that the bug has been identified!

mbkumar commented 2 weeks ago

@smiet , can we have a zoom call to understand what is going on? I am not getting the full picture from your description. Can't we call python garbage collector on the dictionary you are resetting?

mbkumar commented 2 weeks ago

At the least can you talk to the f90wrap developer, who might have encountered this issue before? There should be some mechanism in f90wrap to clear the cache.

smiet commented 2 weeks ago

@mbkumar yes, let's zoom tomorrow (Monday), was away from internet this weekend. Can you set it up? I'm having a little trouble with email on my phone right now, available all your morning

andrewgiuliani commented 2 weeks ago

I'd like to be present during the zoom too to better understand the problem

smiet commented 2 weeks ago

Notes from the discussion resulted in issue #434.

All tests are passing consistently. This PR doesn't solve the root cause, but it applies a fix that stops a known bug. I suggest we move further discussion of this topic to #434 and merge this fix so we can stop seeing sporadic failures in our CI. @mbkumar @landreman @andrewgiuliani