usnistgov / fipy

FiPy is a Finite Volume PDE solver written in Python
http://pages.nist.gov/fipy/en/latest

Poor garbage collection with petsc4py 3.18.3 (was "Memory leak in `term.justErrorVector()`", but this isn't strictly a leak) #896

Closed: Maxwellfire closed this issue 1 year ago

Maxwellfire commented 1 year ago

I'm trying to use a slightly unorthodox solving method that requires me to get the error vector from my set of equations repeatedly. Each time term.justErrorVector() is called, it consumes some memory that is never freed.

Here is some simple code that reproduces the memory increase for me:

import fipy
mesh = fipy.Grid1D(nx=100, dx=1)
var1 = fipy.CellVariable(mesh=mesh, hasOld=True, value = 1.0)
eq1 = fipy.DiffusionTerm(var = var1, coeff=1.) == 0

for n in range(1,50000):
    eq1.justErrorVector()

This takes a couple of minutes to run and consumes about 5 GiB of RAM on my system.

I was going to try the other solver implementations (pysparse and scipy), but pysparse won't run on Python 3 and scipy throws an error (which I can submit a bug report for separately).

Maxwellfire commented 1 year ago

Interestingly, this does not appear to happen on my other machine, which has the same version of FiPy but a different version of petsc4py/petsc. The working machine is using 3.17.4 real_he17b7f0_100 from conda-forge and the failing machine is using 3.18.2 real_hce8fbc3_100 from conda-forge.

guyer commented 1 year ago

> Interestingly, this does not appear to happen on my other machine, which has the same version of FiPy but a different version of petsc4py/petsc. The working machine is using 3.17.4 real_he17b7f0_100 from conda-forge and the failing machine is using 3.18.2 real_hce8fbc3_100 from conda-forge.

Given that, the leak is likely to be in that build of PETSc and not something FiPy has control over. What OS are you running? Some variant of Linux, I assume? What version of Python? FWIW, this does not leak on my MacBook with Python 3.10.

guyer commented 1 year ago

> scipy throws an error (which I can submit a bug report for separately).

If the error is RuntimeError: Factor is exactly singular, then this is because your equation admits an infinite number of solutions. Different solvers respond differently, but ultimately, the problem is not well-defined. Add a Dirichlet constraint or a TransientTerm.
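
For example, a minimal sketch based on your MWE (either change makes the solution unique; adjust the values to suit your actual problem):

import fipy

mesh = fipy.Grid1D(nx=100, dx=1)
var1 = fipy.CellVariable(mesh=mesh, hasOld=True, value=1.0)

# option 1: pin the value on one boundary (Dirichlet constraint)
var1.constrain(0., where=mesh.facesLeft)
eq1 = fipy.DiffusionTerm(var=var1, coeff=1.) == 0

# option 2: make the problem transient instead
# eq1 = fipy.TransientTerm(var=var1) == fipy.DiffusionTerm(var=var1, coeff=1.)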

With a Dirichlet condition, I do see a small monotonic increase in RAM usage with --scipy, from ~54 MiB to ~57 MiB. I hesitate to call this a leak without further diagnosis; it could well be that the Python garbage collector just hasn't triggered yet. With --petsc, I see a much smaller rise, from ~54 MiB to ~54.5 MiB.

Either way, 5 GiB is an insane amount of RAM usage for this tiny problem. Do both petsc4py/petsc builds use this much RAM?

Maxwellfire commented 1 year ago

> Given that, the leak is likely to be in that build of PETSc and not something FiPy has control over. What OS are you running? Some variant of Linux, I assume? What version of Python? FWIW, this does not leak on my MacBook with Python 3.10.

It does seem likely that the issue is in petsc. Part of the reason that I submitted a bug here is that I don't know the internals of how FiPy calls petsc4py well enough to submit a detailed bug report to them.

I've attached two files below with my environments that do and don't work:

environment_leak.txt environment_no_leak.txt

Both are using Python 3.10.8 and were generated from the following environment.yml using conda, with different versions of petsc4py:

name: ionic

channels:
  - conda-forge
  - defaults

dependencies:
  - python
  - matplotlib
  - pandas
  - jupyterlab
  - bokeh<3.0
  - fipy
  - pytables
  - spyder
  - dotmap
  - petsc4py=3.18.2
 # - petsc4py=3.17.4

The version that works consumes only a couple of KiB of additional RAM as it runs. The version that does not work rapidly consumes memory, at around 1 GiB per 30 seconds.

I am running this on a Kubuntu 22.10 system.

Maxwellfire commented 1 year ago

For the scipy solver, the issue was that I was setting os.environ['FIPY_SOLVERS'] = 'scipy', but then trying to use a hardcoded term.sweep(dt=dt, solver=fipy.solvers.petsc.linearGMRESSolver.LinearGMRESSolver()) call in part of my code. I guess that once you import fipy with a particular solver, you can't then use a different solver. I assume this is expected behavior and not a bug?

If I don't specify the solver explicitly when I import, then switching between solvers on the fly seems to work.
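
In case it helps others, a minimal sketch of the pattern that now works for me (adapted from the MWE above, not taken verbatim from my code; the key is to pick the solver suite once, before importing fipy, and let sweep() use that suite's default solver):

import os
os.environ['FIPY_SOLVERS'] = 'petsc'   # must be set before fipy is imported

import fipy

mesh = fipy.Grid1D(nx=100, dx=1)
var1 = fipy.CellVariable(mesh=mesh, hasOld=True, value=1.0)
var1.constrain(0., where=mesh.facesLeft)
eq1 = fipy.TransientTerm(var=var1) == fipy.DiffusionTerm(var=var1, coeff=1.)

# omitting solver= lets FiPy pick the default solver from the selected suite,
# instead of hard-coding a solver from a different package
res = eq1.sweep(var=var1, dt=1.0)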

guyer commented 1 year ago

> I guess that once you import fipy with a particular solver, you can't then use a different solver. I assume this is expected behavior and not a bug?

I think under some circumstances, it could work, but mixing solver packages is not intended usage.

guyer commented 1 year ago

> It does seem likely that the issue is in petsc. Part of the reason that I submitted a bug here is that I don't know the internals of how FiPy calls petsc4py well enough to submit a detailed bug report to them.
>
> I've attached two files below with my environments that do and don't work:

OK, thanks for this. If I can reproduce it, I'll submit a ticket against petsc(4py); if I can't, I'll try to tell you how to create a minimal reproducible example that you can submit.

guyer commented 1 year ago

I can, indeed, reproduce this on both Debian and macOS. Something is very wrong with that petsc(4py) build. I'll work on filing a ticket.

Maxwellfire commented 1 year ago

Thank you! I also just checked, and the problem reproduces with the latest conda-forge build, 3.18.3.

guyer commented 1 year ago

Running with mprof shows a leak that seemed very familiar.

I eventually remembered trilinos/Trilinos#2327 from five years ago. The same "fix" is not effective, however.
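
For anyone who wants to reproduce the measurement without the mprof command line, here is a minimal sketch using the same memory_profiler package (the loop count and sampling interval are arbitrary; a steadily rising trace is the symptom in question):

from memory_profiler import memory_usage
import fipy

mesh = fipy.Grid1D(nx=100, dx=1)
var1 = fipy.CellVariable(mesh=mesh, hasOld=True, value=1.0)
eq1 = fipy.DiffusionTerm(var=var1, coeff=1.) == 0

def sweep_many(n=5000):
    for _ in range(n):
        eq1.justErrorVector()

# sample resident memory every 0.1 s while the loop runs
usage = memory_usage((sweep_many, (), {}), interval=0.1)
print("start %.1f MiB, end %.1f MiB" % (usage[0], usage[-1]))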

tkphd commented 1 year ago

Thanks for filing this ticket, @Maxwellfire! I encountered a memory leak with petsc4py 3.18 as well, but got lost in other things instead of tracking it down. I see the same leak from your MWE with petsc4py 3.18.3 (conda-forge), but no leaks with 3.16 or 3.17. This helps a lot!

guyer commented 1 year ago

While not a proper fix, this issue can be mitigated by adding a call to PETSc.garbage_cleanup() after every iteration. E.g.,

import fipy
from petsc4py import PETSc
mesh = fipy.Grid1D(nx=100, dx=1)
var1 = fipy.CellVariable(mesh=mesh, hasOld=True, value = 1.0)
eq1 = fipy.DiffusionTerm(var = var1, coeff=1.) == 0

for n in range(1,50000):
    eq1.justErrorVector()
    PETSc.garbage_cleanup()  # free PETSc objects whose destruction was deferred

I have identified some things that we should be cleaning up, but they don't completely resolve the leak (without manual garbage_cleanup()). I think this is not a bug in PETSc 3.18, although if it's an intentional change in behavior, I don't see anything obvious in the change log.

guyer commented 1 year ago

I realized I wasn't cleaning up the PETSc AO application-ordering object. Once I destroy that, too, I can get memory usage back to where it was with petsc4py 3.17.4, without calling PETSc.garbage_cleanup().

I have filed https://gitlab.com/petsc/petsc/-/issues/1309 against petsc4py, although I suspect they'll tell us to clean up after ourselves.
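
For context, a minimal petsc4py sketch of the kind of explicit cleanup involved (illustrative only, not FiPy's actual internals): objects such as an AO hold PETSc resources until they are explicitly destroy()ed.

from petsc4py import PETSc

# build a small application ordering from two index sets
app = PETSc.IS().createGeneral([3, 1, 2, 0])
petsc = PETSc.IS().createGeneral([0, 1, 2, 3])
ao = PETSc.AO().createBasic(app, petsc)

# ... use the ordering, e.g. ao.app2petsc() / ao.petsc2app() ...

# dropping the Python references is not always enough; destroy explicitly
ao.destroy()
app.destroy()
petsc.destroy()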