Issue opened by inducer 1 year ago (status: open)
Is the growth reflected in the memory pool's statistics? I.e., do those increase timestep-over-timestep?
If there is growth, can you identify which bins in the memory pool are affected? Can you identify which allocations? Python makes it straightforward to attach stack traces to allocations.
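For illustration, the stdlib `tracemalloc` module can do exactly that (a generic sketch, independent of pyopencl; the `bytearray` allocations here stand in for array buffers):

```python
import tracemalloc

# Record up to 10 frames of traceback per allocation site.
tracemalloc.start(10)

data = [bytearray(1_000_000) for _ in range(5)]  # stand-in for array allocations

snapshot = tracemalloc.take_snapshot()
# Group allocations by traceback and show the largest sites.
for stat in snapshot.statistics("traceback")[:3]:
    print(f"{stat.size} bytes in {stat.count} blocks, allocated at:")
    for line in stat.traceback.format():
        print(line)
```

Comparing snapshots taken at successive time steps (`snapshot.compare_to`) would show which allocation sites grow.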
Do we know if this growth is of "array" memory or "other" memory?
How do your findings change if you call `free_held` between steps?
What is the simplest driver that exhibits the growth? I gather from @lukeolson that, maybe, `examples/wave-lazy.py` may be affected. Could you please confirm? Is `grudge/examples/wave/wave-min-mpi.py` affected as well? Is, say, `vortex-mpi.py` affected? Grudge's Euler?
Also, it looks like `set_trace` is exposed, so you could get some additional information from it, including bin size data: https://github.com/inducer/pyopencl/blob/main/src/mempool.hpp#L164
> … that memory growth only occurs when using the memory pool
The growth happens both with and without the pool. Here is an example with `drivers_y2-prediction/smoke_test_ks` (lazy eval), 1 rank, Lassen CPU (y-axes are in MByte):
Both seem to "level off" after ~140 steps, but memory is likely to grow further. See e.g. this graph for a different Lassen run (SVM pool):
Btw, please keep vertical space in mind when writing issue text. State your claims, and hide supporting evidence under a `<details>` block. I've done that for your comment above.
Tracing the memory pool allocations with `set_trace` (and using https://github.com/illinois-ceesd/mirgecom/pull/840) with the same config as before (1 rank, `smoke_test_ks`, CPU) revealed some interesting information:

```
[pool] allocation of size 1511472 required new memory
```
I gather that you are using some (unspecified?) system/process-level metric of memory usage.
The memory usage I initially added here is the RSS high-water mark (= `max_rss`), measured with https://github.com/illinois-ceesd/logpyle/pull/79.
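For reference, the same high-water mark is available from the stdlib `resource` module (a sketch, separate from logpyle's implementation; note the units are platform-dependent):

```python
import resource

# RSS high-water mark of the current process since it started.
# Units: kilobytes on Linux, bytes on macOS.
maxrss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"max RSS so far: {maxrss}")
```

Because this is a high-water mark, it can only ever increase; a plateau in `max_rss` does not by itself prove that current usage stopped growing.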
Thanks. This tally of pool-held memory means (to me) that the issue is very likely "above" the pool, i.e. in Python. I.e., replacing the memory allocation scheme used by the pool should not help, or at least not much.
My read of this is that some member of a group of objects that cyclically refer to each other holds a reference to our arrays. This follows because Python's refcounting frees objects without cyclic referents effectively instantaneously, i.e. as soon as a reference to them is no longer being held.
To validate the latter conclusion, you could try calling `gc.collect()` every $N$ time steps to see if that helps free those objects. (Of course, that won't do much if there is some cyclic behavior in what references are held.)
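A minimal stdlib demonstration of both effects (refcounting frees acyclic objects immediately; objects in a reference cycle wait for the cyclic collector):

```python
import gc
import weakref

class Node:
    pass

# Acyclic: freed as soon as the last reference goes away.
n = Node()
r = weakref.ref(n)
del n
assert r() is None  # refcounting freed it immediately

# Cyclic: two objects referring to each other survive `del`...
a, b = Node(), Node()
a.other, b.other = b, a
ra = weakref.ref(a)
del a, b
assert ra() is not None  # still alive, kept by the cycle

# ...until the cyclic garbage collector runs.
gc.collect()
assert ra() is None
```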
Assuming the above conclusion is correct, the way to address this would be to find the objects referring to the arrays and make it so they no longer hold those references.
The `gc` module can help: https://docs.python.org/3/library/gc.html#gc.get_referrers

> What is the simplest driver that exhibits the growth? I gather from @lukeolson that, maybe, `examples/wave-lazy.py` may be affected. Could you please confirm? Is `grudge/examples/wave/wave-min-mpi.py` affected as well? Is, say, `vortex-mpi.py` affected? Grudge's Euler?
I've seen the growth in all drivers I tried, including the simplest ones: `wave`, `wave-mpi`, `euler/vortex`, `wave/wave-op-mpi`.
The growth only happens in lazy mode, not eager. The specific memory pool used (SVM, CL buffer) or lazy actx class do not seem to matter.
Graph for mirgecom's wave:
> To validate the latter conclusion, you could try calling `gc.collect()` every N time steps to see if that helps free those objects. (Of course, that won't do much if there is some cyclic behavior in what references are held.)
It does seem that running `gc.collect` mitigates this issue for us. The following results are for `smoke_test_ks`, but it is similar for the simpler test cases.
It's important to note that `gc.collect` is not a solution, but a workaround. It's quite expensive (and should be unnecessary), and it only masks the problem.
> It's important to note that `gc.collect` is not a solution, but a workaround. It's quite expensive (and should be unnecessary), and it only masks the problem.

👍
I like your idea of running it every $N$ steps, though. This workaround can likely keep us running comfortably in the interim. Afaict, after injecting this fix into the prediction driver, the code infrastructure is now capable of production-scale prediction-like runs, and is at the very least in good shape for February trials (leaps and bounds over last year). Gigantic cool.
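The every-$N$-steps workaround might look like this in a generic time-stepping loop (a sketch; `advance` and the interval are hypothetical placeholders, not mirgecom's actual driver API):

```python
import gc

GC_INTERVAL = 100  # collect every N steps; tune against collection cost

def advance(state, istep):
    """Hypothetical stand-in for one lazy-evaluated time step."""
    return state + 1

state = 0
for istep in range(1000):
    state = advance(state, istep)
    # Periodically break reference cycles that refcounting alone can't free.
    if istep % GC_INTERVAL == GC_INTERVAL - 1:
        gc.collect()
```

Choosing a larger interval amortizes the (expensive) full collection over more steps, at the cost of a higher transient memory footprint between collections.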
A few more updates for mirgecom's wave (w/ lazy eval):

- `gc.garbage` is empty (which is expected, I think).
- With `gc.set_debug(gc.DEBUG_SAVEALL)`, `gc.garbage` contains ~62,000 objects after the first time step. Each subsequent time step adds about ~1,000 objects. Is my assumption correct that those objects are the ones we suspect of having circular references (and of holding references to arrays)?
- I was adapting this code https://code.activestate.com/recipes/523004-find-cyclical-references/ to check whether there are array references in the objects with circular references, but this appears to be extremely time-consuming.

Edit:
I understand that there is some type of memory growth occurring.
From the 2023-02-17 dev meeting notes, I gather that
Possibly related: #212.
cc @matthiasdiener