illinois-ceesd / mirgecom

MIRGE-Com is the workhorse simulation application for the Center for Exascale-Enabled Scramjet Design at the University of Illinois.

Memory growth #839

Open inducer opened 1 year ago

inducer commented 1 year ago

I understand that there is some type of memory growth occurring.

From the 2023-02-17 dev meeting notes, I gather that memory growth only occurs when using the memory pool.

Possibly related: #212.

cc @matthiasdiener

inducer commented 1 year ago

Is the growth reflected in

I.e. do those increase timestep-over-timestep?

If there is growth, can you identify which bins in the memory pool are affected? Can you identify which allocations? Python makes it straightforward to attach stack traces to allocations.
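For example, the standard-library tracemalloc module records a traceback for every allocation that goes through Python's allocator (it won't see the OpenCL/SVM side, but it shows which Python code is holding on to growing memory). A minimal sketch, not mirgecom-specific:

```python
import tracemalloc

tracemalloc.start(25)  # keep up to 25 stack frames per allocation

# ... run a number of time steps here ...

snapshot = tracemalloc.take_snapshot()
# Largest groups of live allocations, grouped by allocating traceback:
for stat in snapshot.statistics("traceback")[:10]:
    print(f"{stat.size / 1e6:.1f} MB in {stat.count} blocks")
    for line in stat.traceback.format():
        print(line)
```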

Do we know if this growth is of "array" memory or "other" memory?

How do your findings change if you call free_held between steps?
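I.e. something like the following, where the stepping loop and names are placeholders for whatever your driver actually does (free_held is the documented pool method):

```python
# Placeholder loop: "svm_pool" is the SVMPool/MemoryPool instance the array
# context was created with, "advance" stands in for the real stepper.
for istep in range(nsteps):
    fields = advance(fields, t, dt)
    svm_pool.free_held()  # release blocks the pool holds but is not handing out
```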

What is the simplest driver that exhibits the growth? I gather from @lukeolson that, maybe, examples/wave-lazy.py may be affected. Could you please confirm? Is grudge/examples/wave/wave-min-mpi.py affected as well? Is, say, vortex-mpi.py affected? Grudge's Euler?

lukeolson commented 1 year ago

Also, it looks like set_trace is exposed, so you could get some additional information from that (including bin size data): https://github.com/inducer/pyopencl/blob/main/src/mempool.hpp#L164
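Assuming the Python binding exposes it under the same name as in that header (I have not checked the exact spelling), turning it on would look roughly like:

```python
import pyopencl as cl
import pyopencl.tools as cl_tools

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Assumption: set_trace is available on the Python pool object, mirroring the
# C++ memory_pool::set_trace linked above; treat this as an illustration only.
pool = cl_tools.SVMPool(cl_tools.SVMAllocator(ctx, queue=queue))
pool.set_trace(True)  # log allocations/frees as they happen, including bin sizes
```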

matthiasdiener commented 1 year ago
> that memory growth only occurs when using the memory pool

The growth happens both with and without the pool. Here is an example with drivers_y2-prediction/smoke_test_ks (lazy-eval), 1 rank, Lassen CPU (y-axes are in "MByte"):

With SVM mempool:

![image](https://user-images.githubusercontent.com/1772435/220432736-89515696-f96d-4e28-8bf1-0d29699e9e73.png)

With non-pool SVM:

![image](https://user-images.githubusercontent.com/1772435/220432860-6deca9f4-7e9e-4f7b-bb1e-8f374542bd8b.png)
inducer commented 1 year ago

Btw, please keep vertical space in mind when writing issue text. Write claims, and hide supporting evidence under a `<details>`. I've done that for your comment above.

matthiasdiener commented 1 year ago

Tracing the memory pool allocations with set_trace (and using https://github.com/illinois-ceesd/mirgecom/pull/840) with the same config as before (1 rank, smoke_test_ks, CPU) revealed some interesting information:

![image](https://user-images.githubusercontent.com/1772435/220485497-2d650cac-3468-416a-b1b3-573a36e9bfb9.png)

- logpyle file: [svm_pool_trace.log](https://github.com/illinois-ceesd/mirgecom/files/10798817/svm_pool_trace.log)
- trace log (up to step 42): https://gist.github.com/matthiasdiener/c7ebdc759829daebb13f790d2de15373

> How do your findings change if you call [free_held](https://documen.tician.de/pyopencl/tools.html#pyopencl.tools.SVMPool.free_held) between steps?

Looking at the graph above, it seems like freeing the held memory may not help?

> I gather that you are using some (unspecified?) system/process-level metric of memory usage.

The memory usage I initially added here is the RSS high water mark measured with https://github.com/illinois-ceesd/logpyle/pull/79 (= max_rss).
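For reference, roughly the same quantity can be read directly from the OS via the standard library (an illustration of what max_rss means, not necessarily how the logpyle PR implements it):

```python
import resource

# Peak resident set size of the current process so far.
# Note: ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"max RSS: {peak / 1024:.1f} MB")  # assuming Linux (kB)
```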

inducer commented 1 year ago

Thanks. This tally of pool-held memory means (to me) that the issue is very likely "above" the pool, i.e. in Python. In other words, replacing the memory allocation scheme used by the pool would not help, or at least not much.

My read of this is that some member of a group of objects that cyclically refer to each other holds a reference to our arrays. This follows because Python's refcounting frees objects that are not part of a reference cycle effectively instantaneously, i.e. as soon as the last reference to them is dropped.
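To illustrate the distinction with a toy example (not mirgecom code):

```python
import gc
import weakref

class Holder:
    pass

# No cycle: dropping the last reference frees the object immediately.
a = Holder()
alive = weakref.ref(a)
del a
print(alive() is None)  # True -- refcounting reclaimed it right away

# Cycle: two objects referencing each other stay alive after the last
# external reference is dropped, until the cycle collector runs.
b, c = Holder(), Holder()
b.partner, c.partner = c, b
alive = weakref.ref(b)
del b, c
print(alive() is None)  # False -- anything they reference (e.g. arrays) stays alive
gc.collect()
print(alive() is None)  # True -- only the cycle collector frees them
```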

To validate the latter conclusion, you could try calling gc.collect() every $N$ time steps to see if that helps free those objects. (Of course, that won't do much if there is some cyclic behavior in what references are held.)
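Schematically (the stepper here is just a stand-in for the real driver loop):

```python
import gc

def advance(state):
    return state  # stand-in for the actual RHS evaluation / stepper

state, nsteps, collect_interval = 0.0, 100, 10

for istep in range(1, nsteps + 1):
    state = advance(state)
    if istep % collect_interval == 0:
        gc.collect()  # reclaim cyclically-referenced holders of large arrays
```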

Assuming the above conclusion is correct, the way to address this would be to find the objects referring to the arrays and make it so they no longer hold those references.

matthiasdiener commented 1 year ago

> What is the simplest driver that exhibits the growth? I gather from @lukeolson that, maybe, examples/wave-lazy.py may be affected. Could you please confirm? Is grudge/examples/wave/wave-min-mpi.py affected as well? Is, say, vortex-mpi.py affected? Grudge's Euler?

I've seen the growth in all drivers I tried, including the simplest ones:

The growth only happens in lazy mode, not eager. The specific memory pool used (SVM, CL buffer) or lazy actx class do not seem to matter.

Graph for mirgecom's wave:

![image](https://user-images.githubusercontent.com/1772435/220742274-587503e4-90db-4f2d-ac1a-842317bf505b.png)
matthiasdiener commented 1 year ago

> To validate the latter conclusion, you could try calling gc.collect() every $N$ time steps to see if that helps free those objects. (Of course, that won't do much if there is some cyclic behavior in what references are held.)

It does seem that running gc.collect mitigates this issue for us. The following results are for smoke_test_ks, but the picture is similar for the simpler test cases.

GC collect every 10 steps (no measurable performance overhead):

![image](https://user-images.githubusercontent.com/1772435/220786416-8a82672d-0d1e-4056-85ee-650021375841.png)

GC collect every 1 step (~25% performance overhead):

![image](https://user-images.githubusercontent.com/1772435/220786754-cda073cf-b2a4-4c7a-8be6-0f2aa292b03c.png)
inducer commented 1 year ago

It's important to note that gc.collect is not a solution, but a workaround. It's quite expensive (and should be unnecessary), and it only masks the problem.

MTCam commented 1 year ago

> It's important to note that gc.collect is not a solution, but a workaround. It's quite expensive (and should be unnecessary), and it only masks the problem.

👍

I like your idea of running it every $N$ steps, though. This workaround can likely keep us running comfortably in the interim. afaict, after injecting this fix into the prediction driver, the code infrastructure is now capable of production-scale prediction-like runs, and at the very least in good shape for February trials (leaps and bounds over last year). Gigantic cool.

matthiasdiener commented 1 year ago

A few more updates for mirgecom's wave (w/ lazy eval):

Edit: