illinois-ceesd / mirgecom

MIRGE-Com is the workhorse simulation application for the Center for Exascale-Enabled Scramjet Design at the University of Illinois.

Pocl memory leaks (not just) during CI #212

Open · MTCam opened this issue 3 years ago

MTCam commented 3 years ago

This diffusion test, which uses over-integration, seems to require a great deal of memory in CI, causing a subsequent crash of pytest, without error messages, in the CI for #95.

Here's a plot of the memory usage of pytest with and without the diffusion accuracy test. The blue data has the diffusion accuracy test disabled. When the diffusion accuracy test is disabled, CI in #95 runs to completion with all tests passing. When the diffusion test is enabled, pytest crashes inexplicably in the next suite/file.

[Screenshot: pytest memory usage with and without the diffusion accuracy test, 2021-02-03]
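(For reference, memory traces like the one above can be produced with memory-profiler's mprof, or with a small hand-rolled sampler. Below is a minimal sketch of the latter, assuming psutil is installed; the pytest command line is just a placeholder.)

# Minimal sketch: sample the resident set size (RSS) of a child process
# over time. Assumes psutil; the command line below is a placeholder.
import subprocess
import time

import psutil


def sample_rss(cmd, interval=0.1):
    """Run cmd and return a list of (elapsed_seconds, rss_bytes) samples."""
    proc = subprocess.Popen(cmd)
    ps = psutil.Process(proc.pid)
    samples = []
    t0 = time.time()
    while proc.poll() is None:
        try:
            samples.append((time.time() - t0, ps.memory_info().rss))
        except psutil.NoSuchProcess:
            break
        time.sleep(interval)
    return samples


if __name__ == "__main__":
    for t, rss in sample_rss(["python", "-m", "pytest", "test_diffusion.py"]):
        print(f"{t:.2f}s  {rss / 2**20:.1f} MiB")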

MTCam commented 3 years ago

Appears to be fixed by #211.

inducer commented 3 years ago

#211 manages to paper over this, sure, but I wouldn't call this a fix. #150 reports a similar memory growth issue. I propose we devote some time to finding out what's going on. There's no reason why memory use should grow like this.

majosm commented 3 years ago

I'm seeing different behavior on my laptop:

[Plot: memory usage of the diffusion test on majosm's laptop]

This is with my old diffusion dev environment though. I'm currently re-running with a fresh emirge.

majosm commented 3 years ago

Hmmm, nope. No memory blowup with fresh emirge either:

[Plot: memory usage with a fresh emirge]

inducer commented 3 years ago

Hmm... interesting. Are you both running the same thing and getting different results?

MTCam commented 3 years ago

I'm running it again; it's just taking a long time. I'll update as soon as the data is in.

majosm commented 3 years ago

Seems like it. I'm running pytest with -sv, but I don't see why that would make a difference. I can try a run without just to make sure.

MTCam commented 3 years ago

Seems like it. I'm running pytest with -sv, but I don't see why that would make a difference. I can try a run without just to make sure.

A little redundant, but I just wanted to make sure I ran exactly the same thing. This was the diffusion test only, on fresh emirge/master.

[Screenshot: memory usage, diffusion test only on fresh emirge/master, 2021-02-03]

matthiasdiener commented 3 years ago

Which OpenCL runtime are you using?
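(One quick way to check, as a minimal sketch: create_some_context honors PYOPENCL_CTX, so running this in the same environment shows which platform/device gets picked.)

# Minimal sketch: print the OpenCL platform and device pyopencl selects.
import pyopencl as cl

ctx = cl.create_some_context()  # honors PYOPENCL_CTX / interactive selection
for dev in ctx.devices:
    print(dev.platform.name, "|", dev.name, "|", dev.version)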

majosm commented 3 years ago

We're both running with <pyopencl.Device 'pthread' on 'Portable Computing Language'>.

Our environments also look pretty similar:

diff --git a/conda_list_matt.txt b/conda_list_mike.txt
index e5d90c2..3d98384 100644
--- a/conda_list_matt.txt
+++ b/conda_list_mike.txt
@@ -1,4 +1,4 @@
 #
 # Name                    Version                   Build  Channel
 alabaster                 0.7.12                   pypi_0    pypi
@@ -73,9 +73,9 @@ pocl                      1.6                  h4841041_1    conda-forge
 psutil                    5.8.0                    pypi_0    pypi
 pudb                      2020.1                   pypi_0    pypi
 py                        1.10.0                   pypi_0    pypi
-pybind11                  2.6.2                    pypi_0    pypi
 pycodestyle               2.6.0                    pypi_0    pypi
 pyflakes                  2.2.0                    pypi_0    pypi
+pygments                  2.7.4                    pypi_0    pypi
 pymbolic                  2020.2                    dev_0    <develop>
 pymetis                   2020.1           py38hb45949e_2    conda-forge
 pyopencl                  2021.1           py38h73a25cc_0    conda-forge
@@ -84,6 +84,7 @@ pytato                    2020.1.1                  dev_0    <develop>
 pytest                    6.2.2                    pypi_0    pypi
 pytest-pudb               0.7.0                    pypi_0    pypi
 python                    3.8.6           h624753d_5_cpython    conda-forge
+python-dateutil           2.8.1                    pypi_0    pypi
 python_abi                3.8                      1_cp38    conda-forge
 pytools                   2021.1             pyhd3deb0d_0    conda-forge
 pytz                      2021.1                   pypi_0    pypi

Not sure what's up with the pybind11 difference though.

inducer commented 3 years ago

Not sure what's up with the pybind11 difference though.

That shouldn't matter. pybind11 is only needed when bindings are built from source (i.e., it's not used at runtime by anything).

majosm commented 3 years ago

I'm seeing leaky behavior similar to @MTCam's when I run on lassen. Even with the diffusion operator call commented out and the RHS set to 0, the memory usage still grows pretty quickly (black line below). Interestingly, if I comment out the timestepping loop, most of the growth disappears (blue line).

[Plot: memory usage with zero RHS (black) vs. timestepping loop commented out (blue)]

If I add more test cases (orders 4 and 5) it keeps growing:

[Plot: memory usage with additional test cases (orders 4 and 5)]

Same if I instead increase the number of timesteps in each test (50 to 150):

[Plot: memory usage with 150 timesteps per test]
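(For reference, the stripped-down experiment is roughly the following; a minimal sketch in which plain pyopencl arrays stand in for mirgecom's DOFArray and a trivial explicit update stands in for the real RK4 stepper. Array size and step count are arbitrary placeholders; watching RSS while this runs, e.g. under mprof, exercises the same allocate/free churn.)

# Minimal sketch of the zero-RHS stepping experiment. Plain pyopencl arrays
# replace mirgecom's DOFArray; a trivial Euler update replaces the stepper.
import numpy as np
import pyopencl as cl
import pyopencl.array as cla

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

nsteps = 50                        # placeholder step count
u = cla.zeros(queue, 100000, dtype=np.float64)   # placeholder "state"


def zero_rhs(t, u):
    # Stand-in for the diffusion operator: allocates a fresh zero array on
    # every call, which is where the allocator churn comes from.
    return cla.zeros_like(u)


t, dt = 0.0, 1e-3
for istep in range(nsteps):
    u = u + dt * zero_rhs(t, u)    # each step allocates temporaries
    t += dt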

majosm commented 3 years ago

Also interesting -- I changed the test to have it explicitly construct a PyOpenCLArrayContext so I could try using a different allocator. Here's the default allocator (black) vs. memory pool allocator (blue):

[Plot: memory usage with the default allocator (black) vs. the memory pool allocator (blue)]
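(For reference, the allocator swap amounts to something like the following; a minimal sketch, assuming the meshmode-era PyOpenCLArrayContext constructor that accepts an allocator argument.)

# Minimal sketch: construct a PyOpenCLArrayContext with pyopencl's memory
# pool instead of the default allocator. Assumes the meshmode-era array
# context; in newer code it lives in the arraycontext package.
import pyopencl as cl
import pyopencl.tools as cl_tools
from meshmode.array_context import PyOpenCLArrayContext

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Default behavior: every allocation/free goes straight to the CL runtime.
actx_default = PyOpenCLArrayContext(queue)

# Pooled behavior: freed buffers are kept and reused rather than being
# returned to the CL runtime on every deallocation.
actx_pooled = PyOpenCLArrayContext(
    queue,
    allocator=cl_tools.MemoryPool(cl_tools.ImmediateAllocator(queue)))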

inducer commented 3 years ago

memory pool allocator

Wow. I would not have expected that. I wonder why the default allocator is so sluggish/dysfunctional in returning memory to the system.

matthiasdiener commented 3 years ago

memory pool allocator

Wow. I would not have expected that. I wonder why the default allocator is so sluggish/dysfunctional in returning memory to the system.

Would it make sense to change the default allocator to the MemoryPool in array contexts? Other people have probably hit the same issue.

inducer commented 3 years ago

IMO no. If we do that, we won't know what the problem is, we'll just have papered over it (so it could come back any moment).

majosm commented 3 years ago

Here's some valgrind output. It measures a single execution of test_diffusion_accuracy, modified to return 0 for the RHS instead of calling diffusion_operator. The spot where the memory usage jumps up appears to be the timestepping loop (because I can vary the peak memory usage by changing the number of steps).

https://gist.github.com/inducer/2956323608953cbf7b3246fe5a426d33

inducer commented 3 years ago

(@majosm: Stuck your valgrind output in a gist, to keep this issue manageable. Hope that's OK with you.)

inducer commented 3 years ago

I think I found a manageable reproducer without pytest that takes only a few seconds to run. I'm on f89f563, and I've applied this diff:

diff --git a/test/test_euler.py b/test/test_euler.py
index 4e86e45..8121a8b 100644
--- a/test/test_euler.py
+++ b/test/test_euler.py
@@ -882,3 +882,9 @@ def test_isentropic_vortex(actx_factory, order):
         eoc_rec.order_estimate() >= order - 0.5
         or eoc_rec.max_error() < 1e-11
     )
+
+
+if __name__ == "__main__":
+    import meshmode
+    for i in range(6):
+        test_inviscid_flux(meshmode._acf, 20, 3)

Using that (pocl 1.6, built from source), running

PYOPENCL_TEST=port:pthread mprof run -T 0.01 python test_euler.py 

gives me this plot out of memory-profiler's mprof plot:

[Plot: mprof memory usage with pocl's pthread device]

whereas using the Intel OpenCL runtime, running

PYOPENCL_TEST=int mprof run -T 0.01 python test_euler.py 

gives me this much more sensible plot: [Plot: mprof memory usage with the Intel OpenCL runtime]

inducer commented 3 years ago

Patching https://github.com/pocl/pocl/commit/a2d016c84d2034f43062d7f22b4874cfffe5c127 with this diff:

diff --git a/lib/CL/devices/basic/basic.c b/lib/CL/devices/basic/basic.c
index 5022970a..a912aced 100644
--- a/lib/CL/devices/basic/basic.c
+++ b/lib/CL/devices/basic/basic.c
@@ -212,7 +212,6 @@ pocl_basic_alloc_mem_obj (cl_device_id device, cl_mem mem_obj, void* host_ptr)
   void *b = NULL;
   cl_mem_flags flags = mem_obj->flags;
   unsigned i;
-  POCL_MSG_PRINT_MEMORY (" mem %p, dev %d\n", mem_obj, device->dev_id);
   /* check if some driver has already allocated memory for this mem_obj 
      in our global address space, and use that*/
   for (i = 0; i < mem_obj->context->num_devices; ++i)
@@ -231,6 +230,7 @@ pocl_basic_alloc_mem_obj (cl_device_id device, cl_mem mem_obj, void* host_ptr)
           return CL_SUCCESS;
         }
     }
+  POCL_MSG_PRINT_MEMORY (" mem_alloc %p, dev %d, size %d\n", mem_obj, device->dev_id, mem_obj->size);

   /* memory for this global memory is not yet allocated -> do it */
   if (flags & CL_MEM_USE_HOST_PTR)
@@ -241,7 +241,7 @@ pocl_basic_alloc_mem_obj (cl_device_id device, cl_mem mem_obj, void* host_ptr)
     }
   else
     {
-      POCL_MSG_PRINT_MEMORY ("!USE_HOST_PTR\n");
+      // POCL_MSG_PRINT_MEMORY ("!USE_HOST_PTR\n");
       b = pocl_aligned_malloc_global_mem (device, MAX_EXTENDED_ALIGNMENT,
                                           mem_obj->size);
       if (b==NULL)
@@ -273,6 +273,8 @@ pocl_basic_free (cl_device_id device, cl_mem memobj)
 {
   cl_mem_flags flags = memobj->flags;

+  POCL_MSG_PRINT_MEMORY (" mem_free %p, dev %d, size %d\n", memobj, device->dev_id, memobj->size);
+
   /* allocation owner executes freeing */
   if (flags & CL_MEM_USE_HOST_PTR ||
       memobj->shared_mem_allocation_owner != device)

and using this log reader on the resulting log gives this plot: [Plot: live pocl allocations over time]

So it seems that pocl is at least trying to return memory to the operating system...?
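(The log reader used above was a separate script, linked rather than shown. For illustration only, here is a minimal sketch of what such a reader could look like, given the mem_alloc/mem_free messages added in the diff; it tracks the running total of live pocl allocations from a POCL_DEBUG=memory log on stdin.)

# Minimal sketch of a reader for the mem_alloc/mem_free log lines added in
# the diff above. Not the script actually used in this thread.
import re
import sys

alloc_re = re.compile(r"mem_alloc (0x[0-9a-fA-F]+), dev \d+, size (\d+)")
free_re = re.compile(r"mem_free (0x[0-9a-fA-F]+), dev \d+, size (\d+)")

live = {}       # cl_mem pointer -> size of currently live allocation
total = 0
history = []    # running total after each alloc/free event

for line in sys.stdin:
    m = alloc_re.search(line)
    if m:
        addr, size = m.group(1), int(m.group(2))
        live[addr] = size
        total += size
    else:
        m = free_re.search(line)
        if not m:
            continue
        total -= live.pop(m.group(1), 0)
    history.append(total)

print(f"peak live pocl memory: {max(history, default=0) / 2**20:.1f} MiB")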

inducer commented 3 years ago

A potentially interesting finding that corroborates the observation that the memory consumption is system-dependent: It seems that this has everything to do with the implementation of malloc being used, and potentially with the fragmentation being introduced by the allocator.

This (i.e. using the excellent jemalloc):

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 POCL_DEBUG=memory PYOPENCL_TEST=port:pthread mprof run -T 0.01 python test_euler.py > out 2>&1 

gives:

[Plot: mprof memory usage with jemalloc preloaded]

With my libc's allocator by contrast, you can pretty much see the fragmentation happening. This is plotting the returned addresses vs. time:

[Plot: allocation addresses vs. time with the libc allocator]

Observe how the allocator starts returning memory in kind of a wild order and then the allocation addresses shoot higher and higher. This may be an interaction between the allocator and our allocation pattern: pocl allocating big chunks and small, administrative things in about equal proportions and the Python runtime allocating lots of tiny things. The libc allocator (glibc 2.31-9 from Debian in my case) may simply not be handling this load well. To get a memory usage log as clean as Intel CL, I suppose we could try to use a separate allocator for CL buffers.

jemalloc is liberally-licensed open-source and, conveniently, already in conda-forge for all platforms that we care about. It seems that building pocl against jemalloc might avoid our memory growth. jemalloc (by default) overrides malloc, which means it would need to be linked in by the main executable (python), and that's not happening here. We could build a feedstock of a jemalloc with a prefix (e.g. je_malloc) and use that in pocl, but that seems like overkill. I think simply LD_PRELOADing jemalloc for CI is probably OK. See #246.
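(To double-check that the LD_PRELOAD actually took effect, a minimal Linux-only sketch that looks for jemalloc among the libraries mapped into the current process:)

# Minimal sketch (Linux-only): check whether jemalloc is mapped into the
# current process, e.g. when launched under LD_PRELOAD.
def jemalloc_is_mapped():
    with open("/proc/self/maps") as f:
        return any("jemalloc" in line for line in f)


print("jemalloc mapped:", jemalloc_is_mapped())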

Bonus data point: Using the malloc implementation from Intel's TBB:

 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtbbmalloc_proxy.so.2 POCL_DEBUG=memory PYOPENCL_TEST=port:pthread mprof run -T 0.01 python test_euler.py > out 2>&1 

gives [Plot: mprof memory usage with TBB's malloc], i.e. an improvement over libc, but not nearly as good as jemalloc.

inducer commented 3 years ago

Because I was curious and making the plot was cheap: this is the allocation address pattern from jemalloc: [Plot: allocation addresses vs. time with jemalloc]

inducer commented 3 years ago

This is the plot for python -m pytest test_euler.py with jemalloc, which would previously leak like a sieve:

[Plot: mprof memory usage, python -m pytest test_euler.py with jemalloc]

MTCam commented 3 years ago

On my Intel (Core i7) Mac (Catalina 10.15.4), after:

conda install jemalloc
DYLD_INSERT_LIBRARIES=omitted/lib/libjemalloc.2.dylib mprof run pytest test_diffusion.py
mprof plot

[Plot: memory usage on macOS with jemalloc via DYLD_INSERT_LIBRARIES]

edit: I almost have to believe it wasn't using a different allocator here. This looks identical to the other tests I ran.

(#246)

inducer commented 3 years ago

edit: I almost have to believe it wasn't using a different allocator here. This looks identical to the other tests I ran.

I think you might be right.

Quoting from https://github.com/jemalloc/jemalloc/blob/041145c272711b55f91aa42128b108674a12fd91/INSTALL.md:

By default, the prefix is "", except on OS X, where it is "je_". On OS X, jemalloc overlays the default malloc zone, but makes no attempt to actually replace the "malloc", "calloc", etc. symbols.

AFAICT, the conda-forge feedstock does not override this. So it's not a surprise that this didn't do much on a Mac. We'll need a different solution, then. Could you try with libtbbmalloc_proxy.dylib from https://github.com/conda-forge/tbb-feedstock?

MTCam commented 3 years ago

Quoting from jemalloc/jemalloc@041145c/INSTALL.md:

By default, the prefix is "", except on OS X, where it is "je_". On OS X, jemalloc overlays the default malloc zone, but makes no attempt to actually replace the "malloc", "calloc", etc. symbols.

Heh, wut? :thinking:

MTCam commented 3 years ago

Could you try with libtbbmalloc_proxy.dylib from conda-forge/tbb-feedstock?

Definitely different!

[Screenshot: memory usage with libtbbmalloc_proxy on macOS, 2021-02-10]

edit: But at least something got returned?

inducer commented 3 years ago

Yikes! That looks worse.

majosm commented 3 years ago

I tried using the tracemalloc module to see what was being allocated during time stepping. Here are some results:

  • tracemalloc w/ zero RHS using DOFArray (code here)
  • tracemalloc w/ zero RHS using numpy (code here)

The DOFArray version definitely does more small memory allocations than the equivalent numpy code. The numpy version does not exhibit the memory growth issue.

A large number of allocations are actually coming from a deprecation warning in numpy that gets triggered by the pyopencl array constructor. Unfortunately, fixing this doesn't seem to help with the memory deallocation issue. Would it be possible/worthwhile to remove some of these other allocations? Not sure how much they would come into play in a lazy evaluation context.
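(For reference, the measurement amounts to roughly the following; a minimal sketch in which run_steps is a placeholder for the actual timestepping code from the test.)

# Minimal sketch of the tracemalloc measurement: snapshot Python-side
# allocations before and after a stepping loop and show the top growth
# sites. run_steps is a placeholder for the actual timestepping code.
import tracemalloc


def run_steps():
    pass  # placeholder: the zero-RHS timestepping loop would go here


tracemalloc.start()
before = tracemalloc.take_snapshot()

run_steps()

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)

# Note: this only sees allocations made through Python's allocator;
# buffers allocated inside pocl itself are invisible to it.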

inducer commented 3 years ago

Thanks! I've actually never used tracemalloc, so I found this very informative.

/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/site-packages/numpy/__init__.py:287: size=1286 KiB, count=25806, average=51 B
  File "/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/site-packages/numpy/__init__.py", line 287
    warnings.warn(msg, DeprecationWarning, stacklevel=2)

Kind of confused what these are. We're allocating 25806 numpy arrays? For what? Are those the object arrays? (but they're mighty big for that) Or is this the warning message? (Is the size= an aggregate measure?)

/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/warnings.py:405: size=906 KiB, count=12877, average=72 B
  File "/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/warnings.py", line 405
    self.message = message
/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/warnings.py:412: size=425 KiB, count=6500, average=67 B
  File "/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/warnings.py", line 412
    self._category_name = category.__name__ if category else None

What are these? Any idea?

Some more thoughts:

  • Does this see/instrument the memory allocations from pocl at all? (I'm not sure it does.) (If not: can we change that?)
  • How come numpy manages to do its entire time stepping with just four allocations?

majosm commented 3 years ago

Kind of confused what these are. We're allocating 25806 numpy arrays? For what? Are those the object arrays? (but they're mighty big for that) Or is this the warning message? (Is the size= an aggregate measure?)

What's being allocated is something related to the warning (not sure if it's the message, or the deprecation warning object, or what). It looks like the size is the sum of all the allocations.

What are these? Any idea?

Also seems to be related to the warnings. All of these go away if I replace the two instances of np.object with object in pyopencl/array.py.

  • Does this see/instrument the memory allocations from pocl at all? (I'm not sure it does.) (If not: can we change that?)

I don't think it does. Sounds like it just instruments the memory allocation code on the Python side.

  • How come numpy manages to do its entire time stepping with just four allocations?

From what I understand, tracemalloc only displays what's currently allocated at the time of the snapshot (and then only for objects that were allocated after tracemalloc.start()). Garbage collection complicates things a little. I'm going to try some more runs with garbage collection disabled so we can see everything that was allocated during stepping.

majosm commented 3 years ago

Hmm, I get the same 4 allocations with numpy after disabling GC. And the reported sizes for the 3 large allocations seem odd, too -- they're listed as averaging 416 KiB, which is 1/3 of a DOF array.

Here are some results for a single step (code here):

  • Without GC collect afterwards
  • With GC collect afterwards

MTCam commented 3 years ago

It seems that disabling warnings helps avoid the issue on my laptop:

pytest -W ignore test_diffusion.py
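(The same effect can be had programmatically, e.g. from a conftest.py; a minimal sketch that silences DeprecationWarnings globally. A narrower filter keyed on the specific numpy np.object deprecation message would also work.)

# Minimal sketch: equivalent of passing -W ignore::DeprecationWarning to
# pytest; could live in conftest.py.
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)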

[Screenshot: memory usage with pytest -W ignore, 2021-02-17]

inducer commented 3 years ago

Had an OpenCL-related theory for something that could be contributing (https://github.com/inducer/pyopencl/issues/450), but it turns out that's not it.

thomasgibson commented 3 years ago

Here are some recent updates (using test_diffusion.py in mirgecom), run on my local machine with:

DYLD_INSERT_LIBRARIES=omitted/lib/libjemalloc.2.dylib mprof run pytest test_diffusion.py

Main: [Plot: test_diffusion.py memory usage on main]

And here is the same experiment using new-world grudge (https://github.com/inducer/grudge/pull/74): [Plot: test_diffusion.py memory usage with new-world grudge]

thomasgibson commented 3 years ago

Same thing, but for the entire mirgecom test suite. Here's main: [Plot: full test suite memory usage on main]

And now with the revamped grudge: [Plot: full test suite memory usage with new-world grudge]