Open: MTCam opened this issue 3 years ago
Appears to be fixed by #211.
I'm seeing different behavior on my laptop:
This is with my old diffusion dev environment though. I'm currently re-running with a fresh emirge.
Hmmm, nope. No memory blowup with fresh emirge either:
Hmm... interesting. Are you both running the same thing and getting different results?
I'm running it again; it's just taking a long time. I'll update as soon as the data is in.
Seems like it. I'm running pytest with `-sv`, but I don't see why that would make a difference. I can try a run without just to make sure.
> Seems like it. I'm running pytest with `-sv`, but I don't see why that would make a difference. I can try a run without just to make sure.
A little redundant, but just wanted to make sure I run exactly the same thing. This was the diffusion test only on fresh emirge/master.
Which OpenCL runtime are you using?
We're both running with `<pyopencl.Device 'pthread' on 'Portable Computing Language'>`.
Our environments also look pretty similar:
diff --git a/conda_list_matt.txt b/conda_list_mike.txt
index e5d90c2..3d98384 100644
--- a/conda_list_matt.txt
+++ b/conda_list_mike.txt
@@ -1,4 +1,4 @@
#
# Name Version Build Channel
alabaster 0.7.12 pypi_0 pypi
@@ -73,9 +73,9 @@ pocl 1.6 h4841041_1 conda-forge
psutil 5.8.0 pypi_0 pypi
pudb 2020.1 pypi_0 pypi
py 1.10.0 pypi_0 pypi
-pybind11 2.6.2 pypi_0 pypi
pycodestyle 2.6.0 pypi_0 pypi
pyflakes 2.2.0 pypi_0 pypi
+pygments 2.7.4 pypi_0 pypi
pymbolic 2020.2 dev_0 <develop>
pymetis 2020.1 py38hb45949e_2 conda-forge
pyopencl 2021.1 py38h73a25cc_0 conda-forge
@@ -84,6 +84,7 @@ pytato 2020.1.1 dev_0 <develop>
pytest 6.2.2 pypi_0 pypi
pytest-pudb 0.7.0 pypi_0 pypi
python 3.8.6 h624753d_5_cpython conda-forge
+python-dateutil 2.8.1 pypi_0 pypi
python_abi 3.8 1_cp38 conda-forge
pytools 2021.1 pyhd3deb0d_0 conda-forge
pytz 2021.1 pypi_0 pypi
Not sure what's up with the pybind11 difference though.
> Not sure what's up with the pybind11 difference though.
That shouldn't matter. pybind11 is only needed when bindings are built from source. (i.e. it's not used at runtime by anything)
I'm seeing similar leaky behavior as @MTCam when I run on lassen. Even with the diffusion operator call commented out and the RHS set to 0, the memory usage still grows pretty quickly (black line below). Interestingly, if I comment out the timestepping loop, most of the growth disappears (blue line).
If I add more test cases (orders 4 and 5) it keeps growing:
Same if I instead increase the number of timesteps in each test (50 to 150):
Also interesting -- I changed the test to have it explicitly construct a `PyOpenCLArrayContext` so I could try using a different allocator. Here's the default allocator (black) vs. memory pool allocator (blue):
Wow. I would not have expected that. I wonder why the default allocator is so sluggish/dysfunctional in returning memory to the system.
> Wow. I would not have expected that. I wonder why the default allocator is so sluggish/dysfunctional in returning memory to the system.
Would it make sense to change the default allocator to the MemoryPool in array contexts? Other people have probably hit the same issue.
IMO no. If we do that, we won't know what the problem is, we'll just have papered over it (so it could come back any moment).
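For context on why the pool makes such a difference in those plots, here is a toy sketch of the pooling idea. This is not pyopencl's actual `MemoryPool` implementation (which bins by size class and manages device buffers); it just illustrates that freed blocks are cached and reused instead of going back through the underlying allocator on every timestep:

```python
class ToyMemoryPool:
    """Toy illustration (not pyopencl's MemoryPool): freed buffers are
    cached by size and handed back on the next same-size request, so
    repeated alloc/free cycles stop hitting the underlying allocator."""

    def __init__(self):
        self._free_bins = {}     # size -> list of cached buffers
        self.fresh_allocs = 0    # how often we fell through to a real allocation

    def allocate(self, size):
        cached = self._free_bins.get(size)
        if cached:
            return cached.pop()
        self.fresh_allocs += 1
        return bytearray(size)   # stands in for a device buffer

    def free(self, buf):
        self._free_bins.setdefault(len(buf), []).append(buf)


pool = ToyMemoryPool()
for _ in range(1000):            # e.g. one allocation per timestep
    buf = pool.allocate(4096)
    pool.free(buf)
print(pool.fresh_allocs)         # → 1
```

With a pool, the allocator underneath (libc malloc, here) sees one allocation instead of a thousand, which also sidesteps any fragmentation or memory-return pathologies it might have.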
Here's some valgrind output. It measures a single execution of `test_diffusion_accuracy`, modified to return 0 for the RHS instead of calling `diffusion_operator`. The spot where the memory usage jumps up appears to be the timestepping loop (because I can vary the peak memory usage by changing the number of steps).
https://gist.github.com/inducer/2956323608953cbf7b3246fe5a426d33
(@majosm: Stuck your valgrind output in a gist, to keep this issue manageable. Hope that's OK with you.)
I think I found a manageable reproducer without pytest that takes only a few seconds to run. I'm on f89f563, and I've applied this diff:
diff --git a/test/test_euler.py b/test/test_euler.py
index 4e86e45..8121a8b 100644
--- a/test/test_euler.py
+++ b/test/test_euler.py
@@ -882,3 +882,9 @@ def test_isentropic_vortex(actx_factory, order):
eoc_rec.order_estimate() >= order - 0.5
or eoc_rec.max_error() < 1e-11
)
+
+
+if __name__ == "__main__":
+ import meshmode
+ for i in range(6):
+ test_inviscid_flux(meshmode._acf, 20, 3)
Using that (pocl 1.6, built from source), running
PYOPENCL_TEST=port:pthread mprof run -T 0.01 python test_euler.py
gives me this plot from memory-profiler's `mprof plot`:
whereas using the Intel OpenCL runtime, running
PYOPENCL_TEST=int mprof run -T 0.01 python test_euler.py
gives me this much more sensible plot:
Patching https://github.com/pocl/pocl/commit/a2d016c84d2034f43062d7f22b4874cfffe5c127 with this diff:
diff --git a/lib/CL/devices/basic/basic.c b/lib/CL/devices/basic/basic.c
index 5022970a..a912aced 100644
--- a/lib/CL/devices/basic/basic.c
+++ b/lib/CL/devices/basic/basic.c
@@ -212,7 +212,6 @@ pocl_basic_alloc_mem_obj (cl_device_id device, cl_mem mem_obj, void* host_ptr)
void *b = NULL;
cl_mem_flags flags = mem_obj->flags;
unsigned i;
- POCL_MSG_PRINT_MEMORY (" mem %p, dev %d\n", mem_obj, device->dev_id);
/* check if some driver has already allocated memory for this mem_obj
in our global address space, and use that*/
for (i = 0; i < mem_obj->context->num_devices; ++i)
@@ -231,6 +230,7 @@ pocl_basic_alloc_mem_obj (cl_device_id device, cl_mem mem_obj, void* host_ptr)
return CL_SUCCESS;
}
}
+ POCL_MSG_PRINT_MEMORY (" mem_alloc %p, dev %d, size %d\n", mem_obj, device->dev_id, mem_obj->size);
/* memory for this global memory is not yet allocated -> do it */
if (flags & CL_MEM_USE_HOST_PTR)
@@ -241,7 +241,7 @@ pocl_basic_alloc_mem_obj (cl_device_id device, cl_mem mem_obj, void* host_ptr)
}
else
{
- POCL_MSG_PRINT_MEMORY ("!USE_HOST_PTR\n");
+ // POCL_MSG_PRINT_MEMORY ("!USE_HOST_PTR\n");
b = pocl_aligned_malloc_global_mem (device, MAX_EXTENDED_ALIGNMENT,
mem_obj->size);
if (b==NULL)
@@ -273,6 +273,8 @@ pocl_basic_free (cl_device_id device, cl_mem memobj)
{
cl_mem_flags flags = memobj->flags;
+ POCL_MSG_PRINT_MEMORY (" mem_free %p, dev %d, size %d\n", memobj, device->dev_id, memobj->size);
+
/* allocation owner executes freeing */
if (flags & CL_MEM_USE_HOST_PTR ||
memobj->shared_mem_allocation_owner != device)
and using this log reader on the resulting log gives this plot:
So it seems that pocl is at least trying to return memory to the operating system...?
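A log reader along those lines might look like the following. This is a sketch, not the linked script: the regular expression follows the `mem_alloc`/`mem_free` print statements in the patch above, and `running_usage` is a made-up helper name:

```python
import re

# Matches the log lines emitted by the patched pocl above, e.g.
#   " mem_alloc 0x55d2c0a1b2c0, dev 0, size 1048576"
#   " mem_free 0x55d2c0a1b2c0, dev 0, size 1048576"
LOG_RE = re.compile(r"mem_(alloc|free) (\S+), dev (\d+), size (\d+)")

def running_usage(lines):
    """Return the running total of allocated bytes after each alloc/free event."""
    total = 0
    series = []
    for line in lines:
        m = LOG_RE.search(line)
        if m is None:
            continue  # skip unrelated debug output
        kind, _addr, _dev, size = m.groups()
        total += int(size) if kind == "alloc" else -int(size)
        series.append(total)
    return series

sample = [
    " mem_alloc 0x1000, dev 0, size 100",
    " mem_alloc 0x2000, dev 0, size 50",
    " mem_free 0x1000, dev 0, size 100",
]
print(running_usage(sample))   # → [100, 150, 50]
```

Plotting that series against event index (or a timestamp, if one is added to the prints) gives the pocl-side view of memory usage, which can then be compared against what mprof sees at the process level.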
A potentially interesting finding that corroborates the observation that the memory consumption is system-dependent: it seems that this has everything to do with the implementation of `malloc` being used, and potentially with the fragmentation introduced by the allocator.
This (i.e. using the excellent jemalloc):
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 POCL_DEBUG=memory PYOPENCL_TEST=port:pthread mprof run -T 0.01 python test_euler.py > out 2>&1
gives:
With my libc's allocator by contrast, you can pretty much see the fragmentation happening. This is plotting the returned addresses vs. time:
Observe how the allocator starts returning memory in kind of a wild order and then the allocation addresses shoot higher and higher. This may be an interaction between the allocator and our allocation pattern: pocl allocating big chunks and small, administrative things in about equal proportions and the Python runtime allocating lots of tiny things. The libc allocator (glibc 2.31-9 from Debian in my case) may simply not be handling this load well. To get a memory usage log as clean as Intel CL, I suppose we could try to use a separate allocator for CL buffers.
jemalloc is liberally-licensed open source and, conveniently, already in conda-forge for all platforms that we care about. It seems that building pocl against jemalloc might avoid our memory growth. jemalloc (by default) overrides `malloc`, which means that it needs to be linked in by the main executable (python), i.e. not happening. We could build a feedstock of jemalloc with a prefix (e.g. `je_malloc`) and use that in pocl, but that seems like overkill. I think simply `LD_PRELOAD`ing jemalloc for CI is probably OK. See #246.
Bonus data point: Using the malloc implementation from Intel's TBB:
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtbbmalloc_proxy.so.2 POCL_DEBUG=memory PYOPENCL_TEST=port:pthread mprof run -T 0.01 python test_euler.py > out 2>&1
gives an improvement over libc, but not nearly as good as jemalloc:
Because I was curious and making the plot was cheap: This is the allocation address pattern from jemalloc:
This is the plot for `python -m pytest test_euler.py` with jemalloc, which would previously leak like a sieve:
On my Intel (Core i7) Mac (Catalina 10.15.4), after running:
conda install jemalloc
DYLD_INSERT_LIBRARIES=omitted/lib/libjemalloc.2.dylib mprof run pytest test_diffusion.py
mprof plot
edit: I almost have to believe it wasn't using a different allocator here. This looks identical to the other tests I ran.
(#246)
> edit: I almost have to believe it wasn't using a different allocator here. This looks identical to the other tests I ran.
I think you might be right.
Quoting from https://github.com/jemalloc/jemalloc/blob/041145c272711b55f91aa42128b108674a12fd91/INSTALL.md:
By default, the prefix is "", except on OS X, where it is "je_". On OS X, jemalloc overlays the default malloc zone, but makes no attempt to actually replace the "malloc", "calloc", etc. symbols.
AFAICT, the conda-forge feedstock does not override this. So it's not a surprise that this didn't do much on a Mac. We'll need a different solution then. Could you try with `libtbbmalloc_proxy.dylib` from https://github.com/conda-forge/tbb-feedstock?
> Quoting from jemalloc/jemalloc@041145c/INSTALL.md: By default, the prefix is "", except on OS X, where it is "je_". On OS X, jemalloc overlays the default malloc zone, but makes no attempt to actually replace the "malloc", "calloc", etc. symbols.
Heh, wut? :thinking:
> Could you try with `libtbbmalloc_proxy.dylib` from conda-forge/tbb-feedstock?
Definitely different!
edit: But at least something got returned?
Yikes! That looks worse.
I tried using the `tracemalloc` module to see what was being allocated during time stepping. Here are some results:
- tracemalloc w/ zero RHS using DOFArray (code here)
- tracemalloc w/ zero RHS using numpy (code here)
The DOFArray version definitely does more small memory allocations than the equivalent numpy code. The numpy version does not exhibit the memory growth issue.
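For anyone wanting to reproduce this kind of measurement, the basic `tracemalloc` recipe is small. This is a generic sketch, not the instrumented test itself; the bytearray loop stands in for the time stepping being measured:

```python
import tracemalloc

tracemalloc.start()

# stand-in for the time stepping loop being measured
state = [bytearray(1024) for _ in range(100)]

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    # each stat aggregates all live allocations traced to one source line:
    # size is the total in bytes, count the number of blocks
    print(stat)
```

The output lines look like the `numpy/__init__.py:287: size=... count=... average=...` entries quoted below, grouped by the source line that performed the allocation.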
A large number of allocations are actually coming from a deprecation warning in numpy that gets triggered by the pyopencl array constructor. Unfortunately, fixing this doesn't seem to help with the memory deallocation issue. Would it be possible/worthwhile to remove some of these other allocations? Not sure how much they would come into play in a lazy evaluation context.
Thanks! I've actually never used tracemalloc, so I found this very informative.
/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/site-packages/numpy/__init__.py:287: size=1286 KiB, count=25806, average=51 B
File "/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/site-packages/numpy/__init__.py", line 287
warnings.warn(msg, DeprecationWarning, stacklevel=2)
Kind of confused what these are. We're allocating 25806 numpy arrays? For what? Are those the object arrays? (but they're mighty big for that) Or is this the warning message? (Is the `size=` an aggregate measure?)
/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/warnings.py:405: size=906 KiB, count=12877, average=72 B
File "/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/warnings.py", line 405
self.message = message
/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/warnings.py:412: size=425 KiB, count=6500, average=67 B
File "/path/to/misc/tools/miniconda3/envs/emirge-diffusion-memory-issue/lib/python3.8/warnings.py", line 412
self._category_name = category.__name__ if category else None
What are these? Any idea?
Some more thoughts:
> Kind of confused what these are. We're allocating 25806 numpy arrays? For what? Are those the object arrays? (but they're mighty big for that) Or is this the warning message? (Is the `size=` an aggregate measure?)
What's being allocated is something related to the warning (not sure if it's the message, or the deprecation warning object, or what). It looks like the size is the sum of all the allocations.
> What are these? Any idea?
Also seems to be related to the warnings. All of these go away if I replace the two instances of `np.object` with `object` in pyopencl/array.py.
> Does this see/instrument the memory allocations from pocl at all? (I'm not sure it does.) (If not: can we change that?)
I don't think it does. Sounds like it just instruments the memory allocation code on the Python side.
> How come numpy manages to do its entire time stepping with just four allocations?
From what I understand, tracemalloc only displays what's currently allocated at the time of the snapshot (and also only out of those objects that were allocated after `tracemalloc.start()`). Garbage collection complicates things a little. I'm going to try some more runs with garbage collection disabled so we can see everything that was allocated during stepping.
Hmm, I get the same 4 allocations with numpy after disabling GC. And the reported sizes for the 3 large allocations seem odd, too -- they're listed as averaging 416 KiB, which is 1/3 of a DOF array.
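A sketch of that kind of run: comparing snapshots taken around a "step" with the collector disabled, so objects that become garbage during the step still show up in the diff. The bytearray loop is a stand-in for one timestep's worth of work:

```python
import gc
import tracemalloc

gc.disable()            # keep cyclic garbage visible between snapshots
tracemalloc.start()
before = tracemalloc.take_snapshot()

# stand-in for one timestep's worth of work
temporaries = [bytearray(4096) for _ in range(50)]

after = tracemalloc.take_snapshot()
gc.enable()

# positive size_diff entries are allocations that appeared during the "step"
for stat in after.compare_to(before, "lineno")[:5]:
    print(stat)
```

`Snapshot.compare_to` reports per-line `size_diff`/`count_diff`, which makes it easier to attribute growth to the stepping loop specifically rather than to startup allocations.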
Here are some results for a single step (code here):
It seems that disabling warnings helps avoid the issue on my laptop:
pytest -W ignore test_diffusion.py
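Consistent with that, the per-warning allocations can be seen (and suppressed) directly with the stdlib warnings machinery. A small sketch; the warning text is just a placeholder for the numpy deprecation message:

```python
import warnings

def emit_deprecation_warnings(n):
    for _ in range(n):
        warnings.warn("np.object is deprecated", DeprecationWarning)

# "always": every call creates and records a new WarningMessage object
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", DeprecationWarning)
    emit_deprecation_warnings(3)
print(len(caught))   # → 3

# "ignore" (what pytest -W ignore installs): nothing is recorded
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore", DeprecationWarning)
    emit_deprecation_warnings(3)
print(len(caught))   # → 0
```

With an "ignore" filter the message and category objects that tracemalloc flagged in warnings.py are never constructed, which matches the observation that `-W ignore` avoids the growth.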
Had an OpenCL-related theory for something that could be contributing: https://github.com/inducer/pyopencl/issues/450
Turns out that's not it.
Here are some recent updates (using `test_diffusion.py` in mirgecom), run on my local machine:
DYLD_INSERT_LIBRARIES=omitted/lib/libjemalloc.2.dylib mprof run pytest test_diffusion.py
Main:
And here is the same experiment using new-world grudge (https://github.com/inducer/grudge/pull/74):
Same thing, but for the entire mirgecom test suite. Here's main:
And now with the revamped grudge:
This diffusion test that uses over-integration seems to require a great deal of memory in CI, causing a subsequent crash of `pytest` without error messages in #95. Here's a plot of the memory usage of `pytest` with and without the diffusion accuracy test. The blue data has the diffusion accuracy test disabled. When the diffusion accuracy test is disabled, CI in #95 runs to completion with all tests passing. When the diffusion test is enabled, `pytest` crashes inexplicably in the next suite/file.