Unlike for mirgecom, jemalloc doesn't seem to work wonders for pytential. This is memory use according to memory-profiler for a sequential run of `python -m pytest`:
I'm glad it's enough to unbreak the CI (#56, #58), but the problem doesn't appear to be totally solved.
Using the machinery I made for mirgecom, I'll try to do a running total of CL buffer memory allocated and see if there's any growth there.
jemalloc seems to have some knobs to make it release memory more aggressively. There's a `dirty_decay_ms` and a `muzzy_decay_ms`. Should be easy to see what they do by setting

```sh
export MALLOC_CONF='dirty_decay_ms:5000,muzzy_decay_ms:5000'
```

(the default is 10000, i.e. 10s; smaller means it releases memory more aggressively).
Found this pretty useful too, but no idea how up to date it is.
Using this method for instrumenting memory allocation in pocl, I get the following plot of allocated memory (by pocl) over time:
This seems to support the assertion that, as with mirgecom, the underlying cause for the growth of allocated memory might be memory fragmentation, and, as a result, the allocator requesting more and more memory from the OS. In particular, it appears that, at least from the point of view of OpenCL, all memory that is allocated is also freed.
For reproducibility:

```sh
PYOPENCL_TEST=port:pthread POCL_DEBUG=memory ~/shared/with-my-pocl python -m pytest --tb=native -rxsw --capture=no . 2>out
```

run with 48efe3b and pocl a2d016c8.
Interestingly, replacing pocl's allocator for global memory with jemalloc doesn't seem to do much:
If anything, it releases even less memory. Maybe @alexfikl is right, and this comes down to knob twiddling.
For reproducibility:

```sh
./autogen.sh --prefix=$HOME/shared/pack/jemalloc-build --with-jemalloc-prefix=jemalloc_ --disable-cxx --with-install-suffix=_prefixed --disable-initial-exec-tls
PYOPENCL_TEST=port:pthread mprof run ~/shared/with-my-pocl python -m pytest --tb=native -rxsw
```

where `with-my-pocl` is this script.
As for things that are cheap to do, I discovered another allocator to try.
mimalloc (used globally) is not an improvement:
Tried jemalloc with the flags given by @alexfikl:

```sh
MALLOC_CONF='dirty_decay_ms:5000,muzzy_decay_ms:5000' PYOPENCL_TEST=port:pthread LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so mprof run python -m pytest --tb=native -rxsw
```
Not really all that different from glibc or vanilla jemalloc:
Just to add more random data to this, I ran

```sh
mprof run --python -T 0.01 pytest test/test_cost_model.py test/test_scalar_int_eq.py test/test_global_qbx.py test/test_target_specific_qbx.py
```

This only runs four of the test files, not the whole suite, so it should probably be taken with a grain of salt.
The different lines are:

- `export MALLOC_TRIM_THRESHOLD_=1024` (the default is 128 * 1024).
- `export MALLOC_CONF='abort:true,dirty_decay_ms:1000,muzzy_decay_ms:1000'`, i.e. a bit more aggressive than setting the decays to 5000.

It does seem like changing some options will not have a big effect vs. default jemalloc :\
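For concreteness, these settings were presumably combined with the `mprof` command above roughly as follows (a reconstructed sketch, not copied verbatim; the jemalloc run assumes it is loaded via `LD_PRELOAD` as in the earlier comment):

```sh
# glibc: return free()d memory to the OS more eagerly
# (the default trim threshold is 128 * 1024 bytes)
MALLOC_TRIM_THRESHOLD_=1024 \
    mprof run --python -T 0.01 pytest test/test_cost_model.py test/test_scalar_int_eq.py \
    test/test_global_qbx.py test/test_target_specific_qbx.py

# jemalloc: purge dirty/muzzy pages after 1s instead of the 10s default
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so \
MALLOC_CONF='abort:true,dirty_decay_ms:1000,muzzy_decay_ms:1000' \
    mprof run --python -T 0.01 pytest test/test_cost_model.py test/test_scalar_int_eq.py \
    test/test_global_qbx.py test/test_target_specific_qbx.py
```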
I've also tried adding

```python
def teardown_function(func):
    import gc
    gc.collect()

    import ctypes
    libc = ctypes.CDLL("libc.so.6")
    # ask glibc to return freed arena memory to the OS
    libc.malloc_trim(0)
```

at the top of those test files to force them to free as much memory as possible, but it didn't seem to improve much. However, it does seem to deallocate memory towards the end there, so it may not continue growing as all the tests run.
Went ahead and ran the whole test suite with the snippet from the previous comment added to all files.

```python
def teardown_function(func):
    import gc
    gc.collect()

    import ctypes
    libc = ctypes.CDLL("libc.so.6")
    libc.malloc_trim(0)
```
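(Side note for reproducibility: instead of pasting that snippet into every test file, the same effect can presumably be had once, from a `conftest.py` at the test-suite root. A minimal sketch, assuming plain pytest; this is not what was actually run:)

```python
# conftest.py -- after every test, collect garbage and ask glibc to return
# freed arena memory to the OS.
import ctypes
import gc

import pytest


@pytest.fixture(autouse=True)
def _trim_memory_after_each_test():
    yield
    gc.collect()
    libc = ctypes.CDLL("libc.so.6")  # glibc-specific
    libc.malloc_trim(0)
```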
Thanks! One additional bit of information: The Python GC already gets run by pyopencl (I think) between tests.
> Thanks! One additional bit of information: The Python GC already gets run by pyopencl.

Ah, that explains why I didn't see much of a difference when removing the `collect` at some point!
Had an OpenCL-related theory for something that could be contributing: https://github.com/inducer/pyopencl/issues/450
`python test_layer_pot.py` with this diff

```diff
diff --git a/test/test_layer_pot.py b/test/test_layer_pot.py
index 409eede..a67bb78 100644
--- a/test/test_layer_pot.py
+++ b/test/test_layer_pot.py
@@ -614,15 +614,17 @@ def test_3d_jump_relations(actx_factory, relation, visualize=False):
 # }}}
+def run_em_all():
+    import gc
+    for i in range(10):
+        test_off_surface_eval(_acf, True)
+        gc.collect()
+
+
 # You can test individual routines by typing
 # $ python test_layer_pot.py 'test_routine()'
 if __name__ == "__main__":
-    import sys
-    if len(sys.argv) > 1:
-        exec(sys.argv[1])
-    else:
-        from pytest import main
-        main([__file__])
+    run_em_all()
 # vim: fdm=marker
```
seems to be a microcosm of this issue that runs in manageable time. Memory usage goes up in a straight line, when... there's no reason for that, IMO:
The allocations seem to scale with the size of the problem (in this case, set by changing `npoints=3000` on line 116), indicating that it's at least some amount of CL memory that's going unfreed. Interestingly, some deallocations do turn up in the plot at that size, but there's still a big bunch of leakage:
I'm using the code in this diff to hunt the leaks. One (possibly minor?) contributor to this appears to be `ProcessLogger`, which spawns a bunch of threads whose resources appear to never be reaped.
That was a reference cycle, but not a significant contributor: https://github.com/inducer/pytools/pull/181.
Beyond that, it's mostly a bunch of pyopencl "stuff" that stays alive:
<class 'list'>, <class 'list'>, <class 'set'>, <class 'pyopencl._cl.Context'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'dict'>, <class 'dict'>, <class 'frozenset'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'dict'>, <class 'frozenset'>, <class 'dict'>, <class 'frozenset'>, <class 'dict'>, <class 'dict'>, <class 'tuple'>, <class 'pyopencl.scan.GenericScanKernel'>, <class 'list'>, <class 'pyopencl._cl.Device'>, <class 'list'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl.scan._BuiltScanKernelInfo'>, <class 'pyopencl.scan._BuiltScanKernelInfo'>, <class 'pyopencl.scan._BuiltFinalUpdateKernelInfo'>, <class 'tuple'>, <class 'pyopencl.elementwise.ElementwiseKernel'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.ScalarArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.ScalarArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl._cl.Platform'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, <class 'dict'>, <class 'tuple'>, <class 'tuple'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'tuple'>, <class 'tuple'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'tuple'>, <class 'tuple'>, <class 'tuple'>, <class 'pyopencl.elementwise.ElementwiseKernel'>, <class 'tuple'>, <class 'pyopencl.elementwise.ElementwiseKernel'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl.reduction.ReductionKernel'>, <class 'tuple'>, <class 'tuple'>, <class 'list'>, <class 'tuple'>, <class 'list'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl.reduction.ReductionKernel'>, <class 'tuple'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.ScalarArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, ...`
Hm. Calling `pyopencl.tools.clear_first_arg_caches` in each loop makes most of those go away (along with substantially flattening the `mprof` curve), so this is probably as expected. One thought is that running pytest in parallel may mean that `pyopencl.tools.clear_first_arg_caches` is not getting called as intended. (from here)
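Concretely, that amounts to something like this inside the `run_em_all` loop from the diff above (a sketch):

```python
def run_em_all():
    import gc

    import pyopencl.tools as cl_tools

    for _ in range(10):
        test_off_surface_eval(_acf, True)
        # Drop pyopencl's per-context ("first-arg") caches of kernels etc.,
        # then let the GC take care of what is now unreferenced.
        cl_tools.clear_first_arg_caches()
        gc.collect()
```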
Irritatingly, even with zero growth of Python-side object allocation, the `test_off_surface_eval` loop still manages to lose about 20 MB over 30 runs.
OK, it's flat after switching to jemalloc, so that's probably not our problem. Switching to the interaction of pytest with PyOpenCL's first-arg caches as the prime suspect.
Well. So I removed that `return` from the middle of `test_off_surface_eval`, and suddenly we're leaking a bunch of loopy stuff. This may just be the caches though, will keep digging.
Observations:

- `LoopKernel`s show up in `new_objects` in each iteration. They're all for the FMM translations. Why are they there?
- `clear_first_arg_caches` is getting called, from here.

https://github.com/inducer/loopy/pull/797 (plus small changes across the stack) should make it so that loopy caches don't keep contexts alive.
Somewhat depressingly, https://github.com/inducer/loopy/pull/797 doesn't seem to have made a decisive difference. Blue is with the changes, black is before.
Been on the hunt for other caches that keep things alive. They contribute, but they're all small potatoes:
I've also printed lists of Python objects that, according to `gc`, survive each test. Here's a flavor:
138590, 62 diff, 119 new
[(<class 'builtin_function_or_method'>, 16), (<class 'weakref.ReferenceType'>, 15), (<class 'list'>, 14), (<class 'dict'>, 14), (<class 'cell'>, 10), (<class 'tuple'>, 8), (<class 'generator'>, 6), (<class 'method'>, 5), (<class '_pytest.terminal.WarningReport'>, 4), (<class 'warnings.WarningMessage'>, 4), (<class 'DeprecationWarning'>, 4), (<class '_pytest.reports.TestReport'>, 3), (<class 'list_reverseiterator'>, 2), (<class 'function'>, 2), (<class 'pytools.persistent_dict._LinkedList'>, 2), (<class '_pytest.logging.catching_logs'>, 2), (<class 'set'>, 1), (<class 'functools.partial'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1), (<class '_pytest.unraisableexception.catch_unraisable_exception'>, 1), (<class '_io.StringIO'>, 1), (<class 'pytools.persistent_dict._LRUCache'>, 1), (<class 'contextlib._GeneratorContextManager'>, 1)]
.
138728, 138 diff, 195 new
[(<class 'function'>, 39), (<class 'tuple'>, 31), (<class 'dict'>, 25), (<class 'cell'>, 22), (<class 'list'>, 11), (<class 'method'>, 8), (<class 'property'>, 8), (<class 'generator'>, 6), (<class 'pytools.persistent_dict._LinkedList'>, 6), (<class '_pytest.terminal.WarningReport'>, 4), (<class 'pytools.persistent_dict._LRUCache'>, 4), (<class 'type'>, 4), (<class 'weakref.ReferenceType'>, 4), (<class 'getset_descriptor'>, 4), (<class '_pytest.reports.TestReport'>, 2), (<class 'set'>, 1), (<class 'builtin_function_or_method'>, 1), (<class 'functools.partial'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class 'warnings.catch_warnings'>, 1), (<class 'warnings.WarningMessage'>, 1), (<class 'list_reverseiterator'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1), (<class '_pytest.unraisableexception.catch_unraisable_exception'>, 1), (<class '_io.StringIO'>, 1), (<class 'module'>, 1), (<class 'logging.Logger'>, 1), (<class '_frozen_importlib_external.SourceFileLoader'>, 1), (<class '_frozen_importlib.ModuleSpec'>, 1), (<class 'DeprecationWarning'>, 1)]
l_inf error: 0.0025745079426662495
.
138820, 92 diff, 157 new
[(<class 'list'>, 49), (<class 'tuple'>, 20), (<class 'dict'>, 15), (<class 'cell'>, 13), (<class 'method'>, 10), (<class 'pytools.persistent_dict._LinkedList'>, 7), (<class 'pytools.persistent_dict._LRUCache'>, 5), (<class 'pymbolic.primitives.Variable'>, 4), (<class 'function'>, 3), (<class 'pymbolic.primitives.Product'>, 3), (<class 'set'>, 2), (<class 'list_reverseiterator'>, 2), (<class 'contextlib._GeneratorContextManager'>, 2), (<class 'generator'>, 2), (<class '_io.StringIO'>, 2), (<class '_pytest.logging.catching_logs'>, 2), (<class 'pymbolic.primitives.Call'>, 2), (<class 'pymbolic.primitives.Subscript'>, 2), (<class '_pytest.terminal.WarningReport'>, 1), (<class 'functools.partial'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class 'warnings.catch_warnings'>, 1), (<class 'sumpy.expansion.LinearPDEBasedExpansionTermsWrangler'>, 1), (<class 'sumpy.expansion.CSEMatVecOperator'>, 1), (<class '_pytest.reports.TestReport'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1), (<class '_pytest.unraisableexception.catch_unraisable_exception'>, 1), (<class 'sumpy.kernel.LaplaceKernel'>, 1), (<class 'pymbolic.primitives.Quotient'>, 1), (<class 'pymbolic.primitives.Sum'>, 1)]
.
139330, 510 diff, 568 new
[(<class 'tuple'>, 83), (<class 'function'>, 79), (<class 'dict'>, 64), (<class 'weakref.ReferenceType'>, 40), (<class 'cell'>, 36), (<class 'builtin_function_or_method'>, 31), (<class 'list'>, 28), (<class 'getset_descriptor'>, 20), (<class 'wrapper_descriptor'>, 13), (<class 'method_descriptor'>, 13), (<class 'pymbolic.primitives.Variable'>, 12), (<class 'set'>, 11), (<class 'module'>, 10), (<class 'type'>, 10), (<class '_frozen_importlib.ModuleSpec'>, 10), (<class 'method'>, 9), (<class '_frozen_importlib_external.SourceFileLoader'>, 8), (<class 'frozenset'>, 7), (<class 'pyopencl.tools._SimpleTextTemplate'>, 6), (<class 'pytools.log_process'>, 6), (<class 'property'>, 5), (<class 'pytential.qbx.target_specific.impl.Enum'>, 5), (<class 'pytools.persistent_dict._LinkedList'>, 5), (<class 'pymbolic.primitives.Call'>, 4), (<class 'pymbolic.primitives.Subscript'>, 4), (<class 'warnings.WarningMessage'>, 4), (<class 'DeprecationWarning'>, 4), (<class 'generator'>, 3), (<class 'staticmethod'>, 3), (<class 'pytools.persistent_dict._LRUCache'>, 3)]
.
139453, 123 diff, 178 new
[(<class 'builtin_function_or_method'>, 33), (<class 'weakref.ReferenceType'>, 32), (<class 'tuple'>, 18), (<class 'dict'>, 14), (<class 'list'>, 12), (<class 'method'>, 9), (<class 'pymbolic.primitives.Variable'>, 6), (<class 'pytools.persistent_dict._LinkedList'>, 6), (<class 'generator'>, 5), (<class 'warnings.WarningMessage'>, 4), (<class 'DeprecationWarning'>, 4), (<class 'set'>, 3), (<class '_pytest.terminal.WarningReport'>, 3), (<class 'pytools.persistent_dict._LRUCache'>, 3), (<class '_pytest.reports.TestReport'>, 2), (<class 'function'>, 2), (<class 'contextlib._GeneratorContextManager'>, 2), (<class 'cell'>, 2), (<class 'pymbolic.primitives.Call'>, 2), (<class 'frozenset'>, 2), (<class 'pymbolic.primitives.Subscript'>, 2), (<class '_io.StringIO'>, 2), (<class '_pytest.logging.catching_logs'>, 2), (<class 'functools.partial'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class 'warnings.catch_warnings'>, 1), (<class 'pymbolic.primitives.Sum'>, 1), (<class 'pymbolic.primitives.Product'>, 1), (<class 'list_reverseiterator'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1)]
.
139482, 29 diff, 94 new
[(<class 'list'>, 15), (<class 'dict'>, 11), (<class 'cell'>, 10), (<class 'method'>, 9), (<class 'tuple'>, 8), (<class 'pytools.persistent_dict._LinkedList'>, 6), (<class '_pytest.terminal.WarningReport'>, 4), (<class 'warnings.WarningMessage'>, 4), (<class 'DeprecationWarning'>, 4), (<class '_pytest.reports.TestReport'>, 3), (<class 'generator'>, 3), (<class 'pytools.persistent_dict._LRUCache'>, 3), (<class 'list_reverseiterator'>, 2), (<class 'contextlib._GeneratorContextManager'>, 2), (<class '_io.StringIO'>, 2), (<class '_pytest.logging.catching_logs'>, 2), (<class 'set'>, 1), (<class 'functools.partial'>, 1), (<class 'function'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1), (<class '_pytest.unraisableexception.catch_unraisable_exception'>, 1)]
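For reference, per-test type counts like the ones above can be gathered with a small `gc`-based helper along these lines (a sketch of the technique, not the exact instrumentation used; `test_off_surface_eval`/`_acf` stand in for whatever the loop actually runs):

```python
# Sketch: report the types of objects that survive each iteration, per gc.
import gc
from collections import Counter


def report_survivors(prev_ids, top=30):
    gc.collect()
    objs = gc.get_objects()
    new = [o for o in objs if id(o) not in prev_ids]
    print(f"{len(objs)}, {len(objs) - len(prev_ids)} diff, {len(new)} new")
    print(Counter(type(o) for o in new).most_common(top))
    return {id(o) for o in objs}


ids = report_survivors(set())
for _ in range(10):
    test_off_surface_eval(_acf, True)
    ids = report_survivors(ids)
```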
I'm increasingly coming to the conclusion that these leaks might not be Python-visible---these objects are unlikely to explain hundreds of megabytes, I think.
Further confirmation of this is that `tracemalloc` (what a nice tool!) seems to see none of the offending allocations.
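(The `tracemalloc` check was along these lines; again a sketch, with `test_off_surface_eval` standing in for the loop body:)

```python
# Sketch: compare Python-level allocations between iterations with tracemalloc.
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation
baseline = tracemalloc.take_snapshot()

for _ in range(10):
    test_off_surface_eval(_acf, True)
    snapshot = tracemalloc.take_snapshot()
    # Top 10 allocation sites by growth since the previous iteration
    for stat in snapshot.compare_to(baseline, "lineno")[:10]:
        print(stat)
    baseline = snapshot
```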
Can you try with `SUMPY_FFT_BACKEND=loopy`?
Thanks for following along! :grinning:
`SUMPY_FFT_BACKEND=loopy` takes longer, but otherwise doesn't seem to make a big difference (it's the black curve):
Does something like https://github.com/bloomberg/memray give additional useful information? I've never actually tried it..
EDIT: Possibly even https://github.com/bloomberg/pytest-memray?
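(Untested, but if the pytest plugin works here, the invocation would presumably be along these lines; the `--memray` flag is an assumption on my part:)

```sh
pip install pytest-memray
python -m pytest --memray test/
```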
I was trying https://github.com/plasma-umass/scalene just now, but it seemed to incur fairly significant slowdown? Not sure. I guess I'll switch to memray for a bit?
I've already cut pytest out of the picture and written some code to call the tests manually, to make sure pytest isn't at fault. (Yes, I'm that deep into conspiracy territory.) Memray just crashed initially, but that seems to be due to the Intel OpenCL ICD. If I disable it, things at least seem to run.
*sigh* It seems memray gets confused by GPUs and simply reports the 600 GiB of mapped GPU memory as memory consumption. :facepalm: (Yes, yes, I'm rebuilding pocl without GPUs.)
[Nvm all this, see below]
:tada: Memray did something useful!
Here's the relevant flamegraph: https://gist.github.com/inducer/2aaa320fdf49b3d5af651cbc28d5ee4d
It seems like most of the leaks are in pocl, with vkfft contributing a measly 100 MB. I'll recompile pocl with debug to try and get more precise info.
[Nvm all this, see below]
Here's the updated flamegraph with pocl in debug mode: https://gist.github.com/inducer/bfaeb3a85025f3e7a747cba930e2477e
One thing that's not clear: the bottom of most stack traces shows memray. Does that mean that memray's own allocations dominate, and that's all we've learned? Or is the presence of those stack frames just an artifact that we should ignore?
[Nvm all this, see below]
Continuing on the assumption that the "memray intercept stuff" is not the actual source of memory consumption. If that's the case, then that flamegraph implicates two specific `malloc`s in pocl:
Hmm... both of those bits of code look non-leaky to me, but both of them also return memory to a caller... so there's probably more going on.
[Nvm all this, see below]
The first of those sites also leaks when running https://gist.github.com/inducer/6c2a088eefeffd86cdd456e5aa244bec, a slightly modified PyOpenCL demo, on an empty cache, where it seems to leak 256 MB over the course of the runtime.
[Nvm all this, see below]
I've been staring at this for a while: both the PyOpenCL demo and the full-scale test with `test_layer_pot.py` show multiple instances of `temp_buf` escaping this function. I just can't for the life of me figure out how.
Never mind all that. TIL that `memray flamegraph` has a `--leaks` option that you need to use if you'd like to see leaks. :facepalm: (along with `PYTHONMALLOC=malloc` when gathering)
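For the record, a leak-tracking run then looks roughly like this (a sketch; `driver.py` is a hypothetical stand-in for the script that calls the tests directly):

```sh
# Route Python's allocations through malloc so memray can see them,
# and record native stacks while tracking.
PYTHONMALLOC=malloc python -m memray run --native -o pytential-leaks.bin driver.py

# Build a flamegraph of allocations that were never freed.
python -m memray flamegraph --leaks pytential-leaks.bin
```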
The latest from memray: There is no leak! Here's the flamegraph: https://ssl.tiker.net/nextcloud/s/ayQRNNcx8MBexkS
So maybe it's time to come to terms with the idea that the leak, if there is one, isn't visible to `malloc`? :shrug:
Here's a multi-process memory profile, from mprof, made with:

```sh
PYOPENCL_TEST=port:cpu mprof run -M python -m pytest -rsxw --durations=10 -n 4
```
It looks like it's mostly one subprocess/one test that hoovers up most of that memory...
So... maybe all that leak-hunting was not as useless as I had feared. Here's an `mprof` of a single-process run:

```sh
PYOPENCL_TEST=port:cpu mprof run --python pytest -rsxw --durations=10
```
Noticeably, the one 12G bump still exists.
It looks like that 12G thing is `test_skeletonize_by_proxy_convergence`. Memray told me. :tada:
Flamegraph: https://ssl.tiker.net/nextcloud/s/zyLA76DJ4C8adRa
> It looks like that 12G thing is `test_skeletonize_by_proxy_convergence`. Memray told me. :tada: Flamegraph: https://ssl.tiker.net/nextcloud/s/zyLA76DJ4C8adRa
Well that's quite impressive. I'll try to look at it over the weekend. It definitely should not do that :\
Hm, tried looking only at the skeletonization tests with

```sh
mprof run python -m pytest -v -s -m 'not slowtest' test/test_linalg_skeletonization.py
```

and there's definitely a big bump in memory due to that test. It's mainly due to constructing the full dense matrix here (a slightly bigger one than the one that is in `test_matrix.py`):
https://github.com/inducer/pytential/blob/27d976d13671a6edfa8b7aeb57650d9b50bdc86e/test/test_linalg_skeletonization.py#L222-L229
Locally it only bumps up to about 2 GB, not 12 GB, so not sure what that's about. However, that matrix construction makes up most of the memory, and the error computation (which uses that matrix to compute a bunch of block-wise matrix norms and a big SVD) makes up most of the runtime. I'll continue digging into it.
(blue is local and black is koelsch)
@inducer What did you run to get the results from https://github.com/inducer/pytential/issues/59#issuecomment-1655951767? I tried running on koelsch and it looks intimidatingly reasonable:
From the command on the plot, it seems like you're also running the `slowtest` tests? There are some 3D tests in the skeletonization that use ALL the memory, so maybe those are showing up? They shouldn't actually be run on the CI though...
I did not use `-k 'not slowtest'`. :thinking: Have you pulled PyOpenCL? That's the only difference between CI and these runs that I can think of, though I don't think https://github.com/inducer/pyopencl/commit/9931e14d496b7ce7f32a60b1cac13a7ef3e580b8 would make all that difference (between the CIs, which are presumably dying because of excessive memory allocation, and your run).
Also: let me rerun with the slow tests excluded.
> I did not use `-k 'not slowtest'`. :thinking: Have you pulled PyOpenCL? That's the only difference between CI and these runs that I can think of, though I don't think inducer/pyopencl@9931e14 would make all that difference.
Yeah, all of my runs have been on the latest git versions of the whole stack, with a `mamba update --all` a few days ago.
EDIT: And I think I cleared all the disk caches last week too.
Continued from https://gitlab.tiker.net/inducer/pytential/-/issues/131.