Unlike for mirgecom, jemalloc doesn't seem to work wonders for pytential. This is memory use according to memory-profiler for a sequential run of `python -m pytest`:
I'm glad it's enough to unbreak the CI (#56, #58), but the problem doesn't appear to be totally solved.
Using the machinery I made for mirgecom, I'll try to do a running total of CL buffer memory allocated and see if there's any growth there.
jemalloc seems to have some knobs to make it release memory more aggressively. There's a `dirty_decay_ms` and a `muzzy_decay_ms`. Should be easy to see what they do by setting

```sh
export MALLOC_CONF='dirty_decay_ms:5000,muzzy_decay_ms:5000'
```

(the default is 10000, i.e. 10s; smaller means it releases memory more aggressively).
Found this pretty useful too, but no idea how up to date it is.
Using this method for instrumenting memory allocation in pocl, I get the following plot of allocated memory (by pocl) over time:
This seems to support the assertion that, as with mirgecom, the underlying cause for the growth of allocated memory might be memory fragmentation, and, as a result, the allocator requesting more and more memory from the OS. In particular, it appears that, at least from the point of view of OpenCL, all memory that is allocated is also freed.
For reproducibility:

```sh
PYOPENCL_TEST=port:pthread POCL_DEBUG=memory ~/shared/with-my-pocl python -m pytest --tb=native -rxsw --capture=no . 2>out
```

run with 48efe3b and pocl a2d016c8.
Interestingly, replacing pocl's allocator for global memory with jemalloc doesn't seem to do much:
If anything, it releases even less memory. Maybe @alexfikl is right, and this comes down to knob twiddling.
For reproducibility:

```sh
./autogen.sh --prefix=$HOME/shared/pack/jemalloc-build --with-jemalloc-prefix=jemalloc_ --disable-cxx --with-install-suffix=_prefixed --disable-initial-exec-tls
PYOPENCL_TEST=port:pthread mprof run ~/shared/with-my-pocl python -m pytest --tb=native -rxsw
```

where `with-my-pocl` is this script.
As for things that are cheap to do, I discovered another allocator to try.
mimalloc (used globally) is not an improvement:
Tried jemalloc with the flags given by @alexfikl:

```sh
MALLOC_CONF='dirty_decay_ms:5000,muzzy_decay_ms:5000' PYOPENCL_TEST=port:pthread LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so mprof run python -m pytest --tb=native -rxsw
```
Not really all that different from glibc or vanilla jemalloc:
Just to add more random data to this, I ran

```sh
mprof run --python -T 0.01 pytest test/test_cost_model.py test/test_scalar_int_eq.py test/test_global_qbx.py test/test_target_specific_qbx.py
```

This only runs four of the test files, not the whole suite, so it should probably be taken with a grain of salt.
The different lines are:

- `export MALLOC_TRIM_THRESHOLD_=1024` (the default is 128 * 1024).
- `export MALLOC_CONF='abort:true,dirty_decay_ms:1000,muzzy_decay_ms:1000'`, i.e. a bit more aggressive than setting the decays to 5000.

It does seem like changing some options will not have a big effect vs. default jemalloc :\
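For concreteness, these settings were presumably combined with the `mprof` command above roughly as follows (a reconstructed sketch, not copied verbatim; the jemalloc run assumes it is loaded via `LD_PRELOAD` as in the earlier comment):

```sh
# glibc: return free()d memory to the OS more eagerly
# (the default trim threshold is 128 * 1024 bytes)
MALLOC_TRIM_THRESHOLD_=1024 \
    mprof run --python -T 0.01 pytest test/test_cost_model.py test/test_scalar_int_eq.py \
    test/test_global_qbx.py test/test_target_specific_qbx.py

# jemalloc: purge dirty/muzzy pages after 1s instead of the 10s default
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so \
MALLOC_CONF='abort:true,dirty_decay_ms:1000,muzzy_decay_ms:1000' \
    mprof run --python -T 0.01 pytest test/test_cost_model.py test/test_scalar_int_eq.py \
    test/test_global_qbx.py test/test_target_specific_qbx.py
```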
I've also tried adding

```python
def teardown_function(func):
    import gc
    gc.collect()

    import ctypes
    libc = ctypes.CDLL("libc.so.6")
    # ask glibc to return freed arena memory to the OS
    libc.malloc_trim(0)
```

at the top of those test files to force them to free as much memory as possible, but it didn't seem to improve much. However, it does seem to deallocate memory towards the end there, so it may not continue growing as all the tests run.
Went ahead and ran the whole test suite with the snippet from the previous comment added to all files.

```python
def teardown_function(func):
    import gc
    gc.collect()

    import ctypes
    libc = ctypes.CDLL("libc.so.6")
    libc.malloc_trim(0)
```
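(Side note for reproducibility: instead of pasting that snippet into every test file, the same effect can presumably be had once, from a `conftest.py` at the test-suite root. A minimal sketch, assuming plain pytest; this is not what was actually run:)

```python
# conftest.py -- after every test, collect garbage and ask glibc to return
# freed arena memory to the OS.
import ctypes
import gc

import pytest


@pytest.fixture(autouse=True)
def _trim_memory_after_each_test():
    yield
    gc.collect()
    libc = ctypes.CDLL("libc.so.6")  # glibc-specific
    libc.malloc_trim(0)
```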
Thanks! One additional bit of information: The Python GC already gets run by pyopencl (I think) between tests.
> Thanks! One additional bit of information: The Python GC already gets run by pyopencl.

Ah, that explains why I didn't see much of a difference when removing the `collect` at some point!
Had an OpenCL-related theory for something that could be contributing: https://github.com/inducer/pyopencl/issues/450
`python test_layer_pot.py` with this diff

```diff
diff --git a/test/test_layer_pot.py b/test/test_layer_pot.py
index 409eede..a67bb78 100644
--- a/test/test_layer_pot.py
+++ b/test/test_layer_pot.py
@@ -614,15 +614,17 @@ def test_3d_jump_relations(actx_factory, relation, visualize=False):
 # }}}
+def run_em_all():
+    import gc
+    for i in range(10):
+        test_off_surface_eval(_acf, True)
+        gc.collect()
+
+
 # You can test individual routines by typing
 # $ python test_layer_pot.py 'test_routine()'
 if __name__ == "__main__":
-    import sys
-    if len(sys.argv) > 1:
-        exec(sys.argv[1])
-    else:
-        from pytest import main
-        main([__file__])
+    run_em_all()
 # vim: fdm=marker
```
seems to be a microcosm of this issue that runs in manageable time. Memory usage goes up in a straight line, when... there's no reason for that, IMO:
The allocations seem to scale with the size of the problem (in this case, set by changing `npoints=3000` on line 116), indicating that it's at least some amount of CL memory that's going unfreed. Interestingly, some deallocations do turn up in the plot at that size, but there's still a big bunch of leakage:
I'm using the code in this diff to hunt the leaks. One (possibly minor?) contributor to this appears to be `ProcessLogger`, which spawns a bunch of threads whose resources appear to never be reaped.
That was a reference cycle, but not a significant contributor: https://github.com/inducer/pytools/pull/181.
Beyond that, it's mostly a bunch of pyopencl "stuff" that stays alive:
<class 'list'>, <class 'list'>, <class 'set'>, <class 'pyopencl._cl.Context'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'dict'>, <class 'dict'>, <class 'frozenset'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'dict'>, <class 'frozenset'>, <class 'dict'>, <class 'frozenset'>, <class 'dict'>, <class 'dict'>, <class 'tuple'>, <class 'pyopencl.scan.GenericScanKernel'>, <class 'list'>, <class 'pyopencl._cl.Device'>, <class 'list'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl.scan._BuiltScanKernelInfo'>, <class 'pyopencl.scan._BuiltScanKernelInfo'>, <class 'pyopencl.scan._BuiltFinalUpdateKernelInfo'>, <class 'tuple'>, <class 'pyopencl.elementwise.ElementwiseKernel'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.ScalarArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.ScalarArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl._cl.Platform'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl._cl.Kernel'>, <class 'tuple'>, <class 'list'>, <class 'dict'>, <class 'dict'>, <class 'tuple'>, <class 'tuple'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'tuple'>, <class 'tuple'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'tuple'>, <class 'tuple'>, <class 'tuple'>, <class 'pyopencl.elementwise.ElementwiseKernel'>, <class 'tuple'>, <class 'pyopencl.elementwise.ElementwiseKernel'>, <class 'dict'>, <class 'dict'>, <class 'pyopencl.reduction.ReductionKernel'>, <class 'tuple'>, <class 'tuple'>, <class 'list'>, <class 'tuple'>, <class 'list'>, <class 'pyopencl._cl.Kernel'>, <class 'pyopencl.reduction.ReductionKernel'>, <class 'tuple'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.ScalarArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, <class 'pyopencl.tools.VectorArg'>, ...`
Hm. Calling `pyopencl.tools.clear_first_arg_caches` in each loop makes most of those go away (along with substantially flattening the `mprof` curve), so this is probably as expected. One thought is that running pytest in parallel may mean that `pyopencl.tools.clear_first_arg_caches` is not getting called as intended. (from here)
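Concretely, that amounts to something like this inside the `run_em_all` loop from the diff above (a sketch):

```python
def run_em_all():
    import gc

    import pyopencl.tools as cl_tools

    for _ in range(10):
        test_off_surface_eval(_acf, True)
        # Drop pyopencl's per-context ("first-arg") caches of kernels etc.,
        # then let the GC take care of what is now unreferenced.
        cl_tools.clear_first_arg_caches()
        gc.collect()
```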
Irritatingly, even with zero growth of Python-side object allocation, the `test_off_surface_eval` loop still manages to lose about 20 MB over 30 runs.
OK, it's flat after switching to jemalloc, so that's probably not our problem. Switching to the interaction of pytest with PyOpenCL's first-arg caches as the prime suspect.
Well. So I removed that `return` from the middle of `test_off_surface_eval`, and suddenly we're leaking a bunch of loopy stuff. This may just be the caches though, will keep digging.
Observations:

- `LoopKernel`s show up in `new_objects` in each iteration. They're all for the FMM translations. Why are they there?
- `clear_first_arg_caches` is getting called, from here.

https://github.com/inducer/loopy/pull/797 (plus small changes across the stack) should make it so that loopy caches don't keep contexts alive.
Somewhat depressingly, https://github.com/inducer/loopy/pull/797 doesn't seem to have made a decisive difference. Blue is with the changes, black is before.
Been on the hunt for other caches that keep things alive. They contribute, but they're all small potatoes:
I've also printed lists of Python objects that, according to `gc`, survive each test. Here's a flavor:
138590, 62 diff, 119 new
[(<class 'builtin_function_or_method'>, 16), (<class 'weakref.ReferenceType'>, 15), (<class 'list'>, 14), (<class 'dict'>, 14), (<class 'cell'>, 10), (<class 'tuple'>, 8), (<class 'generator'>, 6), (<class 'method'>, 5), (<class '_pytest.terminal.WarningReport'>, 4), (<class 'warnings.WarningMessage'>, 4), (<class 'DeprecationWarning'>, 4), (<class '_pytest.reports.TestReport'>, 3), (<class 'list_reverseiterator'>, 2), (<class 'function'>, 2), (<class 'pytools.persistent_dict._LinkedList'>, 2), (<class '_pytest.logging.catching_logs'>, 2), (<class 'set'>, 1), (<class 'functools.partial'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1), (<class '_pytest.unraisableexception.catch_unraisable_exception'>, 1), (<class '_io.StringIO'>, 1), (<class 'pytools.persistent_dict._LRUCache'>, 1), (<class 'contextlib._GeneratorContextManager'>, 1)]
.
138728, 138 diff, 195 new
[(<class 'function'>, 39), (<class 'tuple'>, 31), (<class 'dict'>, 25), (<class 'cell'>, 22), (<class 'list'>, 11), (<class 'method'>, 8), (<class 'property'>, 8), (<class 'generator'>, 6), (<class 'pytools.persistent_dict._LinkedList'>, 6), (<class '_pytest.terminal.WarningReport'>, 4), (<class 'pytools.persistent_dict._LRUCache'>, 4), (<class 'type'>, 4), (<class 'weakref.ReferenceType'>, 4), (<class 'getset_descriptor'>, 4), (<class '_pytest.reports.TestReport'>, 2), (<class 'set'>, 1), (<class 'builtin_function_or_method'>, 1), (<class 'functools.partial'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class 'warnings.catch_warnings'>, 1), (<class 'warnings.WarningMessage'>, 1), (<class 'list_reverseiterator'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1), (<class '_pytest.unraisableexception.catch_unraisable_exception'>, 1), (<class '_io.StringIO'>, 1), (<class 'module'>, 1), (<class 'logging.Logger'>, 1), (<class '_frozen_importlib_external.SourceFileLoader'>, 1), (<class '_frozen_importlib.ModuleSpec'>, 1), (<class 'DeprecationWarning'>, 1)]
l_inf error: 0.0025745079426662495
.
138820, 92 diff, 157 new
[(<class 'list'>, 49), (<class 'tuple'>, 20), (<class 'dict'>, 15), (<class 'cell'>, 13), (<class 'method'>, 10), (<class 'pytools.persistent_dict._LinkedList'>, 7), (<class 'pytools.persistent_dict._LRUCache'>, 5), (<class 'pymbolic.primitives.Variable'>, 4), (<class 'function'>, 3), (<class 'pymbolic.primitives.Product'>, 3), (<class 'set'>, 2), (<class 'list_reverseiterator'>, 2), (<class 'contextlib._GeneratorContextManager'>, 2), (<class 'generator'>, 2), (<class '_io.StringIO'>, 2), (<class '_pytest.logging.catching_logs'>, 2), (<class 'pymbolic.primitives.Call'>, 2), (<class 'pymbolic.primitives.Subscript'>, 2), (<class '_pytest.terminal.WarningReport'>, 1), (<class 'functools.partial'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class 'warnings.catch_warnings'>, 1), (<class 'sumpy.expansion.LinearPDEBasedExpansionTermsWrangler'>, 1), (<class 'sumpy.expansion.CSEMatVecOperator'>, 1), (<class '_pytest.reports.TestReport'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1), (<class '_pytest.unraisableexception.catch_unraisable_exception'>, 1), (<class 'sumpy.kernel.LaplaceKernel'>, 1), (<class 'pymbolic.primitives.Quotient'>, 1), (<class 'pymbolic.primitives.Sum'>, 1)]
.
139330, 510 diff, 568 new
[(<class 'tuple'>, 83), (<class 'function'>, 79), (<class 'dict'>, 64), (<class 'weakref.ReferenceType'>, 40), (<class 'cell'>, 36), (<class 'builtin_function_or_method'>, 31), (<class 'list'>, 28), (<class 'getset_descriptor'>, 20), (<class 'wrapper_descriptor'>, 13), (<class 'method_descriptor'>, 13), (<class 'pymbolic.primitives.Variable'>, 12), (<class 'set'>, 11), (<class 'module'>, 10), (<class 'type'>, 10), (<class '_frozen_importlib.ModuleSpec'>, 10), (<class 'method'>, 9), (<class '_frozen_importlib_external.SourceFileLoader'>, 8), (<class 'frozenset'>, 7), (<class 'pyopencl.tools._SimpleTextTemplate'>, 6), (<class 'pytools.log_process'>, 6), (<class 'property'>, 5), (<class 'pytential.qbx.target_specific.impl.Enum'>, 5), (<class 'pytools.persistent_dict._LinkedList'>, 5), (<class 'pymbolic.primitives.Call'>, 4), (<class 'pymbolic.primitives.Subscript'>, 4), (<class 'warnings.WarningMessage'>, 4), (<class 'DeprecationWarning'>, 4), (<class 'generator'>, 3), (<class 'staticmethod'>, 3), (<class 'pytools.persistent_dict._LRUCache'>, 3)]
.
139453, 123 diff, 178 new
[(<class 'builtin_function_or_method'>, 33), (<class 'weakref.ReferenceType'>, 32), (<class 'tuple'>, 18), (<class 'dict'>, 14), (<class 'list'>, 12), (<class 'method'>, 9), (<class 'pymbolic.primitives.Variable'>, 6), (<class 'pytools.persistent_dict._LinkedList'>, 6), (<class 'generator'>, 5), (<class 'warnings.WarningMessage'>, 4), (<class 'DeprecationWarning'>, 4), (<class 'set'>, 3), (<class '_pytest.terminal.WarningReport'>, 3), (<class 'pytools.persistent_dict._LRUCache'>, 3), (<class '_pytest.reports.TestReport'>, 2), (<class 'function'>, 2), (<class 'contextlib._GeneratorContextManager'>, 2), (<class 'cell'>, 2), (<class 'pymbolic.primitives.Call'>, 2), (<class 'frozenset'>, 2), (<class 'pymbolic.primitives.Subscript'>, 2), (<class '_io.StringIO'>, 2), (<class '_pytest.logging.catching_logs'>, 2), (<class 'functools.partial'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class 'warnings.catch_warnings'>, 1), (<class 'pymbolic.primitives.Sum'>, 1), (<class 'pymbolic.primitives.Product'>, 1), (<class 'list_reverseiterator'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1)]
.
139482, 29 diff, 94 new
[(<class 'list'>, 15), (<class 'dict'>, 11), (<class 'cell'>, 10), (<class 'method'>, 9), (<class 'tuple'>, 8), (<class 'pytools.persistent_dict._LinkedList'>, 6), (<class '_pytest.terminal.WarningReport'>, 4), (<class 'warnings.WarningMessage'>, 4), (<class 'DeprecationWarning'>, 4), (<class '_pytest.reports.TestReport'>, 3), (<class 'generator'>, 3), (<class 'pytools.persistent_dict._LRUCache'>, 3), (<class 'list_reverseiterator'>, 2), (<class 'contextlib._GeneratorContextManager'>, 2), (<class '_io.StringIO'>, 2), (<class '_pytest.logging.catching_logs'>, 2), (<class 'set'>, 1), (<class 'functools.partial'>, 1), (<class 'function'>, 1), (<class '_pytest.fixtures.SubRequest'>, 1), (<class '_pytest.threadexception.catch_threading_exception'>, 1), (<class '_pytest.unraisableexception.catch_unraisable_exception'>, 1)]
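For reference, per-test type counts like the ones above can be gathered with a small `gc`-based helper along these lines (a sketch of the technique, not the exact instrumentation used; `test_off_surface_eval`/`_acf` stand in for whatever the loop actually runs):

```python
# Sketch: report the types of objects that survive each iteration, per gc.
import gc
from collections import Counter


def report_survivors(prev_ids, top=30):
    gc.collect()
    objs = gc.get_objects()
    new = [o for o in objs if id(o) not in prev_ids]
    print(f"{len(objs)}, {len(objs) - len(prev_ids)} diff, {len(new)} new")
    print(Counter(type(o) for o in new).most_common(top))
    return {id(o) for o in objs}


ids = report_survivors(set())
for _ in range(10):
    test_off_surface_eval(_acf, True)
    ids = report_survivors(ids)
```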
I'm increasingly coming to the conclusion that these leaks might not be Python-visible---these objects are unlikely to explain hundreds of megabytes, I think.
Further confirmation of this is that `tracemalloc` (what a nice tool!) seems to see none of the offending allocations.
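(The `tracemalloc` check was along these lines; again a sketch, with `test_off_surface_eval` standing in for the loop body:)

```python
# Sketch: compare Python-level allocations between iterations with tracemalloc.
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation
baseline = tracemalloc.take_snapshot()

for _ in range(10):
    test_off_surface_eval(_acf, True)
    snapshot = tracemalloc.take_snapshot()
    # Top 10 allocation sites by growth since the previous iteration
    for stat in snapshot.compare_to(baseline, "lineno")[:10]:
        print(stat)
    baseline = snapshot
```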
Can you try with `SUMPY_FFT_BACKEND=loopy`?
Thanks for following along! :grinning:
`SUMPY_FFT_BACKEND=loopy` takes longer, but otherwise doesn't seem to make a big difference (it's the black curve):
Does something like https://github.com/bloomberg/memray give additional useful information? I've never actually tried it..
EDIT: Possibly even https://github.com/bloomberg/pytest-memray?
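(Untested, but if the pytest plugin works here, the invocation would presumably be along these lines; the `--memray` flag is an assumption on my part:)

```sh
pip install pytest-memray
python -m pytest --memray test/
```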
I was trying https://github.com/plasma-umass/scalene just now, but it seemed to incur fairly significant slowdown? Not sure. I guess I'll switch to memray for a bit?
I've already cut pytest out of the picture and written some code to call the tests manually, to make sure pytest isn't at fault. (Yes, I'm that deep into conspiracy territory.) Memray just crashed initially, but that seems to be due to the Intel OpenCL ICD. If I disable it, things at least seem to run.
*sigh* It seems memray gets confused by GPUs and simply reports the 600 GiB of mapped GPU memory as memory consumption. :facepalm: (Yes, yes, I'm rebuilding pocl without GPUs.)
[Nvm all this, see below]
:tada: Memray did something useful!
Here's the relevant flamegraph: https://gist.github.com/inducer/2aaa320fdf49b3d5af651cbc28d5ee4d
It seems like most of the leaks are in pocl, with vkfft contributing a measly 100 MB. I'll recompile pocl with debug to try and get more precise info.
[Nvm all this, see below]
Here's the updated flamegraph with pocl in debug mode: https://gist.github.com/inducer/bfaeb3a85025f3e7a747cba930e2477e
One thing that's not clear: the bottom of most stack traces shows memray. Does that mean that memray's own allocations dominate, and that's all we've learned? Or is the presence of those stack frames just an artifact that we should ignore?
[Nvm all this, see below]
Continuing on the assumption that the "memray intercept stuff" is not the actual source of memory consumption. If that's the case, then that flamegraph implicates two specific `malloc`s in pocl:
Hmm... both of those bits of code look non-leaky to me, but both of them also return memory to a caller... so there's probably more going on.
[Nvm all this, see below]
The first of those sites also leaks when running https://gist.github.com/inducer/6c2a088eefeffd86cdd456e5aa244bec, a slightly modified PyOpenCL demo, on an empty cache, where it seems to leak 256 MB over the course of the runtime.
[Nvm all this, see below]
I've been staring at this for a while: both the PyOpenCL demo and the full-scale test with `test_layer_pot.py` show multiple instances of `temp_buf` escaping this function. I just can't for the life of me figure out how.
Never mind all that. TIL that `memray flamegraph` has a `--leaks` option that you need to use if you'd like to see leaks. :facepalm: (along with `PYTHONMALLOC=malloc` when gathering)
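For the record, a leak-tracking run then looks roughly like this (a sketch; `driver.py` is a hypothetical stand-in for the script that calls the tests directly):

```sh
# Route Python's allocations through malloc so memray can see them,
# and record native stacks while tracking.
PYTHONMALLOC=malloc python -m memray run --native -o pytential-leaks.bin driver.py

# Build a flamegraph of allocations that were never freed.
python -m memray flamegraph --leaks pytential-leaks.bin
```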
The latest from memray: There is no leak! Here's the flamegraph: https://ssl.tiker.net/nextcloud/s/ayQRNNcx8MBexkS
So maybe it's time to come to terms with the idea that the leak, if there is one, isn't visible to `malloc`? :shrug:
Here's a multi-process memory profile, from mprof, made with:

```sh
PYOPENCL_TEST=port:cpu mprof run -M python -m pytest -rsxw --durations=10 -n 4
```
It looks like it's mostly one subprocess/one test that hoovers up most of that memory...
So... maybe all that leak-hunting was not as useless as I had feared. Here's an `mprof` of a single-process run:

```sh
PYOPENCL_TEST=port:cpu mprof run --python pytest -rsxw --durations=10
```
Noticeably, the one 12G bump still exists.
It looks like that 12G thing is `test_skeletonize_by_proxy_convergence`. Memray told me. :tada:
Flamegraph: https://ssl.tiker.net/nextcloud/s/zyLA76DJ4C8adRa
> It looks like that 12G thing is `test_skeletonize_by_proxy_convergence`. Memray told me. :tada: Flamegraph: https://ssl.tiker.net/nextcloud/s/zyLA76DJ4C8adRa
Well that's quite impressive. I'll try to look at it over the weekend. It definitely should not do that :\
Hm, tried looking only at the skeletonization tests with

```sh
mprof run python -m pytest -v -s -m 'not slowtest' test/test_linalg_skeletonization.py
```

and there's definitely a big bump in memory due to that test. It's mainly due to constructing the full dense matrix here (a slightly bigger one than the one that is in `test_matrix.py`):
https://github.com/inducer/pytential/blob/27d976d13671a6edfa8b7aeb57650d9b50bdc86e/test/test_linalg_skeletonization.py#L222-L229
Locally it only bumps up to about 2 GB, not 12 GB, so not sure what that's about. However, that matrix construction makes up most of the memory, and the error computation (which uses that matrix to compute a bunch of block-wise matrix norms and a big SVD) makes up most of the runtime. I'll continue digging into it.
(blue is local and black is koelsch)
@inducer What did you run to get the results from https://github.com/inducer/pytential/issues/59#issuecomment-1655951767? I tried running on koelsch and it looks intimidatingly reasonable:
From the command on the plot, it seems like you're also running the `slowtest` tests? There are some 3D tests in the skeletonization that use ALL the memory, so maybe those are showing up? They shouldn't actually be run on the CI though...
I did not use `-k 'not slowtest'`. :thinking: Have you pulled PyOpenCL? That's the only difference between CI and these runs that I can think of, though I don't think https://github.com/inducer/pyopencl/commit/9931e14d496b7ce7f32a60b1cac13a7ef3e580b8 would make all that difference (between the CIs, which are presumably dying because of excessive memory allocation, and your run).
Also: let me rerun with the slow tests excluded.
> I did not use `-k 'not slowtest'`. :thinking: Have you pulled PyOpenCL? That's the only difference between CI and these runs that I can think of, though I don't think inducer/pyopencl@9931e14 would make all that difference.
Yeah, all of my runs have been on the latest git versions of the whole stack, with a `mamba update --all` a few days ago.
EDIT: And I think I cleared all the disk caches last week too.
Continued from https://gitlab.tiker.net/inducer/pytential/-/issues/131.