Closed: tiran closed this issue 10 months ago.
From https://github.com/microsoft/mimalloc
mimalloc (pronounced "me-malloc") is a general purpose allocator with excellent performance characteristics. Initially developed by Daan Leijen for the run-time systems of the Koka and Lean languages.
mimalloc has several interesting properties that make it useful for CPython. Among other things, it is fast, thread-safe, and NUMA-aware. It has built-in free lists with multi-sharding and allocation heaps. While Python's obmalloc requires the GIL to protect its data structures, mimalloc mostly uses thread-local data and atomic instructions (compare-and-swap) for efficiency. Sam Gross' nogil relies on mimalloc's thread safety and uses first-class heaps for heap-walking GC.
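As an illustration of what "thread-local data and atomic instructions" means in practice, here is a minimal C11 sketch (purely illustrative, not mimalloc's actual code) of a lock-free compare-and-swap push that lets an allocator accept frees from other threads without taking a lock; `block_t`, `page_t`, and `remote_free` are made-up names for the example.

```c
#include <stdatomic.h>

typedef struct block_s {
    struct block_s *next;
} block_t;

typedef struct {
    _Atomic(block_t *) thread_free;  /* blocks freed by threads that do not own the page */
} page_t;

/* Push a freed block onto the page's thread-free list without taking a lock. */
static void remote_free(page_t *page, block_t *block)
{
    block_t *head = atomic_load_explicit(&page->thread_free, memory_order_relaxed);
    do {
        block->next = head;  /* link the block in front of the current head */
    } while (!atomic_compare_exchange_weak_explicit(
                 &page->thread_free, &head, block,
                 memory_order_release, memory_order_relaxed));
}
```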
mimalloc works on the majority of platforms and CPU architectures. However, it requires a compiler with C11 atomics support. CentOS 7's default GCC is slightly too old; a more recent GCC from Developer Toolset is required.
For 3.11 I plan to integrate mimalloc as an optional drop-in replacement for obmalloc. Users will be able to compile CPython without mimalloc or disable mimalloc with the PYTHONMALLOC environment variable. Since mimalloc will be optional in 3.11, Python won't depend on or expose any of the advanced features yet. This approach enables the community to test and give feedback with minimal risk of breakage.
mimalloc sources will be vendored without any option to use system libraries. Python's mimalloc requires several non-standard compile-time flags. In the future Python may extend or modify mimalloc for heap walking and nogil, too.
(This is a tracking bug until I find time to finish a PEP.)
I'm adding Neil to the nosy list since he is one of the people who kicked off this amazing work :)
Buildbots "PPC64 Fedora PR" and all RHEL 7 build bots provided by David Edelsohn are failing because compiler is missing support for stdatomic.h.
Thanks, I'm indeed interested. Most credit goes to Christian for advancing this.
For the missing stdatomic.h, would it be appropriate to have an autoconf check for it? We could just disable mimalloc if it doesn't exist.
We have an autoconf check for stdatomic.h. The test even verifies that a program with atomic_load_explicit() compiles and links.
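A probe along the following lines is the kind of program such a check compiles and links (a minimal sketch only; the actual test body in configure.ac may differ):

```c
#include <stdatomic.h>

int main(void)
{
    atomic_int value = 0;
    atomic_store_explicit(&value, 1, memory_order_release);
    return atomic_load_explicit(&value, memory_order_acquire);
}
```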
How do we want to use mimalloc in the future? Is it going to stay optional in 3.12? Then the default setting for --with-mimalloc should depend on the presence of stdatomic.h. Do we want to make it mandatory for GC heap walking and nogil? Then --with-mimalloc should default to "yes" and configure should abort when stdatomic.h is missing.
I'm leaning towards --with-mimalloc=yes. It will make users aware that they need a compiler with atomics:
configure: error: --with-mimalloc requires stdatomic.h. Update your compiler or rebuild with --without-mimalloc. Python 3.12 will require stdatomic.
ICC might be a problem. Apparently some versions have an incomplete stdatomic.h; see bpo-37415.
My preference would be for --with-mimalloc=yes in an upcoming release. For platforms without the required stdatomic.h stuff, they can manually specify --with-mimalloc=no. That will make them aware that a future release of Python might no longer build (if mimalloc is no longer optional).
A soft landing for merging nogil is not a good enough reason to merge mimalloc, IMHO. nogil may never be merged. There should be some concrete and immediate advantage to switching to mimalloc. The idea of using "heap walking" to improve the cyclic GC is not concrete enough. It's just an idea at this point.
I think the (small) performance win could be enough of a reason to merge. This seems to be the most recent benchmark:
https://gist.github.com/pablogsal/8027937b71cd30f17aaaa5ef7c885d3e
There is also the long-term maintenance issue. So far, mimalloc upstream has been responsive. The mimalloc code is not so huge or complicated that we couldn't maintain it (if for some reason it gets abandoned upstream). However, I think we would prefer to maintain obmalloc rather than mimalloc, all else being equal. Abandonment by the upstream seems fairly unlikely. So, I'm not too concerned about maintenance.
New benchmark:
Benchmark | 2022-02-08_11-54-master-69e10976b2e7 | 2022-02-08_11-57-master-d6f5f010b586 |
---|---|---|
mako | 8.85 ms | 7.83 ms: 1.13x faster |
hexiom | 6.04 ms | 5.54 ms: 1.09x faster |
spectral_norm | 81.4 ms | 75.2 ms: 1.08x faster |
pyflate | 380 ms | 352 ms: 1.08x faster |
scimark_sparse_mat_mult | 4.05 ms | 3.76 ms: 1.08x faster |
pickle_pure_python | 312 us | 290 us: 1.07x faster |
unpickle_pure_python | 238 us | 222 us: 1.07x faster |
float | 63.1 ms | 59.5 ms: 1.06x faster |
tornado_http | 90.3 ms | 86.0 ms: 1.05x faster |
html5lib | 62.8 ms | 60.2 ms: 1.04x faster |
regex_compile | 121 ms | 116 ms: 1.04x faster |
scimark_lu | 106 ms | 102 ms: 1.04x faster |
nqueens | 70.9 ms | 68.4 ms: 1.04x faster |
crypto_pyaes | 70.1 ms | 67.8 ms: 1.03x faster |
logging_silent | 97.5 ns | 94.4 ns: 1.03x faster |
sympy_integrate | 17.2 ms | 16.7 ms: 1.03x faster |
sympy_str | 260 ms | 252 ms: 1.03x faster |
sympy_expand | 441 ms | 427 ms: 1.03x faster |
pathlib | 14.1 ms | 13.7 ms: 1.03x faster |
regex_dna | 164 ms | 159 ms: 1.03x faster |
regex_v8 | 21.1 ms | 20.6 ms: 1.02x faster |
sympy_sum | 138 ms | 136 ms: 1.02x faster |
scimark_fft | 286 ms | 281 ms: 1.02x faster |
pickle | 9.34 us | 9.19 us: 1.02x faster |
xml_etree_parse | 126 ms | 124 ms: 1.01x faster |
richards | 43.0 ms | 42.4 ms: 1.01x faster |
xml_etree_generate | 71.2 ms | 70.5 ms: 1.01x faster |
scimark_monte_carlo | 58.8 ms | 58.3 ms: 1.01x faster |
deltablue | 3.60 ms | 3.58 ms: 1.01x faster |
chaos | 64.6 ms | 64.3 ms: 1.01x faster |
2to3 | 216 ms | 215 ms: 1.00x faster |
pidigits | 155 ms | 154 ms: 1.00x faster |
nbody | 76.4 ms | 77.0 ms: 1.01x slower |
python_startup_no_site | 3.96 ms | 3.99 ms: 1.01x slower |
xml_etree_iterparse | 82.5 ms | 83.1 ms: 1.01x slower |
scimark_sor | 103 ms | 104 ms: 1.01x slower |
unpickle | 11.3 us | 11.4 us: 1.01x slower |
telco | 5.53 ms | 5.58 ms: 1.01x slower |
python_startup | 5.56 ms | 5.62 ms: 1.01x slower |
json_loads | 20.6 us | 20.8 us: 1.01x slower |
json_dumps | 9.61 ms | 9.77 ms: 1.02x slower |
dulwich_log | 60.9 ms | 62.1 ms: 1.02x slower |
logging_format | 5.47 us | 5.62 us: 1.03x slower |
pickle_list | 3.06 us | 3.15 us: 1.03x slower |
django_template | 30.2 ms | 31.2 ms: 1.03x slower |
meteor_contest | 80.7 ms | 84.1 ms: 1.04x slower |
pickle_dict | 21.9 us | 23.4 us: 1.07x slower |
logging_simple | 4.84 us | 5.20 us: 1.07x slower |
Geometric mean | (ref) | 1.01x faster |
Benchmark hidden because not significant (9): unpack_sequence, go, raytrace, chameleon, xml_etree_process, fannkuch, sqlite_synth, regex_effbot, unpickle_list
I re-ran the benchmark of d6f5f010b586:
Benchmark | 2022-02-08_11-54-master-69e10976b2e7 | 2022-02-08_11-57-master-d6f5f010b586 |
---|---|---|
pickle_pure_python | 312 us | 281 us: 1.11x faster |
unpickle_pure_python | 238 us | 216 us: 1.10x faster |
pyflate | 380 ms | 349 ms: 1.09x faster |
hexiom | 6.04 ms | 5.55 ms: 1.09x faster |
logging_silent | 97.5 ns | 89.8 ns: 1.09x faster |
float | 63.1 ms | 59.3 ms: 1.07x faster |
html5lib | 62.8 ms | 59.1 ms: 1.06x faster |
crypto_pyaes | 70.1 ms | 66.1 ms: 1.06x faster |
json_loads | 20.6 us | 19.4 us: 1.06x faster |
tornado_http | 90.3 ms | 86.1 ms: 1.05x faster |
mako | 8.85 ms | 8.45 ms: 1.05x faster |
richards | 43.0 ms | 41.1 ms: 1.05x faster |
xml_etree_parse | 126 ms | 120 ms: 1.05x faster |
logging_format | 5.47 us | 5.25 us: 1.04x faster |
sympy_integrate | 17.2 ms | 16.5 ms: 1.04x faster |
sympy_str | 260 ms | 251 ms: 1.04x faster |
fannkuch | 325 ms | 314 ms: 1.04x faster |
regex_v8 | 21.1 ms | 20.4 ms: 1.04x faster |
sympy_expand | 441 ms | 425 ms: 1.04x faster |
regex_compile | 121 ms | 117 ms: 1.03x faster |
sympy_sum | 138 ms | 134 ms: 1.03x faster |
scimark_lu | 106 ms | 103 ms: 1.03x faster |
go | 128 ms | 125 ms: 1.03x faster |
pathlib | 14.1 ms | 13.7 ms: 1.02x faster |
scimark_monte_carlo | 58.8 ms | 57.9 ms: 1.02x faster |
nqueens | 70.9 ms | 69.9 ms: 1.02x faster |
pidigits | 155 ms | 153 ms: 1.01x faster |
pickle | 9.34 us | 9.22 us: 1.01x faster |
raytrace | 278 ms | 275 ms: 1.01x faster |
2to3 | 216 ms | 213 ms: 1.01x faster |
deltablue | 3.60 ms | 3.56 ms: 1.01x faster |
logging_simple | 4.84 us | 4.78 us: 1.01x faster |
xml_etree_iterparse | 82.5 ms | 81.7 ms: 1.01x faster |
regex_dna | 164 ms | 162 ms: 1.01x faster |
unpack_sequence | 32.7 ns | 32.4 ns: 1.01x faster |
telco | 5.53 ms | 5.48 ms: 1.01x faster |
python_startup | 5.56 ms | 5.58 ms: 1.00x slower |
xml_etree_generate | 71.2 ms | 71.6 ms: 1.01x slower |
unpickle_list | 4.08 us | 4.12 us: 1.01x slower |
chameleon | 6.07 ms | 6.14 ms: 1.01x slower |
chaos | 64.6 ms | 65.3 ms: 1.01x slower |
json_dumps | 9.61 ms | 9.75 ms: 1.01x slower |
xml_etree_process | 49.9 ms | 50.7 ms: 1.01x slower |
meteor_contest | 80.7 ms | 82.0 ms: 1.02x slower |
scimark_sparse_mat_mult | 4.05 ms | 4.12 ms: 1.02x slower |
unpickle | 11.3 us | 11.5 us: 1.02x slower |
django_template | 30.2 ms | 31.0 ms: 1.02x slower |
scimark_sor | 103 ms | 106 ms: 1.02x slower |
spectral_norm | 81.4 ms | 84.9 ms: 1.04x slower |
pickle_dict | 21.9 us | 23.5 us: 1.08x slower |
Geometric mean | (ref) | 1.02x faster |
Benchmark hidden because not significant (7): scimark_fft, dulwich_log, python_startup_no_site, regex_effbot, sqlite_synth, nbody, pickle_list
ICC 2021 has full support for stdatomic.h and compiles mimalloc just fine:
$ CC="icc" ./configure -C --with-pydebug
$ make
$ ./python
Python 3.11.0a5+ (main, Feb 9 2022, 15:57:40) [GCC Intel(R) C++ gcc 7.5 mode] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys._malloc_info
sys._malloc_info(allocator='mimalloc_debug', with_pymalloc=True, with_mimalloc=True, mimalloc_secure=4, mimalloc_debug=2)
AIX xlc is still a problem. It does not support C11 stdatomic.h. But it comes with the older GCC atomic memory access __sync function family, https://www.ibm.com/docs/en/xl-c-and-cpp-aix/13.1.3?topic=cbif-gcc-atomic-memory-access-built-in-functions-extension . It might be possible to re-implement mimalloc's atomics with __sync functions (e.g. https://gist.github.com/nhatminhle/5181506). The implementation would be less efficient, though. The __sync functions don't take a memory order: atomic_load_explicit(v) becomes __sync_fetch_and_add(v, 0), and atomic_store_explicit() requires two full memory barriers.
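As a rough illustration of that approach, here is a hedged sketch of emulating the two operations mentioned above with the legacy __sync builtins; the my_atomic_* names are made up for this example and are not mimalloc's actual macros.

```c
#include <stdint.h>

/* __sync has no plain atomic load; a fetch-and-add of 0 acts as a
   (sequentially consistent) load, ignoring any requested memory order. */
static inline uintptr_t my_atomic_load(volatile uintptr_t *p)
{
    return __sync_fetch_and_add(p, (uintptr_t)0);
}

/* __sync has no plain atomic store either, so bracket the write with
   two full memory barriers. */
static inline void my_atomic_store(volatile uintptr_t *p, uintptr_t x)
{
    __sync_synchronize();
    *p = x;
    __sync_synchronize();
}
```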
FYI, PR #31164 indicates the following 3 blockers:
After a conversation between @tiran, @daanx (the mimalloc creator), and me, we have a good idea on how to move forward. I'm looking into the PGO issue (with some help).
The PGO-on-Windows issue may be resolved now (thanks to gh-94790). Here's what I did to check:

1. check out tiran:bpo-46657-mimalloc (PR gh-31164)
2. update the branch:
   a. run `git merge main`
   b. fix Makefile.pre.in
   c. run `git checkout main -- Python/clinic/sysmodule.c.h`
   d. run `python3 Tools/clinic/clinic.py --make`
   e. run `python3 Tools/scripts/generate_global_objects.py`
   f. change `_PyStructSequence_InitType` to `_PyStructSequence_InitBuiltinWithFlags` in Python/sysmodule.c
3. run `.\PCbuild\build.bat --pgo`
@neonene Could you verify that we should expect PGO builds to be working correctly now? Thanks!
Do these steps point towards improvements for @tiran's mimalloc branch?
Only that the branch needs to merge in main.
@ericsnowcurrently Thanks for the check. I can also build the newest mimalloc branch successfully with PGO on Windows.
As for the issue around releasing memory, my builds still show:
>>> from test.support import Py_DEBUG
>>> about32MiB = 1024*1024*32 - (32*Py_DEBUG)
>>> b = [bytearray(about32MiB) for i in range(3)]
>>> del b # 100MB is freed to the OS
>>> b = [bytearray(about32MiB - 1) for i in range(3)]
>>> del b # 100MB remains cached
>>> sys._mi_collect(True) # purge cached segments
The threshold size was `1024*1024*32 - (32*Py_DEBUG) - 32` when I used `bytes()` instead.
> @ericsnowcurrently Thanks for the check. I can also build the newest mimalloc branch successfully with PGO on Windows.
Great! I'll cross it off my list. 😄
> As for the issue around releasing memory, my builds still show:
Yeah, I'm going to follow up with @daanx about this.
Subscribing here to learn how the issue with mimalloc is being resolved. My project uses mimalloc as well, and I am curious to learn how to configure it to have a slimmer memory footprint.
@romange, take a look at https://github.com/python/cpython/pull/31164#issuecomment-1133583811. Other than that, there's only the existing mimalloc documentation, as well as any insight @daanx can offer.
@ericsnowcurrently thanks! btw, I chose mimalloc because of its elegant design and its thread-local heap support. However, it seems that the project is not under active development anymore. Maybe that's for good reasons and everything is perfect, but I can see how it could become a problem in the future (and daanx is a one-person army). What do you think?
Hi @romange, here I am :-) Just to clarify: mimalloc is actively maintained and I am dedicated to making the CPython integration work. However, I did take some needed time off this summer, but I am back now and hope to help resolve the outstanding issues. I have already started work on ASAN support, so I hope that is coming soon (although I am at ICFP'23 next week).
Hi @daanx I am psyched to hear! :) and sorry for quickly jumping to conclusions 🤦🏼♂️
> @ericsnowcurrently Thanks for the check. I can also build the newest mimalloc branch successfully with PGO on Windows.
>
> As for the issue around releasing memory, my builds still show:
>
> >>> from test.support import Py_DEBUG
> >>> about32MiB = 1024*1024*32 - (32*Py_DEBUG)
> >>> b = [bytearray(about32MiB) for i in range(3)]
> >>> del b # 100MB is freed to the OS
> >>> b = [bytearray(about32MiB - 1) for i in range(3)]
> >>> del b # 100MB remains cached
> >>> sys._mi_collect(True) # purge cached segments
>
> The threshold size was `1024*1024*32 - (32*Py_DEBUG) - 32` when I used `bytes()` instead.
Thanks for testing this, just to clarify: when you call `sys._mi_collect(True)`, was there still 100MiB being "cached"?

Generally, when an allocator requests memory from the OS (when allocating the bytearray, for example), it should not immediately free that OS memory when the memory is free'd, just in case the program starts allocating again (which is usually the case). mimalloc does this as well and only releases more unused OS memory after a while.

However, at an interpreter prompt the program is just waiting and may therefore not get an opportunity to do this. In such a case the CPython interpreter should call `sys._mi_collect(True)` to collect outstanding OS memory just before blocking for user input.

(In case you still see increased usage after manually calling this as in your test above, something else may be going on which we should investigate.)
When I tried the following steps on Windows:
b = [bytearray(1024*1024*32 - 1) for i in range(32)]
del b
sys._mi_collect(True)
Then, Task Manager showed the following memory usage for python.exe:

65af44c (PR) | 1. alloc | 2. del | 3. collect | notes |
---|---|---|---|---|
v2.0.6 as-is (options) | 1 GB | 1 GB | 40,932 KB | holds 32MB? |
v2.0.6 decommit_delay = 0 | 1 GB | 8,392 KB | 8,112 KB | |
v2.0.6 page_reset = 1 | 1 GB | 6,480 KB | 8,060 KB | slower than above? |

main | 1. alloc | 2. del |
---|---|---|
pymalloc | 1 GB | 3,760 KB |
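For readers wanting to reproduce the non-default rows above, here is a hedged sketch of setting those two options through mimalloc's C API. The option names are taken from the mimalloc v2.0.x headers and may differ in other releases; this is not how the CPython branch configures mimalloc, and the test above may have used the corresponding MIMALLOC_* environment variables instead.

```c
#include <mimalloc.h>

int main(void)
{
    /* Assumed v2.0.x option names. */
    mi_option_set(mi_option_decommit_delay, 0);  /* release unused OS memory immediately */
    mi_option_set(mi_option_page_reset, 1);      /* eagerly reset freed pages */

    void *p = mi_malloc(1024 * 1024 * 32 - 1);   /* just under the 32 MiB "huge" boundary */
    mi_free(p);
    mi_collect(true);                            /* force-purge cached segments */
    return 0;
}
```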
Thanks for the quick test and explanation!
- I think generally, we should call `mi_collect(true)` in CPython when entering the interpreter prompt just in general so we always clean up as much as we can after each expression evaluation. Is this easy to do?
- `decommit_delay=0` -- this shows that it is indeed the delay heuristic in mimalloc. This is a good thing in general, though, and we should not use `decommit_delay=0` generally as it would slow down the allocator for large computations.
- `page_reset` eagerly resets and is generally the best for lowest heap usage (but also costs a bit as the OS call is slowish). I am thinking that I should probably enable `page_reset` for large block pages, where the cost would be amortized and we would see the largest gains. This would fix your example, I think. However, 32MiB is exactly the "huge" block boundary; can you do the same test with other block sizes, like 64 16MiB arrays and 16 64MiB arrays?
Thanks
> I think generally, we should call `mi_collect(true)` in CPython when entering the interpreter prompt just in general so we always clean up as much as we can after each expression evaluation. Is this easy to do?
Should be easy enough for the REPL, but there are any number of other situations where this cleanup would be expected -- before waiting for a socket operation, before reading from stdin (or any tty device), before sleeping (or select/poll/etc.), and so on.
Maybe it should be done whenever we do a GC run too?
> I think generally, we should call `mi_collect(true)` in CPython when entering the interpreter prompt just in general so we always clean up as much as we can after each expression evaluation. Is this easy to do?
>
> Should be easy enough for the REPL, but there are any number of other situations where this cleanup would be expected -- before waiting for a socket operation, before reading from stdin (or any tty device), before sleeping (or select/poll/etc.), and so on.
>
> Maybe it should be done whenever we do a GC run too?
It is possible to call this at a GC run, but I think it would be too aggressive. Generally the current mimalloc strategies work fine, and `mi_collect` is only to be used in select cases (I have only seen one reasonable case before, for a user-mode thread pool scheduler). For the interpreter we can make such a case, as running each expression is a bit like running a program and one would like to clean up after evaluation -- but doing it generally for blocking calls is not needed; mimalloc runs big services that are full of blocking calls without problems. The benchmark here is also a bit atypical, with just some huge blocks instead of many smaller allocations, where the observed effects will be less. It may well be that we can even fix this by just doing eager decommit for large blocks only, without needing mi_collect at all -- I will look into this.
(I am about to board a plane so won't respond for a while :-) )
So maybe this whole "mimalloc doesn't release memory to the system" is a non-issue? Calling `mi_collect()` before entering the REPL only "works" because, to demonstrate the problem, people will do things like start an interactive interpreter, run a bit of code that allocates a huge block, free it again, and then in a separate shell run "ps" to observe the process size, while keeping the python process waiting at the next `>>>` prompt.
What I think you're saying is if instead they would just continue to run some other code (that doesn't reuse the huge block that was just freed) the excess memory will be returned to the system after some time (how is that measured? wall clock time? number of alloc/free ops?), and the observation that the huge block is not returned to the system is a mere artifact of the way people try to demonstrate the issue.
> However, the 32MiB is exactly the "huge" block boundary, can you do the same test with other block sizes, like 64 16MiB arrays, and 16 64MiB arrays?
@daanx This example holds 168 MiB, and the same goes for the reversed arrays:
[bytearray(1024*1024 * [64,16][i%2]) for i in range(20)] # 16MiB blocks are cached?
EDIT: Sorry, 64 16MiB arrays reserve 1GiB, 16 64MiB arrays do not.
Also, the cases below return about 1GiB to the OS with `mi_collect()`:
[bytearray(1024*1024) for i in range(1024)]
[bytearray(1024*512) for i in range(1024*2)]
[bytearray(1024*256) for i in range(1024*4)]
[bytearray(1024*128) for i in range(1024*8)]
[bytearray(1024*64) for i in range(1024*16)]
[bytearray(1024*32) for i in range(1024*32)]
[bytearray(1024*16) for i in range(1024*64)]
[bytearray(1024*8) for i in range(1024*128)]
What I'm trying to do is simplify `test_fstring.test_many_expressions`.
Thanks.
(back from ICFP)
Guido wrote:

> So maybe this whole "mimalloc doesn't release memory to the system" is a non-issue? Calling `mi_collect()` before entering the REPL only "works" because, to demonstrate the problem, people will do things like start an interactive interpreter, run a bit of code that allocates a huge block, free it again, and then in a separate shell run "ps" to observe the process size, while keeping the python process waiting at the next `>>>` prompt.
>
> What I think you're saying is if instead they would just continue to run some other code (that doesn't reuse the huge block that was just freed) the excess memory will be returned to the system after some time (how is that measured? wall clock time? number of alloc/free ops?), and the observation that the huge block is not returned to the system is a mere artifact of the way people try to demonstrate the issue.
Yes, this is exactly right! Since reserving and releasing OS memory is so expensive, we delay the release a bit as the memory may be needed soon enough. The current delay is in wall clock time but fairly short (I think 500ms by default).
When someone runs an expression from the REPL, it will actually always be the case that we no longer require any OS memory to be retained afterwards (it is like running a separate program). So, in this particular case it is actually beneficial to call `mi_collect(true)` after evaluation (but again, calling `mi_collect` in general is not).
> Also, the cases below return about 1GiB to the OS with `mi_collect()`:
Thanks for testing! When I get back home I will set things up so I can directly test these myself as well. I tried this also directly and verified that it wasn't something specific to huge objects so that's good.
> [...] There is also the long-term maintenance issue. So far, mimalloc upstream has been responsive. The mimalloc code is not so huge or complicated that we couldn't maintain it (if for some reason it gets abandoned upstream). However, I think we would prefer to maintain obmalloc rather than mimalloc, all else being equal. Abandonment by the upstream seems fairly unlikely. So, I'm not too concerned about maintenance.
I made a similar point wrt the bundling at https://github.com/python/cpython/pull/109914#issuecomment-1784315022.
I think this issue could be closed now that mimalloc has already landed in HEAD.
See follow-up issue:
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['interpreter-core', 'type-feature', '3.11']
title = 'Add mimalloc memory allocator'
updated_at =
user = 'https://github.com/tiran'
```
bugs.python.org fields:
```python
activity =
actor = 'h-vetinari'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Interpreter Core']
creation =
creator = 'christian.heimes'
dependencies = []
files = []
hgrepos = []
issue_num = 46657
keywords = ['patch']
message_count = 12.0
messages = ['412639', '412641', '412645', '412654', '412669', '412679', '412741', '412749', '412796', '412835', '412867', '412986']
nosy_count = 5.0
nosy_names = ['nascheme', 'christian.heimes', 'corona10', 'erlendaasland', 'h-vetinari']
pr_nums = ['31164']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue46657'
versions = ['Python 3.11']
```
Linked PRs