Closed: tiran closed this issue 10 months ago.
From https://github.com/microsoft/mimalloc
mimalloc (pronounced "me-malloc") is a general purpose allocator with excellent performance characteristics. Initially developed by Daan Leijen for the run-time systems of the Koka and Lean languages.
mimalloc has several interesting properties that make it useful for CPython. Among other things, it is fast, thread-safe, and NUMA-aware. It has built-in free lists with multi-sharding and allocation heaps. While Python's obmalloc requires the GIL to protect its data structures, mimalloc mostly uses thread-local data and atomic instructions (compare-and-swap) for efficiency. Sam Gross' nogil relies on mimalloc's thread safety and uses first-class heaps for heap-walking GC.
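As an illustration of what "thread-local data and atomic instructions" means in practice, here is a minimal C11 sketch (purely illustrative, not mimalloc's actual code) of a lock-free compare-and-swap push that lets an allocator accept frees from other threads without taking a lock; `block_t`, `page_t`, and `remote_free` are made-up names for the example.

```c
#include <stdatomic.h>

typedef struct block_s {
    struct block_s *next;
} block_t;

typedef struct {
    _Atomic(block_t *) thread_free;  /* blocks freed by threads that do not own the page */
} page_t;

/* Push a freed block onto the page's thread-free list without taking a lock. */
static void remote_free(page_t *page, block_t *block)
{
    block_t *head = atomic_load_explicit(&page->thread_free, memory_order_relaxed);
    do {
        block->next = head;  /* link the block in front of the current head */
    } while (!atomic_compare_exchange_weak_explicit(
                 &page->thread_free, &head, block,
                 memory_order_release, memory_order_relaxed));
}
```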
mimalloc works on the majority of platforms and CPU architectures. However, it requires a compiler with C11 atomics support. CentOS 7's default GCC is slightly too old; a more recent GCC from Developer Toolset is required.
For 3.11 I plan to integrate mimalloc as an optional drop-in replacement for obmalloc. Users will be able to compile CPython without mimalloc or disable mimalloc with the PYTHONMALLOC environment variable. Since mimalloc will be optional in 3.11, Python won't depend on or expose any of the advanced features yet. This approach enables the community to test and give feedback with minimal risk of breakage.
mimalloc sources will be vendored without any option to use system libraries. Python's mimalloc requires several non-standard compile-time flags. In the future Python may extend or modify mimalloc for heap walking and nogil, too.
(This is a tracking bug until I find time to finish a PEP.)
I'm adding Neil to the nosy list since he is one of the people who kicked off this amazing work :)
Buildbots "PPC64 Fedora PR" and all RHEL 7 build bots provided by David Edelsohn are failing because compiler is missing support for stdatomic.h.
Thanks, I'm indeed interested. Most credit goes to Christian for advancing this.
For the missing stdatomic.h, would it be appropriate to have an autoconf check for it? We could just disable mimalloc if it doesn't exist.
We have an autoconf check for stdatomic.h. The test even verifies that a program with atomic_load_explicit() compiles and links.
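A probe along the following lines is the kind of program such a check compiles and links (a minimal sketch only; the actual test body in configure.ac may differ):

```c
#include <stdatomic.h>

int main(void)
{
    atomic_int value = 0;
    atomic_store_explicit(&value, 1, memory_order_release);
    return atomic_load_explicit(&value, memory_order_acquire);
}
```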
How do we want to use mimalloc in the future? Is it going to stay optional in 3.12? Then the default setting for --with-mimalloc should depend on the presence of stdatomic.h. Do we want to make it mandatory for GC heap walking and nogil? Then --with-mimalloc should default to "yes" and configure should abort when stdatomic.h is missing.
I'm leaning towards --with-mimalloc=yes. It will make users aware that they need a compiler with atomics:
configure: error: --with-mimalloc requires stdatomic.h. Update your compiler or rebuild with --without-mimalloc. Python 3.12 will require stdatomic.
ICC might be a problem. Apparently some versions have an incomplete stdatomic.h; see bpo-37415.
My preference would be for --with-mimalloc=yes in an upcoming release. For platforms without the required stdatomic.h stuff, they can manually specify --with-mimalloc=no. That will make them aware that a future release of Python might no longer build (if mimalloc is no longer optional).
A soft landing for merging nogil is not a good enough reason to merge mimalloc, IMHO. nogil may never be merged. There should be some concrete and immediate advantage to switching to mimalloc. The idea of using "heap walking" to improve the cyclic GC is not concrete enough. It's just an idea at this point.
I think the (small) performance win could be enough of a reason to merge. This seems to be the most recent benchmark:
https://gist.github.com/pablogsal/8027937b71cd30f17aaaa5ef7c885d3e
There is also the long-term maintenance issue. So far, mimalloc upstream has been responsive. The mimalloc code is not so huge or complicated that we couldn't maintain it (if for some reason it gets abandoned upstream). However, I think we would prefer to maintain obmalloc rather than mimalloc, all else being equal. Abandonment by the upstream seems fairly unlikely. So, I'm not too concerned about maintenance.
New benchmark:
Benchmark | 2022-02-08_11-54-master-69e10976b2e7 | 2022-02-08_11-57-master-d6f5f010b586 |
---|---|---|
mako | 8.85 ms | 7.83 ms: 1.13x faster |
hexiom | 6.04 ms | 5.54 ms: 1.09x faster |
spectral_norm | 81.4 ms | 75.2 ms: 1.08x faster |
pyflate | 380 ms | 352 ms: 1.08x faster |
scimark_sparse_mat_mult | 4.05 ms | 3.76 ms: 1.08x faster |
pickle_pure_python | 312 us | 290 us: 1.07x faster |
unpickle_pure_python | 238 us | 222 us: 1.07x faster |
float | 63.1 ms | 59.5 ms: 1.06x faster |
tornado_http | 90.3 ms | 86.0 ms: 1.05x faster |
html5lib | 62.8 ms | 60.2 ms: 1.04x faster |
regex_compile | 121 ms | 116 ms: 1.04x faster |
scimark_lu | 106 ms | 102 ms: 1.04x faster |
nqueens | 70.9 ms | 68.4 ms: 1.04x faster |
crypto_pyaes | 70.1 ms | 67.8 ms: 1.03x faster |
logging_silent | 97.5 ns | 94.4 ns: 1.03x faster |
sympy_integrate | 17.2 ms | 16.7 ms: 1.03x faster |
sympy_str | 260 ms | 252 ms: 1.03x faster |
sympy_expand | 441 ms | 427 ms: 1.03x faster |
pathlib | 14.1 ms | 13.7 ms: 1.03x faster |
regex_dna | 164 ms | 159 ms: 1.03x faster |
regex_v8 | 21.1 ms | 20.6 ms: 1.02x faster |
sympy_sum | 138 ms | 136 ms: 1.02x faster |
scimark_fft | 286 ms | 281 ms: 1.02x faster |
pickle | 9.34 us | 9.19 us: 1.02x faster |
xml_etree_parse | 126 ms | 124 ms: 1.01x faster |
richards | 43.0 ms | 42.4 ms: 1.01x faster |
xml_etree_generate | 71.2 ms | 70.5 ms: 1.01x faster |
scimark_monte_carlo | 58.8 ms | 58.3 ms: 1.01x faster |
deltablue | 3.60 ms | 3.58 ms: 1.01x faster |
chaos | 64.6 ms | 64.3 ms: 1.01x faster |
2to3 | 216 ms | 215 ms: 1.00x faster |
pidigits | 155 ms | 154 ms: 1.00x faster |
nbody | 76.4 ms | 77.0 ms: 1.01x slower |
python_startup_no_site | 3.96 ms | 3.99 ms: 1.01x slower |
xml_etree_iterparse | 82.5 ms | 83.1 ms: 1.01x slower |
scimark_sor | 103 ms | 104 ms: 1.01x slower |
unpickle | 11.3 us | 11.4 us: 1.01x slower |
telco | 5.53 ms | 5.58 ms: 1.01x slower |
python_startup | 5.56 ms | 5.62 ms: 1.01x slower |
json_loads | 20.6 us | 20.8 us: 1.01x slower |
json_dumps | 9.61 ms | 9.77 ms: 1.02x slower |
dulwich_log | 60.9 ms | 62.1 ms: 1.02x slower |
logging_format | 5.47 us | 5.62 us: 1.03x slower |
pickle_list | 3.06 us | 3.15 us: 1.03x slower |
django_template | 30.2 ms | 31.2 ms: 1.03x slower |
meteor_contest | 80.7 ms | 84.1 ms: 1.04x slower |
pickle_dict | 21.9 us | 23.4 us: 1.07x slower |
logging_simple | 4.84 us | 5.20 us: 1.07x slower |
Geometric mean | (ref) | 1.01x faster |
Benchmark hidden because not significant (9): unpack_sequence, go, raytrace, chameleon, xml_etree_process, fannkuch, sqlite_synth, regex_effbot, unpickle_list
I re-ran the benchmark of d6f5f010b586:
Benchmark | 2022-02-08_11-54-master-69e10976b2e7 | 2022-02-08_11-57-master-d6f5f010b586 |
---|---|---|
pickle_pure_python | 312 us | 281 us: 1.11x faster |
unpickle_pure_python | 238 us | 216 us: 1.10x faster |
pyflate | 380 ms | 349 ms: 1.09x faster |
hexiom | 6.04 ms | 5.55 ms: 1.09x faster |
logging_silent | 97.5 ns | 89.8 ns: 1.09x faster |
float | 63.1 ms | 59.3 ms: 1.07x faster |
html5lib | 62.8 ms | 59.1 ms: 1.06x faster |
crypto_pyaes | 70.1 ms | 66.1 ms: 1.06x faster |
json_loads | 20.6 us | 19.4 us: 1.06x faster |
tornado_http | 90.3 ms | 86.1 ms: 1.05x faster |
mako | 8.85 ms | 8.45 ms: 1.05x faster |
richards | 43.0 ms | 41.1 ms: 1.05x faster |
xml_etree_parse | 126 ms | 120 ms: 1.05x faster |
logging_format | 5.47 us | 5.25 us: 1.04x faster |
sympy_integrate | 17.2 ms | 16.5 ms: 1.04x faster |
sympy_str | 260 ms | 251 ms: 1.04x faster |
fannkuch | 325 ms | 314 ms: 1.04x faster |
regex_v8 | 21.1 ms | 20.4 ms: 1.04x faster |
sympy_expand | 441 ms | 425 ms: 1.04x faster |
regex_compile | 121 ms | 117 ms: 1.03x faster |
sympy_sum | 138 ms | 134 ms: 1.03x faster |
scimark_lu | 106 ms | 103 ms: 1.03x faster |
go | 128 ms | 125 ms: 1.03x faster |
pathlib | 14.1 ms | 13.7 ms: 1.02x faster |
scimark_monte_carlo | 58.8 ms | 57.9 ms: 1.02x faster |
nqueens | 70.9 ms | 69.9 ms: 1.02x faster |
pidigits | 155 ms | 153 ms: 1.01x faster |
pickle | 9.34 us | 9.22 us: 1.01x faster |
raytrace | 278 ms | 275 ms: 1.01x faster |
2to3 | 216 ms | 213 ms: 1.01x faster |
deltablue | 3.60 ms | 3.56 ms: 1.01x faster |
logging_simple | 4.84 us | 4.78 us: 1.01x faster |
xml_etree_iterparse | 82.5 ms | 81.7 ms: 1.01x faster |
regex_dna | 164 ms | 162 ms: 1.01x faster |
unpack_sequence | 32.7 ns | 32.4 ns: 1.01x faster |
telco | 5.53 ms | 5.48 ms: 1.01x faster |
python_startup | 5.56 ms | 5.58 ms: 1.00x slower |
xml_etree_generate | 71.2 ms | 71.6 ms: 1.01x slower |
unpickle_list | 4.08 us | 4.12 us: 1.01x slower |
chameleon | 6.07 ms | 6.14 ms: 1.01x slower |
chaos | 64.6 ms | 65.3 ms: 1.01x slower |
json_dumps | 9.61 ms | 9.75 ms: 1.01x slower |
xml_etree_process | 49.9 ms | 50.7 ms: 1.01x slower |
meteor_contest | 80.7 ms | 82.0 ms: 1.02x slower |
scimark_sparse_mat_mult | 4.05 ms | 4.12 ms: 1.02x slower |
unpickle | 11.3 us | 11.5 us: 1.02x slower |
django_template | 30.2 ms | 31.0 ms: 1.02x slower |
scimark_sor | 103 ms | 106 ms: 1.02x slower |
spectral_norm | 81.4 ms | 84.9 ms: 1.04x slower |
pickle_dict | 21.9 us | 23.5 us: 1.08x slower |
Geometric mean | (ref) | 1.02x faster |
Benchmark hidden because not significant (7): scimark_fft, dulwich_log, python_startup_no_site, regex_effbot, sqlite_synth, nbody, pickle_list
ICC 2021 has full support for stdatomic.h and compiles mimalloc just fine:
$ CC="icc" ./configure -C --with-pydebug
$ make
$ ./python
Python 3.11.0a5+ (main, Feb 9 2022, 15:57:40) [GCC Intel(R) C++ gcc 7.5 mode] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys._malloc_info
sys._malloc_info(allocator='mimalloc_debug', with_pymalloc=True, with_mimalloc=True, mimalloc_secure=4, mimalloc_debug=2)
AIX xlc is still a problem. It does not support C11 stdatomic.h. But it comes with the older GCC atomic memory access __sync function family, https://www.ibm.com/docs/en/xl-c-and-cpp-aix/13.1.3?topic=cbif-gcc-atomic-memory-access-built-in-functions-extension . It might be possible to re-implement mimalloc's atomics with __sync functions (e.g. https://gist.github.com/nhatminhle/5181506). The implementation would be less efficient, though. The __sync functions don't take a memory order: atomic_load_explicit(v) becomes __sync_fetch_and_add(v, 0), and atomic_store_explicit() requires two full memory barriers.
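As a rough illustration of that approach, here is a hedged sketch of emulating the two operations mentioned above with the legacy __sync builtins; the my_atomic_* names are made up for this example and are not mimalloc's actual macros.

```c
#include <stdint.h>

/* __sync has no plain atomic load; a fetch-and-add of 0 acts as a
   (sequentially consistent) load, ignoring any requested memory order. */
static inline uintptr_t my_atomic_load(volatile uintptr_t *p)
{
    return __sync_fetch_and_add(p, (uintptr_t)0);
}

/* __sync has no plain atomic store either, so bracket the write with
   two full memory barriers. */
static inline void my_atomic_store(volatile uintptr_t *p, uintptr_t x)
{
    __sync_synchronize();
    *p = x;
    __sync_synchronize();
}
```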
FYI, PR #31164 indicates the following 3 blockers:
After a conversation between @tiran, @daanx (the mimalloc creator), and me, we have a good idea on how to move forward. I'm looking into the PGO issue (with some help).
The PGO-on-Windows issue may be resolved now (thanks to gh-94790). Here's what I did to check:

1. check out tiran:bpo-46657-mimalloc (PR gh-31164)
2. update the branch:
   a. run `git merge main`
   b. fix Makefile.pre.in
   c. run `git checkout main -- Python/clinic/sysmodule.c.h`
   d. run `python3 Tools/clinic/clinic.py --make`
   e. run `python3 Tools/scripts/generate_global_objects.py`
   f. change `_PyStructSequence_InitType` to `_PyStructSequence_InitBuiltinWithFlags` in Python/sysmodule.c
3. run `.\PCbuild\build.bat --pgo`
@neonene Could you verify that we should expect PGO builds to be working correctly now? Thanks!
Do these steps point towards improvements for @tiran's mimalloc branch?
Only that the branch needs to merge in main.
@ericsnowcurrently Thanks for the check. I can also build the newest mimalloc branch successfully with PGO on Windows.
As for the issue around releasing memory, my builds still show:
>>> from test.support import Py_DEBUG
>>> about32MiB = 1024*1024*32 - (32*Py_DEBUG)
>>> b = [bytearray(about32MiB) for i in range(3)]
>>> del b # 100MB is freed to the OS
>>> b = [bytearray(about32MiB - 1) for i in range(3)]
>>> del b # 100MB remains cached
>>> sys._mi_collect(True) # purge cached segments
The threshold size was `1024*1024*32 - (32*Py_DEBUG) - 32` when I used `bytes()` instead.
> @ericsnowcurrently Thanks for the check. I can also build the newest mimalloc branch successfully with PGO on Windows.
Great! I'll cross it off my list. 😄
> As for the issue around releasing memory, my builds still show:
Yeah, I'm going to follow up with @daanx about this.
Subscribing here to learn how the issue with mimalloc is being resolved. My project uses mimalloc as well, and I am curious to learn how to configure it to have a slimmer memory footprint.
@romange, take a look at https://github.com/python/cpython/pull/31164#issuecomment-1133583811. Other than that, there's only the existing mimalloc documentation, as well as any insight @daanx can offer.
@ericsnowcurrently thanks! btw, I chose mimalloc because of its elegant design and its thread-local heap support. However, it seems that the project is not under active development anymore. Maybe that's for good reasons and everything is perfect, but I can see how it could become a problem in the future (and daanx is a one-person army). What do you think?
Hi @romange, here I am :-) Just to clarify: mimalloc is actively maintained and I am dedicated to making the CPython integration work. However, I did take some needed time off this summer, but I am back now and hope to help resolve the outstanding issues. I have already started work on ASAN support, so I hope that is coming soon (although I am at ICFP'23 next week).
Hi @daanx I am psyched to hear! :) and sorry for quickly jumping to conclusions 🤦🏼♂️
> @ericsnowcurrently Thanks for the check. I can also build the newest mimalloc branch successfully with PGO on Windows.
>
> As for the issue around releasing memory, my builds still show:
>
> >>> from test.support import Py_DEBUG
> >>> about32MiB = 1024*1024*32 - (32*Py_DEBUG)
> >>> b = [bytearray(about32MiB) for i in range(3)]
> >>> del b # 100MB is freed to the OS
> >>> b = [bytearray(about32MiB - 1) for i in range(3)]
> >>> del b # 100MB remains cached
> >>> sys._mi_collect(True) # purge cached segments
>
> The threshold size was `1024*1024*32 - (32*Py_DEBUG) - 32` when I used `bytes()` instead.
Thanks for testing this, just to clarify: when you call `sys._mi_collect(True)`, was there still 100MiB being "cached"?

Generally, when an allocator requests memory from the OS (when allocating the bytearray, for example), it should not immediately free that OS memory when the memory is free'd, just in case the program starts allocating again (which is usually the case). mimalloc does this as well and only releases more unused OS memory after a while.

However, at an interpreter prompt the program is just waiting and may therefore not get an opportunity to do this. In such a case the CPython interpreter should call `sys._mi_collect(True)` to collect outstanding OS memory just before blocking for user input.

(In case you still see increased usage after manually calling this as in your test above, something else may be going on which we should investigate.)
When I tried the following steps on Windows:
b = [bytearray(1024*1024*32 - 1) for i in range(32)]
del b
sys._mi_collect(True)
Then, Task Manager showed the following memory usage for python.exe:

65af44c (PR) | 1. alloc | 2. del | 3. collect | notes |
---|---|---|---|---|
v2.0.6 as-is (options) | 1 GB | 1 GB | 40,932 KB | holds 32MB? |
v2.0.6 decommit_delay = 0 | 1 GB | 8,392 KB | 8,112 KB | |
v2.0.6 page_reset = 1 | 1 GB | 6,480 KB | 8,060 KB | slower than above? |

main | 1. alloc | 2. del |
---|---|---|
pymalloc | 1 GB | 3,760 KB |
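For readers wanting to reproduce the non-default rows above, here is a hedged sketch of setting those two options through mimalloc's C API. The option names are taken from the mimalloc v2.0.x headers and may differ in other releases; this is not how the CPython branch configures mimalloc, and the test above may have used the corresponding MIMALLOC_* environment variables instead.

```c
#include <mimalloc.h>

int main(void)
{
    /* Assumed v2.0.x option names. */
    mi_option_set(mi_option_decommit_delay, 0);  /* release unused OS memory immediately */
    mi_option_set(mi_option_page_reset, 1);      /* eagerly reset freed pages */

    void *p = mi_malloc(1024 * 1024 * 32 - 1);   /* just under the 32 MiB "huge" boundary */
    mi_free(p);
    mi_collect(true);                            /* force-purge cached segments */
    return 0;
}
```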
Thanks for the quick test and explanation!
- I think generally, we should call `mi_collect(true)` in CPython when entering the interpreter prompt just in general so we always clean up as much as we can after each expression evaluation. Is this easy to do?
- `decommit_delay=0` -- this shows that it is indeed the delay heuristic in mimalloc. This is a good thing in general, though, and we should not use `decommit_delay=0` generally as it would slow down the allocator for large computations.
- `page_reset` eagerly resets and is generally the best for lowest heap usage (but also costs a bit as the OS call is slowish). I am thinking that I should probably enable `page_reset` for large block pages, where the cost would be amortized and we would see the largest gains. This would fix your example, I think. However, 32MiB is exactly the "huge" block boundary; can you do the same test with other block sizes, like 64 16MiB arrays and 16 64MiB arrays?
Thanks
> I think generally, we should call `mi_collect(true)` in CPython when entering the interpreter prompt just in general so we always clean up as much as we can after each expression evaluation. Is this easy to do?
Should be easy enough for the REPL, but there are any number of other situations where this cleanup would be expected -- before waiting for a socket operation, before reading from stdin (or any tty device), before sleeping (or select/poll/etc.), and so on.
Maybe it should be done whenever we do a GC run too?
> I think generally, we should call `mi_collect(true)` in CPython when entering the interpreter prompt just in general so we always clean up as much as we can after each expression evaluation. Is this easy to do?
>
> Should be easy enough for the REPL, but there are any number of other situations where this cleanup would be expected -- before waiting for a socket operation, before reading from stdin (or any tty device), before sleeping (or select/poll/etc.), and so on.
>
> Maybe it should be done whenever we do a GC run too?
It is possible to call this at a GC run, but I think it would be too aggressive. Generally the current mimalloc strategies work fine, and `mi_collect` is only to be used in select cases (I have only seen one reasonable case before, for a user-mode thread pool scheduler). For the interpreter we can make such a case, as running each expression is a bit like running a program and one would like to clean up after evaluation -- but doing it generally for blocking calls is not needed; mimalloc runs big services that are full of blocking calls without problems. The benchmark here is also a bit atypical, with just some huge blocks instead of many smaller allocations, where the observed effects will be less. It may well be that we can even fix this by just doing eager decommit for large blocks only, without needing mi_collect at all -- I will look into this.
(I am about to board a plane so won't respond for a while :-) )
So maybe this whole "mimalloc doesn't release memory to the system" is a non-issue? Calling `mi_collect()` before entering the REPL only "works" because, to demonstrate the problem, people will do things like start an interactive interpreter, run a bit of code that allocates a huge block, free it again, and then in a separate shell run "ps" to observe the process size, while keeping the python process waiting at the next `>>>` prompt.
What I think you're saying is if instead they would just continue to run some other code (that doesn't reuse the huge block that was just freed) the excess memory will be returned to the system after some time (how is that measured? wall clock time? number of alloc/free ops?), and the observation that the huge block is not returned to the system is a mere artifact of the way people try to demonstrate the issue.
> However, the 32MiB is exactly the "huge" block boundary, can you do the same test with other block sizes, like 64 16MiB arrays, and 16 64MiB arrays?
@daanx This example holds 168 MiB, and the same goes for the reversed arrays:
[bytearray(1024*1024 * [64,16][i%2]) for i in range(20)] # 16MiB blocks are cached?
EDIT: Sorry, 64 16MiB arrays reserve 1GiB, 16 64MiB arrays do not.
Also, the cases below return about 1GiB to the OS with `mi_collect()`:
[bytearray(1024*1024) for i in range(1024)]
[bytearray(1024*512) for i in range(1024*2)]
[bytearray(1024*256) for i in range(1024*4)]
[bytearray(1024*128) for i in range(1024*8)]
[bytearray(1024*64) for i in range(1024*16)]
[bytearray(1024*32) for i in range(1024*32)]
[bytearray(1024*16) for i in range(1024*64)]
[bytearray(1024*8) for i in range(1024*128)]
What I'm trying to do is simplify `test_fstring.test_many_expressions`.
Thanks.
(back from ICFP)
Guido wrote:

> So maybe this whole "mimalloc doesn't release memory to the system" is a non-issue? Calling `mi_collect()` before entering the REPL only "works" because, to demonstrate the problem, people will do things like start an interactive interpreter, run a bit of code that allocates a huge block, free it again, and then in a separate shell run "ps" to observe the process size, while keeping the python process waiting at the next `>>>` prompt.
>
> What I think you're saying is if instead they would just continue to run some other code (that doesn't reuse the huge block that was just freed) the excess memory will be returned to the system after some time (how is that measured? wall clock time? number of alloc/free ops?), and the observation that the huge block is not returned to the system is a mere artifact of the way people try to demonstrate the issue.
Yes, this is exactly right! Since reserving and releasing OS memory is so expensive, we delay the release a bit as the memory may be needed soon enough. The current delay is in wall clock time but fairly short (I think 500ms by default).
When someone runs an expression from the REPL, it will actually always be the case that we no longer require any OS memory to be retained afterwards (it is like running a separate program). So, in this particular case it is actually beneficial to call `mi_collect(true)` after evaluation (but again, calling `mi_collect` in general is not).
> Also, the cases below return about 1GiB to the OS with `mi_collect()`:
Thanks for testing! When I get back home I will set things up so I can directly test these myself as well. I tried this also directly and verified that it wasn't something specific to huge objects so that's good.
> [...] There is also the long-term maintenance issue. So far, mimalloc upstream has been responsive. The mimalloc code is not so huge or complicated that we couldn't maintain it (if for some reason it gets abandoned upstream). However, I think we would prefer to maintain obmalloc rather than mimalloc, all else being equal. Abandonment by the upstream seems fairly unlikely. So, I'm not too concerned about maintenance.
I made a similar point wrt the bundling at https://github.com/python/cpython/pull/109914#issuecomment-1784315022.
I think this issue could be closed now that mimalloc has already landed in HEAD.
See follow-up issue:
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['interpreter-core', 'type-feature', '3.11']
title = 'Add mimalloc memory allocator'
updated_at =
user = 'https://github.com/tiran'
```
bugs.python.org fields:
```python
activity =
actor = 'h-vetinari'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Interpreter Core']
creation =
creator = 'christian.heimes'
dependencies = []
files = []
hgrepos = []
issue_num = 46657
keywords = ['patch']
message_count = 12.0
messages = ['412639', '412641', '412645', '412654', '412669', '412679', '412741', '412749', '412796', '412835', '412867', '412986']
nosy_count = 5.0
nosy_names = ['nascheme', 'christian.heimes', 'corona10', 'erlendaasland', 'h-vetinari']
pr_nums = ['31164']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue46657'
versions = ['Python 3.11']
```
Linked PRs