python / cpython

The Python programming language
https://www.python.org

Customized malloc implementation on SunOS and AIX #47776

Closed cf2aeb3c-f43f-407a-8d37-30a2b5033222 closed 11 years ago

cf2aeb3c-f43f-407a-8d37-30a2b5033222 commented 16 years ago
BPO 3526
Nosy @tim-one, @loewis, @pitrou
Files
  • customized_malloc_SUN.pdf
  • customized_malloc_AIX.pdf
  • patch_dlmalloc.diff
  • patch_dlmalloc2.diff
  • patch_dlmalloc3.diff
  • patch_dlmalloc_Python_2_7_1.diff: Patch to use dlmalloc in Python - updated for Python 2.7.1 and to only use mmap
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    ```python
    assignee = None
    closed_at =
    created_at =
    labels = ['interpreter-core', 'performance']
    title = 'Customized malloc implementation on SunOS and AIX'
    updated_at =
    user = 'https://bugs.python.org/sable'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date =
    closer = 'pitrou'
    components = ['Interpreter Core']
    creation =
    creator = 'sable'
    dependencies = []
    files = ['11082', '11083', '11084', '11445', '11459', '22698']
    hgrepos = []
    issue_num = 3526
    keywords = ['patch']
    message_count = 38.0
    messages = ['70897', '70908', '70920', '70929', '70940', '70945', '72382', '72750', '72758', '72761', '72762', '72876', '72975', '110893', '111255', '115620', '134330', '134470', '134485', '134489', '134491', '134495', '134774', '134775', '134777', '134780', '134783', '134785', '134794', '134808', '134810', '135130', '135148', '140678', '140681', '140682', '141177', '194939']
    nosy_count = 7.0
    nosy_names = ['tim.peters', 'loewis', 'pitrou', 'sable', 'flub', 'neologix', 'BreamoreBoy']
    pr_nums = []
    priority = 'normal'
    resolution = 'wont fix'
    stage = None
    status = 'closed'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue3526'
    versions = ['Python 3.1', 'Python 2.7', 'Python 3.2']
    ```

    cf2aeb3c-f43f-407a-8d37-30a2b5033222 commented 16 years ago

    Hi,

    We run a big application mostly written in Python (with Pyrex/C extensions) on different systems including Linux, SunOS and AIX.

    The memory footprint of our application on Linux is fine; however we found that on AIX and SunOS, any memory that has been allocated by our application at some stage will never be freed at the system level.

    After doing some analysis (see the 2 attached pdf documents), we found that this is linked to the implementation of malloc on those various systems:

    The malloc used on Linux (glibc) is based on dlmalloc as described in this document: http://g.oswego.edu/dl/html/malloc.html

    This implementation will use sbrk to allocate small chunks of memory, but it will use mmap to allocate big chunks. This ensures that the memory will actually get freed when free is called.

    AIX and Sun have a more naive malloc implementation: memory allocated by an application through malloc is never actually returned to the system until the application exits (this behavior was confirmed by experts at IBM and Sun when we asked them for feedback on this problem; there is a 'memory disclaim' option on AIX, but it is disabled by default because it brings major performance penalties).

    For long running Python applications which may allocate a lot of memory at some stage, this is a major drawback.

    In order to bypass this limitation of the system on AIX and SunOS, we have modified Python so that it will use the customized malloc implementation dlmalloc like in glibc (see attached patch) - dlmalloc is released in the public domain.

    This patch adds an --enable-dlmalloc option to configure. When it is activated, we observed a dramatic reduction in the memory used by our application. I think many AIX and SunOS Python users would be interested in such an improvement.

    -- Sébastien Sablé Sungard

    pitrou commented 16 years ago

    This is very interesting, although it should probably go through discussion on python-dev since it involves integrating a big chunk of external code.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 16 years ago

    I cannot quite see why the problem is serious: even though the memory is not returned to the system, it will be swapped out to the swap file, so it doesn't consume any real memory (just swap space).

    I don't think Python should integrate a separate malloc implementation. Instead, Python's own memory allocator (obmalloc) should be changed to directly use the virtual memory interfaces of the operating system (i.e. mmap), bypassing the malloc of the C library.

    So I'm -1 on this patch.

    pitrou commented 16 years ago

    On Friday 8 August 2008 at 22:46 +0000, Martin v. Löwis wrote:

    Instead, Python's own memory allocate (obmalloc) should be changed to directly use the virtual memory interfaces of the operating system (i.e. mmap), bypassing the malloc of the C library.

    How would that interact with fork()?

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 16 years ago

    Instead, Python's own memory allocator (obmalloc) should be changed to directly use the virtual memory interfaces of the operating system (i.e. mmap), bypassing the malloc of the C library.

    How would that interact with fork()?

    Nicely, why do you ask? Any anonymous mapping will be copied (typically COW) to the child process, in fact, malloc itself uses anonymous mapping (at least on Linux).

    pitrou commented 16 years ago

    On Saturday 9 August 2008 at 17:28 +0000, Martin v. Löwis wrote:


    >> Instead, Python's own memory allocator (obmalloc) should be changed to directly use the virtual memory interfaces of the operating system (i.e. mmap), bypassing the malloc of the C library.
    >
    > How would that interact with fork()?

    Nicely, why do you ask?

    Because I didn't know :) But looking at the dlmalloc implementation bundled in the patch, it seems that using mmap/munmap (or VirtualAlloc/VirtualFree under Windows) should be ok.

    Do you think we should create a separate issue for this improvement? It could also solve bpo-3531.

    cf2aeb3c-f43f-407a-8d37-30a2b5033222 commented 15 years ago

    [sorry for the late reply, I have been on holidays]

    Martin: you are right that this memory is moved to swap and does not consume any "real" memory; however, we decided to work on this patch because we observed some performance degradation in our application due to this memory not being deallocated correctly.

    Since then we have done some quite extensive tests (with the help of a consultant at Sun): they have shown that this unnecessary swapping has a noticeable impact on performance and, at worst, when system memory is saturated, can bring a server to its knees for several minutes (we're talking about top-of-the-line SunOS and AIX servers with hundreds of GB of memory).

    I will write a complete document explaining the tests and observations that we did, but this memory issue was critical for us given the performance degradation it was causing on our production servers.

    Concerning dlmalloc, you are right that it would be cleaner to improve obmalloc so that it uses mmap when necessary, instead of adding another layer with dlmalloc (even though that is effectively what happens today on Linux systems, where dlmalloc is integrated into libc).

    I will try to do that patch in coming weeks (obmalloc mostly allocates some 256KB arenas so it should nearly always use mmap).

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 15 years ago

    I will try to do that patch in coming weeks (obmalloc mostly allocates some 256KB arenas so it should nearly always use mmap).

    Exactly so. If you can, please also consider supporting Windows in the same way.

    Anything in obmalloc that is not arena space should continue to come from malloc, I believe.

    tim-one commented 15 years ago

    Anything in obmalloc that is not arena space should continue to come from malloc, I believe.

    Sorry, but I don't understand why arena space should be different. If a platform's libc implementers think mmap should be used to obtain 256KB chunks (i.e., arenas), then surely they implement the platform malloc to defer to mmap in such cases. If they don't but "should", then bugging the platform vendor to improve the system malloc in this respect is the best idea (then all apps on the platform benefit, and Python stays simpler).

    OTOH, if for some compelling reason it's believed Python knows better than platform vendors, then obmalloc should be uglied-up on all paths to make the enlightened choice.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 15 years ago

    OTOH, if for some compelling reason it's believed Python knows better than platform vendors, then obmalloc should be uglied-up on all paths to make the enlightened choice.

    I'm proposing that obmalloc is changed to know better than system malloc on systems supporting anonymous mmap, and Windows, and that the call

       malloc(ARENA_SIZE)

    is replaced by mmap. This has the advantage of doing better than system malloc on Solaris, plus it also might guarantee that arenas will be POOL_SIZE aligned.

    OTOH, the calls

      realloc(arenas, nbytes)
      malloc(nbytes)

    should continue to go to system malloc, because they are typically not multiples of the system page size.

    tim-one commented 15 years ago

    I have to admit that if Python /didn't/ know better than platform libc implementers in some cases, there would be no point to having obmalloc at all :-(

    What you (Martin) suggest is reasonable enough.

    cf2aeb3c-f43f-407a-8d37-30a2b5033222 commented 15 years ago

    Here is a new patch so that pymalloc can be combined with dlmalloc.

    I first added the --with-pymalloc-mmap option to configure.in which ensures that pymalloc arenas are allocated through mmap when possible.

    However, I found this was not enough: PyObject_Malloc uses arenas only when handling objects smaller than 256 bytes. For bigger objects, it relies directly on the system malloc. There are also some big buffers which can be allocated directly through PyMem_MALLOC.

    This patch can be activated by compiling Python with: --with-pymalloc --with-pymalloc-mmap --with-dlmalloc

    The behavior is then as follows:

    I think it is a good compromise: on systems like Linux, where the system malloc is already clever enough, compiling with only --with-pymalloc should behave as before. On systems like SunOS and AIX, this patch ensures that Python can benefit from the speed of pymalloc for small objects, while ensuring that most of the allocated memory can be correctly released at the system level.

    cf2aeb3c-f43f-407a-8d37-30a2b5033222 commented 15 years ago

    My previous patch has a small problem: I believed dlmalloc would always return a non-NULL value, even when asked for 0 bytes.

    It turns out not to be the case, so here is a new patch (patch_dlmalloc3.diff) which must be applied after the previous one (patch_dlmalloc2.diff) to correct this problem.

    83d2e70e-e599-4a04-b820-3814bbdb9bef commented 14 years ago

    Any SunOS/AIX people interested in keeping this open?

    cf2aeb3c-f43f-407a-8d37-30a2b5033222 commented 14 years ago

    Well I am still interested in getting this patch officially integrated in Python.

    This patch is integrated in the version of Python that we deploy to our customers with our products (Sungard GP3). So it runs in production at various client sites (some European banks with massive SunOS and AIX servers running thousands of sessions of our application) and it has provided huge reductions in memory consumption.

    The problem appears quite obviously when you run a relatively big application on SunOS or AIX: if you allocate some memory in a Python process at some stage, this memory will never be released to the system until you leave that process, even if that memory is not used by Python anymore. With my patch, the process can actually release the memory to the system so that it can be used by other processes.

    Linux is not impacted by this problem because the GNU libc implements the same memory allocation mechanism based on dlmalloc.

    I guess there are not that many people running Python applications with a big memory footprint on AIX or SunOS, otherwise this problem would be more popular.

    pitrou commented 13 years ago

    I guess there are not that many people running Python applications with a big memory footprint on AIX or SunOS, otherwise this problem would be more popular.

    Not only that: integrating a big chunk of foreign code into something as critical as the memory allocation routines is not an easy decision to make. Also, the dlmalloc copy would then have to be kept regularly in sync with upstream.

    79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 13 years ago

    Sébastien: I'm chiming in late, but doesn't AIX have something like LD_PRELOAD? Why not use it to transparently replace AIX's legacy malloc by another malloc implementation like dlmalloc or ptmalloc? That would not require any patching of Python, and could also be used for other applications.

    As a side note, while mmap has some advantages, it is way slower than brk (because pages must be zero-filled, and since mmap/munmap is called on every malloc/free call, this zero-filling is done every time, unlike with brk pools). See http://sources.redhat.com/ml/libc-alpha/2006-03/msg00033.html

    cf2aeb3c-f43f-407a-8d37-30a2b5033222 commented 13 years ago

    Hi Charles-François,

    It is possible to influence the memory allocation system on AIX using some environment variables (MALLOCOPTIONS and others), but it is not very elegant: it affects all applications running in that environment, and it is difficult to ensure that those variables are set correctly when distributing an application to a customer. I am afraid most users will never hear about that and will just get the default behavior.

    Concerning mmap performance, dlmalloc has a pool mechanism and Python has its own pool mechanism on top of that. As a result, system calls to allocate memory do not happen frequently, since memory allocation is usually handled internally in those pools, and dlmalloc is often faster than the native malloc.

    I have been distributing a version of Python which integrates this patch with the application on which I work to various customers for the last few years and the benchmarks have not shown any significant performance degradation. On the other hand, the decrease in memory consumption has been clearly noticed and appreciated.

    Also note that dlmalloc (or a derivative - ptmalloc) is part of GNU glibc which is used by most Linux systems, and is what you get when you call malloc. http://en.wikipedia.org/wiki/Malloc#dlmalloc_and_its_derivatives

    So by using dlmalloc on SunOS and AIX you would get the same level of performance for memory operations that you already probably can appreciate on Linux systems.

    79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 13 years ago

    it is possible to impact the memory allocation system on AIX using some environment variables (MALLOCOPTIONS and others)

    LD_PRELOAD won't change AIX's malloc behaviour, but it allows you to replace it transparently with any other implementation you like (dlmalloc, ptmalloc, ...), without touching either CPython or your application.

    For example, let's say I want a Python version where getpid always returns 42.

    $ cat /tmp/pid.c
    int getpid(void)
    {
            return 42;
    }
    $ gcc -o /tmp/pid.so /tmp/pid.c -fpic -shared

    Now,

    $ LD_PRELOAD=/tmp/pid.so python -c 'import os; print(os.getpid())'
    42

    That's it. If you replace pid.so by dlmalloc.so, you'll be using dlmalloc instead of AIX's malloc, without having modified a single line of code. If you're concerned about impacting other applications, then you could do something like:

    $ cat python.c
    #include <stdlib.h>
    #include <unistd.h>
    
    int main(int argc, char *argv[])
    {
            setenv("LD_PRELOAD", "/tmp/pid.so", 1);
            execv(<path to real python>, argv);
    
            return 1;   /* only reached if execv failed */
    }

    And then:

    $ ./python -c 'import os; print(os.getpid())'
    42

    Also note that dlmalloc (or a derivative - ptmalloc) is part of GNU glibc which is used by most Linux systems, and is what you get when you call malloc. http://en.wikipedia.org/wiki/Malloc#dlmalloc_and_its_derivatives

    Actually, glibc/eglibc versions have diverged quite a lot from the original ptmalloc2; see for example http://bugs.python.org/issue11849 (that's one reason why embedding such a huge piece of code into Python is probably not a good idea, as highlighted by Antoine: it's updated fairly frequently).

    So by using dlmalloc on SunOS and AIX you would get the same level of performance for memory operations that you already probably can appreciate on Linux systems.

    Yes, but with the above "trick", you can do that without patching Python or your app. I mean, if you start embedding malloc in Python, why stop there, and not embed the whole glibc ;-) Note that I realize this won't solve the problem for other AIX users (if there are any left :-), but since this patch doesn't seem to be gaining traction, I'm just proposing an alternative that I find cleaner, simpler and easier to maintain.

    92935ae4-c5d3-4cd3-81e6-25bec3013308 commented 13 years ago

    > So by using dlmalloc on SunOS and AIX you would get the same level of performance for memory operations that you already probably can appreciate on Linux systems.

    Yes, but with the above "trick", you can do that without patching python nor your app. I mean, if you start embedding malloc in python, why stop there, and not embed the whole glibc ;-) Note that I realize this won't solve the problem for other AIX users (if there are any left :-), but since this patch doesn't seem to be gaining adhesion, I'm just proposing an alternative that I find cleaner, simpler and easier to maintain.

    This trick is hard to find, however, and I don't think it serves Solaris and AIX users very much (and sadly IBM keeps pushing AIX, so yes, it's used more than I'd like :-( ).

    So how about a --with-dlmalloc=path/to/dlmalloc.c? This way the dlmalloc code does not live inside Python and doesn't need to be maintained by Python, but Python still supports the code and can easily be built with it. Add a note in the README for AIX and Solaris and I think this would be a lot friendlier to users. This is similar to how Python uses e.g. openssl to provide optional extra functionality/performance.

    pitrou commented 13 years ago

    So how about a --with-dlmalloc=path/to/dlmalloc.c?

    Can't you just add dlmalloc to LDFLAGS or something? Or would the default malloc still be selected?

    This is similar in how python uses e.g. openssl to provide optional extra functionality/performance.

    It's not really similar. OpenSSL provides functionality that's not available through the standard library. Here, we're talking about an alternative implementation of the standard C routines.

    79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 13 years ago

    I just noticed there's already a version of dlmalloc in Modules/_ctypes/libffi/src/dlmalloc.c

    Compiling with gcc -shared -fpic -o /tmp/dlmalloc.so ./Modules/_ctypes/libffi/src/dlmalloc.c

    Then LD_PRELOAD=/tmp/dlmalloc.so ./python

    works just fine (and by the way, it solves the problem with glibc's version in bpo-11849, it's somewhat slower though).

    Or am I missing something?

    cf2aeb3c-f43f-407a-8d37-30a2b5033222 commented 13 years ago

    I'm just proposing an alternative that I find cleaner, simpler and easier to maintain.

    I understand how LD_PRELOAD works but I find it neither clean nor simple to maintain.

    Also by using a wrapper to call Python you still impact all the applications that may be executed from Python since the environment variables are propagated. You also need to configure the path to the alternative malloc library at runtime.

    And as I said above, I am afraid most AIX and SunOS users will never hear about that and will just use the default behavior, with their Python application taking much more memory than necessary.

    As mentioned by Floris, AIX is being pushed by IBM quite a lot, and in some markets it is very common (if not predominant in finance for example - 50/50 with SunOS for my clients I would say).

    I mean, if you start embedding malloc in python, why stop there, and not embed the whole glibc ;-)

    Concerning AIX, that would not be such a bad idea given the number of bugs in the native C library (cf some of my other issues reported in python bug tracker) - just kidding ;-)

    Concerning the fact that dlmalloc or ptmalloc evolve "quickly":

    Also an old dlmalloc is better than no dlmalloc at all. And as you noticed, an old dlmalloc is already provided in libffi.

    So how about a --with-dlmalloc=path/to/dlmalloc.c?

    That looks like a good alternative. I can implement that if that can help to get the patch in Python.

    Can't you just add dlmalloc to LDFLAGS or something? Or would the default malloc still be selected?

    There is a USE_DL_PREFIX in malloc.c. If this flag is defined, all functions will be prefixed by dl (dlmalloc, dlfree, dlrealloc...). If it is not set, the functions will be named as usual (malloc, free...).

    In my patch, I preferred to set USE_DL_PREFIX and call dlmalloc/dlfree explicitly where needed.

    Since I want PyMem_MALLOC to call dlmalloc, I would need to export the "malloc" symbol from libpython so that Python extensions could use it when calling PyMem_MALLOC, but that would impact all malloc calls in applications which embed Python for example.

    So I think it is probably better to explicitly distinguish when you want to call dlmalloc and leave the native malloc for the host application.

    Also this only addresses the --with-dlmalloc part of my patch. The other part concerning --with-pymalloc-mmap ensures that pymalloc uses mmap to allocate arenas rather than malloc.

    I perfectly understand that people are reluctant to make the memory allocation system more complex than it is already in Python in order to bypass some limitations of systems which are not very widespread among Python users.

    But Python eating a lot of memory on SunOS and AIX does not look very good either.

    I have some strong requirements as far as memory is concerned for my application so I have maintained this patch internally and distributed it as part of my application.

    I will probably change jobs soon and will not have access to AIX systems anymore. I don't really expect this patch to be accepted soon, as few people have expressed interest and I don't have much time/interest to push it on python-dev, but I will update the patch for Python 2.7 and 3.2 before leaving so that people affected by this problem can at least manually patch their Python if they find this issue.

    pitrou commented 13 years ago

    Since I want PyMem_MALLOC to call dlmalloc, I would need to export the "malloc" symbol from libpython so that Python extensions could use it when calling PyMem_MALLOC, but that would impact all malloc calls in applications which embed Python for example.

    Well, that would be a rather good thing. There are, IIRC, Python API calls which require that the caller manually frees memory. If the API call malloc()s memory with a certain allocator and the caller free()s it with another allocator, the result won't be pretty :)

    (a similar discrepancy occurs between function-based APIs and macro-based APIs: functions get compiled inside the Python library while macros get compiled within the embedding executable; if library and application have an incompatible malloc()/free() pair, you will get similarly funny results)

    cf2aeb3c-f43f-407a-8d37-30a2b5033222 commented 13 years ago

    Yes, I was probably not clear: When --with-dlmalloc is activated, PyMem_MALLOC/PyMem_Malloc will call dlmalloc, PyMem_REALLOC/PyMem_Realloc will call dlrealloc and PyMem_FREE/PyMem_Free will call dlfree.

    While calls to malloc/free/realloc will use the platform implementation.

    So I think there should not be any mixing, since, as mentioned in pymem.h, people should not mix PyMem_MALLOC/PyMem_FREE with malloc/free:

    /* BEWARE:

    Each interface exports both functions and macros. Extension modules should use the functions, to ensure binary compatibility across Python versions. Because the Python implementation is free to change internal details, and the macros may (or may not) expose details for speed, if you do use the macros you must recompile your extensions with each Python release.

    Never mix calls to PyMem_ with calls to the platform malloc/realloc/calloc/free. For example, on Windows different DLLs may end up using different heaps, and if you use PyMem_Malloc you'll get the memory from the heap used by the Python DLL; it could be a disaster if you free()'ed that directly in your own extension. Using PyMem_Free instead ensures Python can return the memory to the proper heap.

    As another example, in PYMALLOC_DEBUG mode, Python wraps all calls to all PyMem and PyObject_ memory functions in special debugging wrappers that add additional debugging info to dynamic memory blocks. The system routines have no idea what to do with that stuff, and the Python wrappers have no idea what to do with raw blocks obtained directly by the system routines then.

    The GIL must be held when using these APIs. */

    pitrou commented 13 years ago

    Yes, I was probably not clear: When --with-dlmalloc is activated, PyMem_MALLOC/PyMem_Malloc will call dlmalloc, PyMem_REALLOC/PyMem_Realloc will call dlrealloc and PyMem_FREE/PyMem_Free will call dlfree.

    While calls to malloc/free/realloc will use the platform implementation.

    I'm not sure why you would want that. If dlmalloc is clearly superior, why not use it for all allocations inside the application (not only Python ones)?

    92935ae4-c5d3-4cd3-81e6-25bec3013308 commented 13 years ago

    On 29 April 2011 17:16, Antoine Pitrou \report@bugs.python.org\ wrote:


    > Yes, I was probably not clear: When --with-dlmalloc is activated, PyMem_MALLOC/PyMem_Malloc will call dlmalloc, PyMem_REALLOC/PyMem_Realloc will call dlrealloc and PyMem_FREE/PyMem_Free will call dlfree.
    >
    > While calls to malloc/free/realloc will use the platform implementation.

    I'm not sure why you would want that. If dlmalloc is clearly superior, why not use it for all allocations inside the application (not only Python ones)?

    For the same reason that extension modules can choose between PyMem_Malloc and plain malloc (or whatever else). Python has never forced its malloc on extension modules; why should it now?

    pitrou commented 13 years ago

    For the same reason that extension modules can choose between PyMem_Malloc and plain malloc (or whatever else). Python has never forced its malloc on extension modules; why should it now?

    We're talking about a platform-specific feature request due to the fact that dlmalloc is (supposedly) superior to AIX malloc(). If it's superior, then I don't see any *practical* reason not to use it for purposes other than allocating Python objects.

    79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 13 years ago

    Even worse than that, mixing two malloc implementations could lead to trouble. For example, the trimming code checks that the top of the heap is where it last set it, so if an allocation has been made by another implementation in the meantime, the heap won't be trimmed, and your memory usage won't decrease. It will also increase memory fragmentation. Finally, if you've got two threads inside different malloc implementations at the same time, some really bad things could happen. And there are probably many other reasons why it's a bad idea.

    cf2aeb3c-f43f-407a-8d37-30a2b5033222 commented 13 years ago

    I share the opinion of Floris on this: just because you link your application with Python does not mean you want it to handle all memory management.

    If you want the memory to be handled by Python, you should call PyMem_Malloc.

    Otherwise people may want to use different malloc implementations in different parts of their application/libraries for different reasons (dmalloc for debugging http://dmalloc.com/ for example - we have seen that libffi bundles its own dlmalloc - someone may prefer a derivative of ptmalloc for performance reasons with threads...).

    My application is linked with various libraries including libpython, glib and gmp, and I sometimes like to be able to distinguish how much memory is allocated by which library for profiling/debugging purpose for example.

    I don't understand the point concerning trimming/fragmentation/threading by Charles-François: dlmalloc will allocate its own memory segment using mmap and handle memory inside that segment when you do a dlmalloc/dlfree/dlrealloc. Other malloc implementations will work in their own separate space and so won't impact or be impacted by what happens in dlmalloc segments.

    dlmalloc is not that much different from pymalloc in that regard: it handles its own memory pool on top of the system memory implementations. Yet you can have an application that uses the ordinary malloc while calling some Python code which uses pymalloc without any trimming/fragmentation/threading issues.

    79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 13 years ago

    I don't understand the point concerning trimming/fragmentation/threading by Charles-Francois: dlmalloc will allocate its own memory segment using mmap and handle memory inside that segment when you do a dlmalloc/dlfree/dlrealloc. Other malloc implementations will work in their own separate space and so won't impact or be impacted by what happens in dlmalloc segments.

    Most of the allocations come from the heap - through sbrk - which is a shared resource, and is a contiguous space. mmap is only used for big allocations.

    dlmalloc is not that much different from pymalloc in that regard: it handles its own memory pool on top of the system memory implementations. Yet you can have an application that uses the ordinary malloc while calling some Python code which uses pymalloc without any trimming/fragmentation/threading issues.

    It's completely different. Pymalloc is used *on top* of libc's malloc, while dlmalloc would be used in parallel.

    cf2aeb3c-f43f-407a-8d37-30a2b5033222 commented 13 years ago

    Another reason why you should not force dlmalloc for all applications linked with libpython is because dlmalloc is (by default) not thread safe, while the system malloc is (generally) thread-safe. It is possible to define a constant in dlmalloc to make it thread-safe (using locks) but it will be slower and it is not needed in Python since the GIL must be held when using PyMem_ functions.

    If a thread-safe implementation was needed, it would be better to switch to ptmalloc2.

    Also that addresses the issue of "two threads inside different malloc implementations at the same time": it is currently not allowed with PyMem_Malloc.

    Most of the allocations come from the heap - through sbrk

    Most python objects will be allocated in pymalloc arenas (if they are smaller than 256 bytes) which (if compiled with --with-pymalloc-mmap) will be directly allocated by calling mmap, or (without --with-pymalloc-mmap) will be allocated in dlmalloc by calling mmap (because arenas are 256KB). So most of the python objects will end up in mmap segments separate from the heap.

    The only allocations that will end up in the heap are for the medium python objects (>256 bytes and <256KB) or for allocations directly by calling PyMem_Malloc (and for a size <256KB). Also dlmalloc will not call sbrk for each of those allocations: dlmalloc allocates some large memory pools and manages the smaller allocations within those pools in a very efficient way. So the heap fragmentation should indeed be reduced by using dlmalloc.

    Most modern malloc implementations are also using pools/arenas anyway, so the heap will mostly contain a mix of native malloc arenas and dlmalloc pools. So the fragmentation should not be too much of a concern if you mix 2 malloc implementations. Here is OpenSolaris malloc implementation for example: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libmalloc/common/malloc.c#514

    Concerning trimming: the reason why I am proposing to use dlmalloc on AIX and Solaris is that the native malloc/free do not correctly trim the heap in the first place on those platforms! If malloc/free correctly worked on those platforms and the heap was trimmed when possible, I would not have taken the trouble of proposing this patch and using dlmalloc, I would happily use the native malloc/free.

    So mixing 2 malloc implementations should not be a problem as long as you keep track of the right 'free' implementation to use for each pointer (which should already be the case when you call PyMem_Malloc/PyMem_Free instead of malloc/free).

    If you are really concerned about mixing 2 malloc implementations in the heap, you can define "HAVE_MORECORE 0" in dlmalloc and that way dlmalloc will always use mmap and not use the heap at all.

    My application uses the provided patch so that dlmalloc is used for Python objects and the native malloc for all the rest (which consumes much less memory than the Python part) on AIX and SunOS. It has been in production for years and we have never experienced any crash related to memory problems.

    79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 13 years ago

    Also that addresses the issue of "two threads inside different malloc implementations at the same time": it is currently not allowed with PyMem_Malloc.

    That's not true. You can perfectly have one thread inside PyMem_Malloc while another one is inside libc's malloc. For example, posix_listdir does:

     Py_BEGIN_ALLOW_THREADS
     dirp = opendir(name);
     Py_END_ALLOW_THREADS

    Where opendir calls malloc internally. Since the GIL is released, you can have another thread inside PyMem_Malloc at the same time. This is perfectly safe, as long as the libc's malloc version is thread-safe.

    But with your patch, such code wouldn't be thread-safe anymore. This patch implies that a thread can't call malloc directly or indirectly (printf, opendir, and many others) while it doesn't hold the GIL. This is going to break a lot of existing code. This thread-safety issue is not theoretical: I wrote up a small program with two threads, one allocating/freeing memory in loop with glibc's malloc and the other one with dlmalloc: it crashes immediately on a Linux box.

    Most python objects will be allocated in pymalloc arenas (if they are smaller than 256 bytes) which (if compiled with --with-pymalloc-mmap) will be directly allocated by calling mmap, or (without --with-pymalloc-mmap) will be allocated in dlmalloc by calling mmap (because arenas are 256KB). So most of the python objects will end up in mmap segments separate from the heap.

    The only allocations that will end up in the heap are for the medium python objects (>256 bytes and <256KB) or for allocations directly by calling PyMem_Malloc (and for a size <256KB).

    Note that there are actually many objects falling into this category: for example, on 64-bit, a dictionary exceeds 256B, and is thus allocated directly from the heap (well, it changed really recently actually), the same holds for medium-sized lists and strings. So, depending on your workload, the heap can extend and shrink quite a bit.

    If you are really concerned about mixing 2 malloc implementations in the heap, you can define "HAVE_MORECORE 0" in dlmalloc and that way dlmalloc will always use mmap and not use the heap at all.

    It will also be slower, and consume more memory.

    cf2aeb3c-f43f-407a-8d37-30a2b5033222 commented 13 years ago

    Sorry for the very late reply; I have been quite busy recently with the birth of my second daughter, a new job, a new home town and soon a new home.

    ...

    But with your patch, such code wouldn't be thread-safe anymore. This patch implies that a thread can't call malloc directly or indirectly (printf, opendir, and many others) while it doesn't hold the GIL. This is going to break a lot of existing code.

    I didn't have this problem since the threads in my application are handled by Python and so hold the GIL. But you are right it is a concern.

    Fortunately, it is easy to solve by defining the following in dlmalloc:
    #define HAVE_MORECORE 0

    That way, all the memory allocations handled by Python will go in a dedicated mmaped memory segment controlled by dlmalloc, while all the calls to the system malloc will work as before (probably going into a segment handled by sbrk).
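    Concretely, the mmap-only configuration amounts to the following (constants from Doug Lea's malloc.c; shown as a sketch of the build settings, not a full diff):

```
#define HAVE_MORECORE 0   /* never call sbrk(); dlmalloc obtains all of
                             its memory from the system via mmap() */
#define HAVE_MMAP 1       /* the default: keep mmap() available */
```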

    It will also be slower, and consume more memory.

    It should be noted that sbrk is deprecated on some platforms where mmap is suggested as a better replacement (Mac OS X, FreeBSD...). sbrk is generally considered quite archaic.

    I attach a new patch that can be applied to Python 2.7.1. It includes the dlmalloc modification and uses only mmap in this case (no sbrk).

    We have delivered it in production with the new version of our software that works on AIX 6.1 and it works fine.

    I also did some benchmarks and did not notice any slowdown compared to a pristine Python 2.7.1 (actually it was slightly faster, YMMV). It also consumes a lot less memory, but that is the reason for this patch in the first place.

    Since I am changing jobs, I won't be working on AIX anymore (yeah!); I also don't expect this patch to be integrated spontaneously without someone interested in AIX pushing for it. So I leave this patch mostly as a reference for someone who is impacted by this problem and would like to integrate it in their own Python. I hope it helps.

    pitrou commented 13 years ago

    Since I am changing jobs, I won't be working on AIX anymore (yeah!);

    You seem happy about that :) Does it mean the project to have an AIX buildbot is abandoned?

    I also don't expect this patch to be integrated spontaneously without someone interested in AIX pushing for it. So I leave this patch more as a reference for someone who would be impacted by this problem and would like to integrate it in his own Python. I hope it helps.

    Indeed, thanks for your contributions.

    cf2aeb3c-f43f-407a-8d37-30a2b5033222 commented 13 years ago

    Does it mean the project to have an AIX buildbot is abandoned?

    We have a buildbot running internally on AIX. I could not get the necessary modifications integrated upstream in the official Python buildbot so that we could plug directly on it.

    cf this thread: http://mail.python.org/pipermail/python-dev/2010-October/thread.html#104714

    I will try to get someone at my company to keep this buildbot running and report any outstanding bug, but I can't guarantee anything.

    Indeed, thanks for your contributions.

    Thanks! And thank you for your help in most of the issues related to AIX.

    79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 13 years ago

    Fortunately, it is easy to solve by defining the following in dlmalloc:

    #define HAVE_MORECORE 0

    I was expecting this answer ;-) Here's a quick demo, on a Linux box:

     cf@neobox:~/cpython$ ./python Tools/pybench/pybench.py -n 1
     -------------------------------------------------------------------------------
     Totals:                         19787ms 19787ms

     cf@neobox:~/cpython$ MALLOC_MMAP_THRESHOLD_=0 ./python Tools/pybench/pybench.py -n 1
     [...]
     -------------------------------------------------------------------------------
     Totals:                         33375ms 33375ms

    That's a mere 70% slowdown, and without pymalloc it would be much worse. malloc with mmap() is way slower than with sbrk() (see http://sources.redhat.com/ml/libc-alpha/2006-03/msg00033.html for more details). Since your benchmarks don't show this type of regression, it probably means that AIX's malloc implementation is really broken (there's also the fact that part of the allocations are still routed to the libc's malloc, or maybe your workload is too specific to demonstrate this behavior).

    sbrk is generally considered quite archaic.

    I wouldn't say that; see the above link on malloc's dynamic mmap() threshold.

    I also don't expect this patch to be integrated spontaneously without someone interested in AIX pushing for it.

    Indeed. As far as I'm concerned, there are two "showstoppers":

    But I think the main problem with this patch is that AIX represents such a tiny fraction of the user base. This might change in the future, especially if IBM is successful in its efforts to push AIX (I hope they'll finally fix AIX's malloc by then...).

    I have been quite busy recently with the birth of my second daughter, a new job, a new home town and soon a new home.

    Congratulations, and good luck!

    pitrou commented 11 years ago

    PEP-445 allows you to customize the Python memory allocators, which is a better solution than shipping several of them with Python ;-)