python / cpython

The Python programming language
https://www.python.org
Other
62.12k stars 29.86k forks source link

Add PyMem_AlignedAlloc() #63035

Closed rhettinger closed 6 years ago

rhettinger commented 11 years ago
BPO 18835
Nosy @tim-one, @rhettinger, @pitrou, @vstinner, @benjaminp, @tpn, @njsmith, @skrah, @xdegaye, @wscullin
PRs
  • python/cpython#4089
  • python/cpython#4200
  • python/cpython#15333
  • Files
  • align.diff: Use case for alignment functions and macros
  • pymem_alignedalloc.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = created_at = labels = ['interpreter-core', 'type-feature', '3.7'] title = 'Add PyMem_AlignedAlloc()' updated_at = user = 'https://github.com/rhettinger' ``` bugs.python.org fields: ```python activity = actor = 'vstinner' assignee = 'none' closed = True closed_date = closer = 'vstinner' components = ['Interpreter Core'] creation = creator = 'rhettinger' dependencies = [] files = ['31543', '47285'] hgrepos = [] issue_num = 18835 keywords = ['patch'] message_count = 57.0 messages = ['196174', '196176', '196194', '196196', '196202', '196693', '196700', '196713', '196828', '196834', '217229', '232201', '232209', '232215', '232219', '232221', '232222', '232223', '232225', '234086', '286074', '304803', '304805', '304807', '304826', '304834', '304837', '304843', '304844', '304859', '304878', '304989', '305003', '305009', '305012', '305114', '305122', '305136', '305160', '305177', '305301', '305302', '305304', '305308', '305310', '305311', '305327', '305331', '305339', '305344', '305345', '305366', '305397', '305417', '305418', '305466', '306864'] nosy_count = 11.0 nosy_names = ['tim.peters', 'rhettinger', 'pitrou', 'vstinner', 'benjamin.peterson', 'trent', 'njs', 'skrah', 'neologix', 'xdegaye', 'wscullin'] pr_nums = ['4089', '4200', '15333'] priority = 'normal' resolution = 'rejected' stage = 'resolved' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue18835' versions = ['Python 3.7'] ```

    vstinner commented 6 years ago

    msg196194: Antoine Pitrou: "Note that the current small object allocator, if not disabled, *should* already return you aligned memory, by construction (each allocation size has dedicated pools from which memory blocks are carved)."

    In my current implementation of _PyObject_AlignedAlloc(), I only call pymalloc_alloc() for alignment \<= pymalloc ALIGNMENT. Maybe we can do better?

    pitrou commented 6 years ago

    I agree, if the two primary users say they won't use it, it doesn't make sense for us to maintain 600 hundred lines of low-level C code.

    5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 6 years ago

    Yes, it may be better not to add it.

    To summarize the problems again: --------------------------------

    The feature could still be useful for fixing bpo-31912 and bpo-27987, *if* someone has an idea how to integrate the aligned version.

    vstinner commented 6 years ago

    "C11 aligned_alloc() / free() would be more comfortable but isn't available on MSVC."

    Is it common to require the allocated memory block to be filled with zeros? In Python, calloc() is not the most common allocator. I only found 5 places where calloc is used:

    Modules/_ssl.c:5172: _ssl_locks = PyMem_Calloc(_ssl_locks_count, Modules/gcmodule.c:1689: g = (PyGC_Head *)PyObject_Calloc(1, size); Modules/gcmodule.c:1717:_PyObject_GC_Calloc(size_t basicsize) Objects/bytesobject.c:83: op = (PyBytesObject *)PyObject_Calloc(1, PyBytesObject_SIZE + size); Objects/listobject.c:176: op->ob_item = (PyObject **) PyMem_Calloc(size, sizeof(PyObject *));

    "posix_memalign() performance isn't that great. hand-rolled aligned_calloc() is the fastest."

    I'm not sure that the cost of the memory allocator itself defeats the gain of aligned memory on algorithms. I expect data processing to be much more expensive than the memory allocation, no?

    Again, the unknown remains the benchmark result.

    Can anyone write a quick benchmark to measure the gain of aligned memory? 4 years ago, Raymond Hettinger wanted to use it for the set and collection.deque types.

    vstinner commented 6 years ago

    About the API itself, I'm not sure that PyMem_AlignedAlloc(alignment, size) is flexible enough. If we want to get *data* aligned in a Python object, we would have to pass an offset to the data, since Python objects have headers of variable size (depending on the type).

    Windows has such API:

    void * _aligned_offset_malloc(  
       size_t size,   
       size_t alignment,   
       size_t offset  
    );  

    This function is based on malloc, so likely adds padding bytes for you depending on size, alignment and offset.

    https://msdn.microsoft.com/fr-fr/library/ec852tkw.aspx

    See bpo-27987: "obmalloc's 8-byte alignment causes undefined behavior".

    5531d0d8-2a9c-46ba-8b8b-ef76132a492c commented 6 years ago

    I'm not sure that the cost of the memory allocator itself defeats the gain of aligned memory on algorithms. I expect data processing to be much more expensive than the memory allocation, no?

    I guess this issue isn't easy to focus due to the vast variety of use cases. So the is only about numpy/ndtypes:

    What you write is true, but I'm simply getting cold feet w.r.t locking myself into memset(). calloc() uses mmap() for large allocations, so I think one can happily allocate a large number of huge arrays without any cost on Linux, as long as they're not accessed.

    At least that's what my tests indicate.

    Couple that with the fact that one has to use aligned_free() anyway, and that posix_memalign() isn't that great, and the use case seems less solid **for scientific computing**.

    So I rather waste a couple of bytes per allocation and deal with some Valgrind macros to get proper bounds checking.

    Note that CPython still knows any allocation from ndtypes, because ndt_callocfunc will be set to PyMem_Calloc() [1] and the custom ndt_aligned_calloc() uses ndt_callocfunc.

    [1] If bpo-31912 is solved.

    vstinner commented 6 years ago

    Because of the lack of strong motivition compared to the size of the change and the cost to maintain it, I reject this issue. For the archives, I attach a copy of the latest version of my implementation.