caspervdw opened 2 years ago
Probably some benefit to this in PostGIS too: getting inside the MemoryContext structure means things like configured memory upper limits can be enforced, and small allocations should be a little faster.
I just realised that the Python allocator will require the GIL (global interpreter lock) to be held. I missed that detail when scanning the docs earlier.
So for the Shapely use case this will probably be a reason to keep using the system allocator. Or am I missing something, @jorisvandenbossche?
Yes, that indeed seems a reason we wouldn't want to set up GEOS with Python's memory allocator in Shapely. There is also a Raw Memory Interface that does not require the GIL, but it seems this is just a plain wrapper around malloc et al., so using it wouldn't give any advantage (it's rather meant for overriding the allocator in Python with your own custom one).
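For reference, a minimal sketch (not Shapely or GEOS code; the function names are made up for illustration) contrasting the two Python allocator families mentioned above:

```cpp
// PyMem_Malloc/PyMem_Free require the calling thread to hold the GIL;
// PyMem_RawMalloc/PyMem_RawFree do not, but by default they simply
// forward to the system malloc/free.
#include <Python.h>
#include <cstddef>

void* alloc_needs_gil(std::size_t size) {
    // Only valid while this thread holds the GIL.
    return PyMem_Malloc(size);
}

void* alloc_gil_free(std::size_t size) {
    // Safe without the GIL, but essentially a plain malloc wrapper.
    return PyMem_RawMalloc(size);
}
```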
That said, I think this would still be an interesting enhancement to explore, whether for external users of GEOS to plug in their own allocator (like PostGIS might do), or to experiment with different memory allocators in GEOS itself. (I would assume the legwork for both use cases largely overlaps?)
For example, in the Arrow C++ project we have a configurable MemoryPool (header), with implementations based on the system allocator, jemalloc and mimalloc. We default to jemalloc / mimalloc (depending on the OS) because those give better performance (when enabled at build time).
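Roughly, using that pool API looks like the sketch below (exact signatures may differ between Arrow versions):

```cpp
#include <arrow/memory_pool.h>
#include <cstdint>
#include <iostream>

int main() {
    // jemalloc, mimalloc or the system allocator, depending on how Arrow was built.
    arrow::MemoryPool* pool = arrow::default_memory_pool();

    uint8_t* buf = nullptr;
    if (!pool->Allocate(1024, &buf).ok()) {
        return 1;  // allocation failed
    }
    std::cout << pool->backend_name() << ": "
              << pool->bytes_allocated() << " bytes currently allocated\n";
    pool->Free(buf, 1024);
    return 0;
}
```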
In the GEOS context, a custom memory allocator would require all references to C++ containers to be changed from `std::vector<T>` to `std::vector<T, my_allocator>` (cf. https://stackoverflow.com/a/826635), for example the `std::vector<Coordinate> vect` member of the `CoordinateArraySequence` class. This could be a bit of an invasive change.
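To make the scope of that change concrete, here is a rough sketch of a minimal allocator that forwards to overridable hooks; `geos_malloc`/`geos_free` and `GeosAllocator` are hypothetical names for illustration, not existing GEOS code:

```cpp
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical global hooks a host application could override.
static void* (*geos_malloc)(std::size_t) = std::malloc;
static void (*geos_free)(void*) = std::free;

template <class T>
struct GeosAllocator {
    using value_type = T;
    GeosAllocator() = default;
    template <class U> GeosAllocator(const GeosAllocator<U>&) {}

    T* allocate(std::size_t n) {
        void* p = geos_malloc(n * sizeof(T));
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { geos_free(p); }
};

template <class T, class U>
bool operator==(const GeosAllocator<T>&, const GeosAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const GeosAllocator<T>&, const GeosAllocator<U>&) { return false; }

// Every container declaration would need the extra template argument, e.g.:
// std::vector<Coordinate, GeosAllocator<Coordinate>> vect;
using DoubleVector = std::vector<double, GeosAllocator<double>>;
```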
In Arrow C++, as far as I can see, the MemoryPool infrastructure is only used for "big" allocations (that is, for the contents of Arrow arrays), which are done manually through it; I don't see it used for standard C++ containers (probably because those don't consume a lot of memory).
I would be seriously interested in attempting to pick this up. Are there any other challenges that immediately come to mind? I suppose the C API would have to be extended as well.
Not to my mind. I looked into it a while ago, and to my untrained eye this looked like a reasonable way to do the implementation: https://github.com/aws/aws-sdk-cpp/tree/bb1fdce01cc7e8ae2fe7162f24c8836e9d3ab0a2/aws-cpp-sdk-core/include/aws/core/utils/memory
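The general pattern used there is, roughly, an abstract memory-system interface plus global install/uninstall hooks. The names below are hypothetical, sketching how a GEOS equivalent might look, not actual GEOS or AWS SDK declarations:

```cpp
#include <cstddef>
#include <new>

class MemorySystemInterface {
public:
    virtual ~MemorySystemInterface() = default;
    virtual void* AllocateMemory(std::size_t size, std::size_t alignment,
                                 const char* tag = nullptr) = 0;
    virtual void FreeMemory(void* ptr) = 0;
};

namespace {
MemorySystemInterface* g_memory_system = nullptr;  // none installed by default
}

void InstallMemorySystem(MemorySystemInterface& m) { g_memory_system = &m; }
void UninstallMemorySystem() { g_memory_system = nullptr; }

// Library-internal allocation entry points; fall back to new/delete when no
// memory system has been installed.
void* Malloc(const char* tag, std::size_t size) {
    if (g_memory_system) {
        return g_memory_system->AllocateMemory(size, alignof(std::max_align_t), tag);
    }
    return ::operator new(size);
}

void Free(void* ptr) {
    if (g_memory_system) {
        g_memory_system->FreeMemory(ptr);
    } else {
        ::operator delete(ptr);
    }
}
```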
Providing custom memory management functions would allow dependent libraries to apply their own memory management policies. Especially when small amounts of memory are repeatedly allocated/deallocated, caching/pooling may give a decent speedup.
Out-of-memory situations could also be handled more gracefully (specifically for pygeos/shapely, we could raise an error instead of going into swap; see https://github.com/shapely/shapely/issues/1349).
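A minimal sketch of that "fail instead of swapping" idea (hypothetical, not pygeos/shapely code): a malloc wrapper that refuses allocations once a configured upper limit would be exceeded, which the binding can then turn into a Python `MemoryError`:

```cpp
#include <atomic>
#include <cstdlib>

static std::atomic<std::size_t> g_used{0};
static std::size_t g_limit = 512 * 1024 * 1024;  // e.g. a 512 MiB cap

void* limited_malloc(std::size_t size) {
    // Simplified: the matching free() and per-pointer size tracking are omitted,
    // and the check is not fully race-free; this only illustrates the idea.
    if (g_used.load() + size > g_limit) {
        return nullptr;  // caller raises an error instead of the process swapping
    }
    void* p = std::malloc(size);
    if (p) g_used += size;
    return p;
}
```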
Ref. https://trac.osgeo.org/geos/ticket/540
One way to set up such a C interface is described in the NumPy docs (NEP 49): https://numpy.org/neps/nep-0049.html and another one in the Python docs: https://docs.python.org/3/c-api/memory.html#memoryoverview
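As a concrete shape for such an interface, a hypothetical GEOS C-API extension modeled on Python's `PyMemAllocatorEx` could look roughly like this (none of these GEOS names exist today; they are made up for illustration):

```cpp
#include <stddef.h>

extern "C" {

typedef struct {
    void* ctx;  // opaque user context, e.g. to track usage against a limit
    void* (*malloc_fn)(void* ctx, size_t size);
    void* (*calloc_fn)(void* ctx, size_t nelem, size_t elsize);
    void* (*realloc_fn)(void* ctx, void* ptr, size_t new_size);
    void  (*free_fn)(void* ctx, void* ptr);
} GEOSMemoryAllocator;  /* hypothetical struct, analogous to PyMemAllocatorEx */

/* Hypothetical setter/getter, analogous to PyMem_SetAllocator /
 * PyMem_GetAllocator; would have to be called before any GEOS allocation. */
void GEOSMem_SetAllocator(const GEOSMemoryAllocator* allocator);
void GEOSMem_GetAllocator(GEOSMemoryAllocator* allocator);

}  // extern "C"
```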
@jorisvandenbossche @hobu