Memory allocation improvements

There are several memory allocation-related features which could improve performance in situations where performance is limited by allocation overhead:

1) Constant-time allocation lookup: The current memory allocator is designed to support block allocations (i.e. large allocations that manage the memory for many elements in a sequence). Scanner supports kernels which return output elements that point to arbitrary locations within a block allocation. However, this means that determining which allocation a pointer belongs to requires a linear search through all allocations to find the proper range. The reason for this is that the result of C++ pointer comparisons between pointers in different memory allocations is undefined, and thus we can't rely on ordering the pointers for accelerating lookups. Technically, this means that even performing a linear search through all allocations might have false positives and so our current strategy shouldn't work at all, but in practice we've found that compilers do not induce these false positives.

Solution 1: Change the Element structure to include a handle to the block allocation. This means we can avoid having to perform a search for the allocation, removing the reliance on undefined behavior and significantly accelerating the memory allocator when there are a large number of allocations. This would also require changing the new_block_buffer (and new_buffer?) interface to return a vector of Elements that contain the handle to the block allocation instead of returning raw pointers. Unfortunately, this is a breaking change and means updating every Scanner kernel.

2) Allocation pools/ring buffers: Currently, all allocation of output Elements in Scanner kernels dynamically allocate their memory on every call to new_buffer or new_block_buffer. This can introduce a lot of overhead when the processing in the kernel is fairly light. In many cases, the size of the output allocation from a kernel is either fixed (e.g. outputting a frame of a fixed size) or bounded. In those cases, dynamic allocation is unnecessary and a more efficient implementation would be to provide the kernels upfront with their output buffers. This gives control over those allocations to the Scanner runtime and permits a more efficient implementation. For example, we could implement slab allocators or ring buffers that reuse allocations across different invocations of the kernel. This change would require 1) kernels to be able to specify that they have fixed or bounded output sizes and 2) for these kernels to receive their pre-allocated buffers as input to the kernel, instead of relying on new_buffer to perform dynamic allocation.

scanner-research / scanner

Memory allocation improvements #227