intel / hyperscan

High-performance regular expression matching library
https://www.hyperscan.io

Improved memory management for NUMA architectures #108

Open redmeadowman opened 6 years ago

redmeadowman commented 6 years ago

While investigating Hyperscan for use in a DPI system, I've noticed that the current memory allocation method creates a bottleneck for multi-threaded/multi-streamed NUMA (Non-Uniform Memory Access) systems, such as multi-socket Linux servers. The bottleneck is centered on the four static functions in the file alloc.c. They force a multi-threaded caller to wrap these methods in a critical region (semaphore, mutex or spinlock). In addition, the static variables associated with these methods create false-sharing issues in the caches of multi-threaded systems.
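For context, the contention comes from the allocator hooks in alloc.c being process-wide statics shared by every thread. A simplified sketch of the pattern (illustrative, not the exact source):

/* Simplified sketch of the alloc.c pattern (not the exact source):
 * the allocator hooks are process-wide statics shared by all threads. */
static hs_alloc_t hs_database_alloc = malloc;  /* used when building databases */
static hs_alloc_t hs_misc_alloc     = malloc;  /* used for miscellaneous structures */
static hs_alloc_t hs_scratch_alloc  = malloc;  /* used by hs_alloc_scratch() */
static hs_alloc_t hs_stream_alloc   = malloc;  /* used by hs_open_stream() */

/* hs_set_allocator() overwrites all of these at once, so threads that want
 * different NUMA-local allocators race on the same global state. */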

Do you plan to extend the Hyperscan interface to improve memory management?

Possible solutions could include:

Regards, Daris Nevil

redmeadowman commented 6 years ago

This effort was simple enough that I decided to tackle it. I have issued a Pull Request for this feature.

gou4shi1 commented 6 years ago

Hi, @redmeadowman Why do you need to "wrap these methods with a critical region semaphore, mutex or spinlock."?

redmeadowman commented 6 years ago

Hey gou4shi1,

My application has up to 32 threads, split across 4 NUMA memory zones. They all independently allocate and free stream and scratch areas during runtime. Each NUMA memory zone requires its own allocator in order to ensure that only memory from that zone is allocated. So the threads must make a call like this:

hs_set_allocator(Alloc_NUMA[i], Free_NUMA[i]);   // install zone i's allocator globally
hs_alloc_scratch(p_db, &m_scratch);              // scratch is now drawn from zone i
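Alloc_NUMA[i] and Free_NUMA[i] here are a per-zone allocator pair. For illustration only, a minimal libnuma-based sketch of what such a pair could look like (the names and fixed zone count are assumptions for the example):

#include <numa.h>      // libnuma: numa_alloc_onnode() / numa_free()
#include <cstddef>     // size_t

// Illustrative zone-pinned allocators matching Hyperscan's hs_alloc_t /
// hs_free_t signatures. ZONE is the NUMA node the memory is bound to.
template <int ZONE>
void *allocOnZone(size_t size) {
    // numa_free() needs the allocation size back, so stash it in a header.
    size_t *p = (size_t *)numa_alloc_onnode(size + sizeof(size_t), ZONE);
    if (!p) return nullptr;
    *p = size + sizeof(size_t);
    return p + 1;
}

template <int ZONE>
void freeOnZone(void *ptr) {
    if (!ptr) return;
    size_t *p = (size_t *)ptr - 1;
    numa_free(p, *p);
}

// One (alloc, free) pair per zone; i in the snippet above selects the pair.
void *(*Alloc_NUMA[4])(size_t) = { allocOnZone<0>, allocOnZone<1>,
                                   allocOnZone<2>, allocOnZone<3> };
void (*Free_NUMA[4])(void *)   = { freeOnZone<0>, freeOnZone<1>,
                                   freeOnZone<2>, freeOnZone<3> };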

This solves the problem of allocating from the correct NUMA memory pool, but there is a serious contention issue if more than one thread tries to perform this operation simultaneously. So my workaround for the original Hyperscan memory management design is as follows:

bool allocScratch(const hs_database_t* p_db)
{
    // A named guard is required: a temporary Spinlock::Guard(m_spinlock)
    // would release the lock at the end of that statement.
    Spinlock::Guard guard(m_spinlock);
    // The allocator hooks are process-global, so they must be re-installed
    // under the lock before every allocation; i is this thread's NUMA zone.
    hs_set_allocator(Alloc_NUMA[i], Free_NUMA[i]);
    return HS_SUCCESS != hs_alloc_scratch(p_db, &m_scratch);
}

The Spinlock::Guard object allows only one thread to pass at a time, and the static allocator pointers must be swapped on every call to hs_alloc_scratch(). As you can see, this is not very efficient.

A better way, at least in my application, is to perform memory management external to the Hyperscan library. To do so I've added the following new interfaces:

hs_scratch_size()
hs_clone_scratch_at()
hs_scratch_memaddr()
hs_stream_size()
hs_open_stream_at()
hs_close_stream_nofree()
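For a sense of the shape of these interfaces, here is a sketch of plausible signatures following Hyperscan's existing conventions; the authoritative declarations are in the pull request, so treat these parameter lists as assumptions:

/* Hypothetical declarations, for illustration only -- see the PR for the real ones. */
hs_error_t hs_scratch_size(const hs_scratch_t *scratch, size_t *size);
hs_error_t hs_clone_scratch_at(const hs_scratch_t *src, hs_scratch_t **clone);  /* clone into caller-provided memory */
hs_error_t hs_scratch_memaddr(const hs_scratch_t *scratch, void **addr);        /* expose the underlying block so the caller can free it */
hs_error_t hs_stream_size(const hs_database_t *db, size_t *size);
hs_error_t hs_open_stream_at(const hs_database_t *db, unsigned flags, hs_stream_t *stream);  /* placement-style open */
hs_error_t hs_close_stream_nofree(hs_stream_t *stream, hs_scratch_t *scratch,
                                  match_event_handler onEvent, void *ctx);      /* like hs_close_stream(), but leaves the memory to the caller */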

Using these functions, the above example can be rewritten as:

bool allocScratch(const hs_database_t* p_db)
{
    // m_base_scratch is a prototype scratch built once up front; each
    // thread clones it into memory drawn from its own NUMA zone.
    size_t alloc_size;
    hs_scratch_size(m_base_scratch, &alloc_size);
    hs_scratch_t *clone = (hs_scratch_t *)Alloc_NUMA[i](alloc_size);
    if (clone == nullptr)
        return true;
    m_scratch = clone;
    return HS_SUCCESS != hs_clone_scratch_at(m_base_scratch, &m_scratch);
}

Notice the absence of the costly Spinlock and the call to hs_set_allocator().
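The stream side works the same way. Assuming the sketched signatures above, per-thread stream handling could look like this (onMatch and the member variables are illustrative):

// Illustrative per-thread stream handling built on the proposed PR
// interfaces; signatures assumed as sketched earlier in the thread.
static int onMatch(unsigned id, unsigned long long from,
                   unsigned long long to, unsigned flags, void *ctx) {
    return 0;   // application match callback; 0 = continue matching
}

bool openStream(const hs_database_t* p_db)
{
    size_t stream_size;
    hs_stream_size(p_db, &stream_size);      // bytes needed for one stream
    m_stream = (hs_stream_t *)Alloc_NUMA[i](stream_size);
    if (m_stream == nullptr)
        return true;
    // Initialize the stream in place, in zone-local memory.
    return HS_SUCCESS != hs_open_stream_at(p_db, 0 /* flags */, m_stream);
}

void closeStream()
{
    // Flush end-of-data matches and tear down matcher state, but leave
    // the memory itself to the external NUMA allocator.
    hs_close_stream_nofree(m_stream, m_scratch, onMatch, nullptr);
    Free_NUMA[i](m_stream);
    m_stream = nullptr;
}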