Open redmeadowman opened 6 years ago
This effort was simple enough that I decided to tackle it. I have issued a Pull Request for this feature.
Hi, @redmeadowman Why do you need to "wrap these methods with a critical region semaphore, mutex or spinlock."?
Hey gou4shi1,
My application has up to 32 threads, split across 4 NUMA memory zones. They all independently allocate and free stream and scratch areas during runtime. Each NUMA memory zone requires its own allocator in order to assure that only memory from that zone is allocated. So the threads must make a call like this:
hs_set_allocator(Alloc_NUMA[i], Free_NUMA[i]);
hs_alloc_scratch();
This solves the problem of allocating from the correct NUMA memory pool. But there is a serious contention issue if more than one thread tries to perform this operation simultaneously. So my solution for the original Hyperthread memory management design is as follows:
bool allocScratch(const hs_database_t* p_db)
{
Spinlock::Guard(M_spinlock);
hs_set_allocator(Alloc_NUMA[i], Free_NUMA[i]);
return HS_SUCCESS != hs_alloc_scratch(p_db, &m_scratch);
}
The Spinlock::Guard object allows only one thread to pass at a time, and the pointers in the static memory allocators must be changed with each call to hs_alloc_scratch(). As you can see this is not very efficient.
A better way, at least in my application, is to perform memory management external to the Hyperthread library. To do so I've added the following new interfaces:
hs_scratch_size()
hs_clone_scratch_at()
hs_scratch_memaddr()
hs_stream_size()
hs_open_stream_at()
hs_close_stream_nofree()
Using these functions the above example can be changed to this:
bool allocScratch(const hs_database_t* p_db)
{
size_t alloc_size;
hs_scratch_size(m_base_scratch, &alloc_size);
hs_clone_scratch *clone = Alloc_NUMA[i](alloc_size);
hs_clone_scratch_at(m_base_scratch, &clone);
}
Notice the absence of the costly Spinlock and call to hs_set_allocator().
While investigating Hyperscan for use in a DPI system I've noticed that the current memory allocation method will create a bottleneck for multi-threaded/multi-streamed NUMA (Non-Uniform Memory Architectures) systems, such as Linux. The bottleneck is created around the four static functions in the file alloc.c. It requires a mutli-threaded caller to wrap these methods with a critical region semaphore, mutex or spinlock. In addition, the static variables associated with these methods create false-sharing issues in the cache of multi-threaded systems.
Do you plan to improve the Hyperscan interface to improve memory management?
Possible solutions could include:
Regards, Daris Nevil