The moodycamel queue is really inefficient when nthreads >> ncores. Either go back to the std::deque implementation, or figure out how to tune its performance.
// The number of times to spin before sleeping when waiting on a semaphore.
// Recommended values are on the order of 1000-10000 unless the number of
// consumer threads exceeds the number of idle cores (in which case try 0-100).
// Only affects instances of the BlockingConcurrentQueue.
static const int MAX_SEMA_SPINS = 0; //10000;
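For reference, here is a minimal sketch of what the std::deque fallback could look like. It is not the project's previous implementation. It assumes a single mutex plus a condition variable is acceptable, and the class name `DequeBlockingQueue` is hypothetical. The method names loosely mirror the moodycamel `BlockingConcurrentQueue` calls (`enqueue`, `wait_dequeue`, `try_dequeue`, `size_approx`), so call sites would need few changes. Waiting consumers sleep in the kernel instead of spinning, which is the behaviour we want when threads far outnumber cores.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical fallback queue (illustrative name, not taken from the project).
// One mutex guards a std::deque; consumers block on a condition variable
// rather than spinning, so idle threads do not compete for cores.
template <typename T>
class DequeBlockingQueue {
public:
    // Append an item and wake one waiting consumer, if any.
    void enqueue(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(std::move(item));
        }
        // Notify after releasing the lock so the woken thread can acquire it.
        not_empty_.notify_one();
    }

    // Block until an item is available, then move it into `out`.
    void wait_dequeue(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        assert(!items_.empty());  // guaranteed by the wait predicate
        out = std::move(items_.front());
        items_.pop_front();
    }

    // Non-blocking variant: returns false if the queue is empty.
    bool try_dequeue(T& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) {
            return false;
        }
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }

    // Number of queued items at the moment the lock was held.
    std::size_t size_approx() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    mutable std::mutex mutex_;
    std::condition_variable not_empty_;
    std::deque<T> items_;
};
```

The trade-off is that every operation takes the same lock, so this will likely be slower than the lock-free queue when threads roughly match cores. It would need benchmarking against the `MAX_SEMA_SPINS = 0` setting before choosing one.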