simsong / bulk_extractor

This is the development tree. Production downloads are at:
https://github.com/simsong/bulk_extractor/releases

Benchmark 1.6 threadpool on BE2.0 #289

Closed · simsong closed this issue 7 months ago

simsong commented 2 years ago

After testing on a 16-core machine, I have confirmed that BE1.6 gets noticeably better thread utilization (roughly 100% CPU saturation) than 2.0beta2, which uses https://github.com/bshoshany/thread-pool.

Here are 1.6 runs on domexusers:

runs-b16-16core/report-12-1635690009.xml:    <clocktime>408.010752</clocktime>
runs-b16-16core/report-12-1635693505.xml:    <clocktime>409.501960</clocktime>
runs-b16-16core/report-12-1635697002.xml:    <clocktime>410.108173</clocktime>
runs-b16-16core/report-16-1635690420.xml:    <clocktime>370.629652</clocktime>
runs-b16-16core/report-16-1635693917.xml:    <clocktime>371.266095</clocktime>
runs-b16-16core/report-16-1635697415.xml:    <clocktime>370.234392</clocktime>
runs-b16-16core/report-20-1635690793.xml:    <clocktime>363.861724</clocktime>
runs-b16-16core/report-20-1635694292.xml:    <clocktime>363.992958</clocktime>
runs-b16-16core/report-20-1635697788.xml:    <clocktime>364.127595</clocktime>
runs-b16-16core/report-24-1635691160.xml:    <clocktime>357.277975</clocktime>
runs-b16-16core/report-24-1635694659.xml:    <clocktime>357.167783</clocktime>
runs-b16-16core/report-24-1635698156.xml:    <clocktime>357.461092</clocktime>
runs-b16-16core/report-28-1635691521.xml:    <clocktime>353.313595</clocktime>
runs-b16-16core/report-28-1635695019.xml:    <clocktime>352.418933</clocktime>
runs-b16-16core/report-28-1635698516.xml:    <clocktime>352.639296</clocktime>
runs-b16-16core/report-32-1635691877.xml:    <clocktime>348.289225</clocktime>
runs-b16-16core/report-32-1635695375.xml:    <clocktime>348.579766</clocktime>
runs-b16-16core/report-32-1635698873.xml:    <clocktime>349.397977</clocktime>
runs-b16-16core/report-4-1635688729.xml:    <clocktime>805.400085</clocktime>
runs-b16-16core/report-4-1635692229.xml:    <clocktime>801.918127</clocktime>
runs-b16-16core/report-4-1635695726.xml:    <clocktime>799.184008</clocktime>
runs-b16-16core/report-8-1635689537.xml:    <clocktime>468.619099</clocktime>
runs-b16-16core/report-8-1635693034.xml:    <clocktime>467.846562</clocktime>
runs-b16-16core/report-8-1635696528.xml:    <clocktime>469.830018</clocktime>

Here are 2.0beta2 runs:

runs-be20-16core/report-12-1635643393.xml:    <clocktime>861.518245</clocktime>
runs-be20-16core/report-12-1635647778.xml:    <clocktime>865.312949</clocktime>
runs-be20-16core/report-12-1635652155.xml:    <clocktime>861.142371</clocktime>
runs-be20-16core/report-16-1635644257.xml:    <clocktime>784.363627</clocktime>
runs-be20-16core/report-16-1635648646.xml:    <clocktime>780.594149</clocktime>
runs-be20-16core/report-16-1635653019.xml:    <clocktime>783.836739</clocktime>
runs-be20-16core/report-20-1635678442.xml:    <clocktime>779.738431</clocktime>
runs-be20-16core/report-20-1635681527.xml:    <clocktime>780.924977</clocktime>
runs-be20-16core/report-24-1635679224.xml:    <clocktime>765.451442</clocktime>
runs-be20-16core/report-24-1635682311.xml:    <clocktime>765.890413</clocktime>
runs-be20-16core/report-28-1635679993.xml:    <clocktime>767.249245</clocktime>
runs-be20-16core/report-32-1635680763.xml:    <clocktime>761.820637</clocktime>
runs-be20-16core/report-4-1635640659.xml:    <clocktime>1736.793448</clocktime>
runs-be20-16core/report-4-1635645044.xml:    <clocktime>1741.427413</clocktime>
runs-be20-16core/report-4-1635649429.xml:    <clocktime>1734.991584</clocktime>
runs-be20-16core/report-4-nps-2009-domexusers.E01-1635633654.xml:    <clocktime>782.923933</clocktime>
runs-be20-16core/report-4-tmp2-1635635030.xml:    <clocktime>1741.146956</clocktime>
runs-be20-16core/report-8-1635642399.xml:    <clocktime>992.185927</clocktime>
runs-be20-16core/report-8-1635646788.xml:    <clocktime>987.261979</clocktime>
runs-be20-16core/report-8-1635651166.xml:    <clocktime>986.072297</clocktime>
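
For reference, averaging the runs at each thread count from the raw numbers above (the BE2.0beta2 averages use only the plain report-N runs) gives approximately:

threads   BE1.6 avg (s)   BE2.0beta2 avg (s)   slowdown
   4          ~802             ~1738             ~2.2x
   8          ~469              ~989             ~2.1x
  16          ~371              ~783             ~2.1x
  32          ~349              ~762             ~2.2x

At every thread count, the 2.0beta2 runs take roughly 2.1x to 2.2x as long as the corresponding BE1.6 runs.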

What this doesn't show is CPU utilization over time. I could see the difference in the system load while the runs were going, but I wasn't able to capture it. See the next issue.

jonstewart commented 2 years ago

So <clocktime> here is overall elapsed "wall clock" time? Yikes!

I had a look at the thread pool. I'm a bit skeeved out by it, to be honest:

    void sleep_or_yield()
    {
        if (sleep_duration)
            std::this_thread::sleep_for(std::chrono::microseconds(sleep_duration));
        else
            std::this_thread::yield();
    }

    /**
     * @brief A worker function to be assigned to each thread in the pool. Continuously pops tasks out of the queue and executes them, as long as the atomic variable running is set to true.
     */
    void worker()
    {
        while (running)
        {
            std::function<void()> task;
            if (!paused && pop_task(task))
            {
                task();
                tasks_total--;
            }
            else
            {
                sleep_or_yield();
            }
        }
    }

So, I'm more familiar with a normal condition_variable-based threadpool, where the CV is used to notify threads when there's work to do. Instead, this implementation has each worker thread either yield or sleep, and then poll the task queue again. std::this_thread::yield() is supposed to yield to other threads in the process with the same priority. If that happens, the thread is suspended for some scheduler-determined interval, perhaps longer than we want: work may show up while time-slicing keeps the thread from waking back up. However, bulk_extractor doesn't run many more threads than the hardware supports, so the call might not yield at all! In that case this loop is effectively a spin lock and will generate a lot of contention on the mutex, maybe making it harder to enqueue new tasks.
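
For contrast, here is a minimal sketch (illustrative only; neither bulk_extractor nor library code) of the condition_variable pattern described above, where workers block on the CV and are woken only when a task is enqueued, so there is no polling, yielding, or fixed-interval sleeping:

    #include <condition_variable>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    class cv_pool {
        std::mutex m;
        std::condition_variable cv;
        std::deque<std::function<void()>> tasks;
        std::vector<std::thread> workers;
        bool shutting_down = false;

        void worker() {
            for (;;) {
                std::function<void()> task;
                {
                    std::unique_lock<std::mutex> lock(m);
                    // Block until there is work or the pool is shutting down.
                    cv.wait(lock, [this]{ return shutting_down || !tasks.empty(); });
                    if (shutting_down && tasks.empty()) return;
                    task = std::move(tasks.front());
                    tasks.pop_front();
                }
                task();                           // run the task outside the lock
            }
        }

    public:
        explicit cv_pool(size_t n) {
            for (size_t i = 0; i < n; ++i) workers.emplace_back(&cv_pool::worker, this);
        }
        void push(std::function<void()> f) {
            { std::lock_guard<std::mutex> lock(m); tasks.push_back(std::move(f)); }
            cv.notify_one();                      // wake exactly one sleeping worker
        }
        ~cv_pool() {
            { std::lock_guard<std::mutex> lock(m); shutting_down = true; }
            cv.notify_all();
            for (auto& t : workers) t.join();
        }
    };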

But it gets worse! If sleep_duration is nonzero, then the worker thread sleeps, not yields. What's the default value of sleep_duration? A whole second! See line 278 of thread_pool.hpp. Woof! And a search in the code base for sleep_duration doesn't turn up any usage of it outside of the thread pool, so this is probably what's happening.

However! Without profiling/timing these different occurrences, it's hard to know what's going on. We can reason about possible outcomes, but I don't feel like I have intuition about what's really happening. Is data being read fast enough?

One thing that you could do, and it would help regardless of which thread_pool you used, would be to change the task functor to work on a batch of sbufs, rather than have a task per sbuf. In a situation where you're CPU-bound and the queue is backed up (likely), this lets the threads work full throttle on sbufs and lessens contention on the mutex.
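
As a rough sketch of the batching idea (sbuf_t, process_sbuf, and pool.push are placeholders here, not bulk_extractor's actual interfaces):

    #include <algorithm>
    #include <functional>
    #include <vector>

    struct sbuf_t;                       // placeholder for the real sbuf type
    void process_sbuf(sbuf_t*);          // placeholder for the per-sbuf work

    // Submit sbufs to the pool in groups of batch_size, so one queue push (and one
    // mutex acquisition inside the pool) covers batch_size sbufs and the workers
    // spend their time in process_sbuf rather than contending on the queue.
    template <typename Pool>
    void submit_in_batches(Pool& pool, const std::vector<sbuf_t*>& pending, size_t batch_size) {
        for (size_t i = 0; i < pending.size(); i += batch_size) {
            const size_t end = std::min(i + batch_size, pending.size());
            std::vector<sbuf_t*> batch(pending.begin() + i, pending.begin() + end);
            pool.push([batch]() {
                for (sbuf_t* sb : batch) process_sbuf(sb);
            });
        }
    }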

Since it's a one-line change, I'd also try setting sleep_duration = 0; somewhere and seeing whether that makes a difference.
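
If I'm reading the library right, sleep_duration is a public member in that version, so the experiment would be something like (num_threads is just a placeholder):

    thread_pool pool(num_threads);
    pool.sleep_duration = 0;    // assumption: 0 makes idle workers yield between polls instead of sleeping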

--

Oh wait. Did I misread the duration of the sleep as 1000 milliseconds when it's actually 1000 microseconds? Hmm.

simsong commented 2 years ago

Thanks for looking at this. I've concluded that I need to throw out the new threadpool and go back to the old one. Do you want to try porting over the BE1.6 threadpool to make it a plug-in replacement for what we have in BE2.0? I should have the instrumentation done shortly. Check this out:

[image attachment]
jonstewart commented 2 years ago

I will take a look. Why do you think it’s the threadpool that’s to blame, though?

simsong commented 2 years ago

> I will take a look. Why do you think it’s the threadpool that’s to blame, though?

Because when I looked at CPU utilization in the AWS Console, it was not consistently at 100%. I'm modifying the DFXML writer so that CPU utilization is captured; then we will be able to graph it at our leisure. It's a good update. But it turns out that it's hard to get %CPU on the Mac, so I need to add that support too.
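
For what it's worth, one portable way to approximate a process's own CPU utilization on both Linux and macOS is getrusage(); here is a sketch, not the actual DFXML writer code:

    #include <sys/resource.h>
    #include <sys/time.h>
    #include <chrono>

    // CPU time (user + system) consumed by this process so far, in seconds.
    static double cpu_seconds() {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        auto tv = [](const timeval& t) { return t.tv_sec + t.tv_usec / 1e6; };
        return tv(ru.ru_utime) + tv(ru.ru_stime);
    }

    // Percent CPU (can exceed 100 on multi-core) since the previous sample() call.
    struct cpu_sampler {
        double last_cpu = cpu_seconds();
        std::chrono::steady_clock::time_point last_wall = std::chrono::steady_clock::now();

        double sample() {
            const double cpu     = cpu_seconds();
            const auto   wall    = std::chrono::steady_clock::now();
            const double elapsed = std::chrono::duration<double>(wall - last_wall).count();
            const double pct     = elapsed > 0 ? 100.0 * (cpu - last_cpu) / elapsed : 0.0;
            last_cpu = cpu; last_wall = wall;
            return pct;
        }
    };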

simsong commented 7 months ago

This has been resolved.