thomasmoelhave / tpie

Templated Portable I/O Environment

I/O stack performance #238

Open gijsde1ste opened 4 years ago

gijsde1ste commented 4 years ago

Hi all,

As some of you might know by now, I'm setting up an experiment to compare the performance of internal and I/O-efficient algorithms. I'm trying to get some early benchmarks of how TPIE's I/O stack performance compares to the internal stack. See this very bare-bones code example:

#include <tpie/tpie.h>
#include <tpie/internal_stack.h>
#include <tpie/stack.h>
#include <chrono>
#include <iostream>

// 500,242,880 doubles = roughly 4 GB of data.
constexpr size_t N = 500242880;

void testInternalStack(){
    std::cout << "Testing internal stack" << std::endl;
    // Fixed-capacity in-memory stack; the whole array is allocated up front.
    tpie::internal_stack<double> s(N);

    for (size_t i = 0; i < N; i++){
        s.push(i);
    }

    for (size_t i = 0; i < N; i++){
        s.pop();
    }
}

void testIOStack(){
    std::cout << "Testing IO stack" << std::endl;
    // External-memory stack backed by a temporary file on disk.
    tpie::stack<double> s;

    for (size_t i = 0; i < N; i++){
        s.push(i);
    }

    for (size_t i = 0; i < N; i++){
        s.pop();
    }
}

int main() {
    tpie::tpie_init();

    // Tell TPIE it may use at most 128 MB of internal memory.
    size_t available_memory_mb = 128;
    tpie::get_memory_manager().set_limit(available_memory_mb*1024*1024);

    auto start = std::chrono::high_resolution_clock::now();
    testIOStack();
    //testInternalStack();
    auto stop = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << duration.count() << std::endl; // elapsed time in microseconds

    tpie::tpie_finish();

    return 0;
}

In order to limit the available RAM I've set up two cgroups: 128Group, with the intended test size of 128 MB RAM and effectively unlimited (10 GB) swap, and a control group, unlimitedGroup, with 8 GB RAM and 10 GB swap. As the images below show, there is no real difference between running the program normally and running it under the unlimited cgroup, so it is not the cgroup mechanism itself that adds performance overhead.

The internal stack behaves as expected: it is very quick when enough RAM is available, taking ~1.2 seconds to push and pop roughly 4 GB of doubles (500,242,880 × 8 bytes). When RAM becomes scarce it slows down to ~25 seconds because of swapping.

The I/O stack, however, behaves somewhat unexpectedly. When RAM is not an issue it takes ~14-15 seconds, which is plausible since it actually performs I/O. As you can see, there is no TPIE warning that the 128 MB limit set in TPIE is exceeded. The 14-15 seconds is also faster than the ~25 seconds of the internal stack when that was limited to 128 MB, which is good. The odd part is that when I run the I/O stack under the 128 MB cgroup it becomes much slower, ~64 seconds, which suggests it performs far more I/Os than when there is no limit. That should not be the case, should it?
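To check whether TPIE itself stays within the configured budget, one could sample the memory manager during the run. A minimal sketch, assuming tpie::memory_manager exposes used() and limit() accessors (they appear in recent TPIE versions, but worth verifying against your headers):

#include <tpie/memory.h>
#include <iostream>

// Print how much of the configured TPIE memory budget is currently in use.
// used() and limit() are assumed accessors on tpie::memory_manager.
void reportTpieMemory() {
    auto & mm = tpie::get_memory_manager();
    std::cout << "TPIE memory: " << mm.used() << " / "
              << mm.limit() << " bytes" << std::endl;
}

Calling this every few million pushes inside testIOStack() would show whether the 128 MB budget is actually respected, or whether some allocations bypass the memory manager.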

What I've already tried is playing with the swappiness factor. The OS starts swapping out pages before RAM is full (otherwise it would stall too much), but even when I set swappiness to 0 (swap only when absolutely necessary) I get the same results. While monitoring the cgroup stats I can see that swap is actively used while running this benchmark.
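Independent of the cgroup counters, the process can also observe its own swapping by reading the VmRSS and VmSwap lines from /proc/self/status (Linux-specific); a minimal sketch:

#include <fstream>
#include <iostream>
#include <string>

// Print the resident and swapped-out portions of this process' memory,
// as reported by the Linux kernel in /proc/self/status.
void reportProcessMemory() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        // Lines look like "VmRSS:  123456 kB" and "VmSwap:  7890 kB".
        if (line.rfind("VmRSS:", 0) == 0 || line.rfind("VmSwap:", 0) == 0)
            std::cout << line << std::endl;
    }
}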

Can anyone shed some light on why the performance of the I/O-efficient stack depends on how much RAM is available?

Images/results of benchmark:

[screenshots: internal_stack, IO_stack]

adament commented 4 years ago

As a quick test, have you tried setting the memory available to TPIE to something less than 128 MB, e.g. 64 MB? Just to test whether the problem might be that you overestimate the amount of memory actually available to TPIE.
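For reference, that would be a one-line change in the benchmark above (64 is an arbitrary value comfortably below the cgroup limit):

// Leave headroom below the 128 MB cgroup cap for the process'
// own overhead (code, malloc metadata, OS buffers).
size_t available_memory_mb = 64;
tpie::get_memory_manager().set_limit(available_memory_mb*1024*1024);

The idea being that the process needs memory beyond what TPIE manages, so a TPIE limit equal to the cgroup limit may already push the process into swap.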

SSoelvsten commented 2 years ago

@gijsde1ste has been inactive on GitHub for more than a year and left this issue without progress for two years. Maybe it should just be closed?