microsoft / GSL

Guidelines Support Library

Performance of algorithms over a gsl::span #376

Closed MikeGitb closed 7 years ago

Does anyone have any real-world data on how using gsl::span affects overall application performance? I wrote a few ver simple micro benchmarks similar to the following:

#include <iostream>
#include <algorithm>
#include <random>
#include <vector>
#include <chrono>
#include <gsl/span>

using namespace std;
using namespace std::chrono;

int main() {
    //generate random testdata
    std::vector<int> _data(10000);
    std::default_random_engine rg{};

    gsl::span<int> data(_data);
    long long sum = 0;
    auto start = system_clock::now();
    for (size_t i = 0; i < 1000; ++i) {
        //std::generate(_data.begin(), _data.end(), std::ref(rg)); //1)
        //std::generate(data.begin(), data.end(), std::ref(rg));    //2)
        //std::sort(_data.begin(), _data.end());                          //3)
        //std::sort(data.begin(), data.end());                             //4)
        sum += _data.back();
    }
    auto end = system_clock::now();

    std::cout << sum << std::endl;
    std::cout << duration_cast<milliseconds>(end - start).count() << std::endl;
}

I compiled and ran 4 different versions with VS2015 Update 3 in release mode on a Win10 Ivy Bridge machine and got the following results:

Assuming there is no systematic error in my benchmark (which I can't rule out), using gsl::span does incur a significant performance penalty when used with standard algorithms. I can't tell whether this is because the compiler is unable to eliminate the range checks or because a span iterator is a bigger object.

I guess my question is whether that overhead has been a problem for anyone in real applications, if/how this should be addressed, and whether someone might already have a solution for this. Results from other platforms are also welcome.

A partial solution that comes to mind would be, e.g., to provide wrappers for the standard algorithms which take gsl::spans as parameters but internally pass raw pointers to the STL algorithms. E.g. for sort:

void sort(gsl::span<int> data) {
    std::sort(data.data(), data.data()+data.size());
}

Of course with all the appropriate template logic.

galik commented 7 years ago

I'm not experiencing anything that dramatic. In my testing there is only a slight overhead to using gsl::span.

I did three tests, each with a control. The control times how long it takes to shuffle a container and to checksum its contents. The actual test does a shuffle followed by a sort followed by a checksum. I use the time difference between the control and the test as a value representing just the sort time.

For 1000 iterations of 100000 integers I got the following results (in seconds):

vector<int>
sort cpy: 7.76034

vector<int>&
sort ref: 7.99321

span<int>
sort cpy: 8.3961

Sorting the vector through a reference is slightly slower than sorting it directly, and sorting through a span is slightly slower still. Not by huge amounts though.

Linux: GCC v6.2.1 -O3 -D GSL_UNENFORCED_ON_CONTRACT_VIOLATION

Edit: Just for fun I added a std::array to the mix which is predictably slightly faster:

std::array<int, N>
sort cpy: [0]: 7.7427

std::vector<int>
sort cpy: [0]: 7.79957

std::vector<int>&
sort ref: [0]: 7.9695

gsl::span<int>
sort cpy: [0]: 8.40562
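The control/test methodology described above could be sketched like this (sizes, iteration counts, and the harness itself are illustrative, not galik's actual code; the control shuffles and checksums, the test additionally sorts, and the difference approximates the sort time):

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Time one run: shuffle + (optional sort) + checksum, repeated.
double time_seconds(bool with_sort, std::size_t n, int iterations)
{
    std::vector<int> v(n);
    std::iota(v.begin(), v.end(), 0);
    std::mt19937 rng{42};
    long long checksum = 0;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::shuffle(v.begin(), v.end(), rng);
        if (with_sort)
            std::sort(v.begin(), v.end());
        checksum += std::accumulate(v.begin(), v.end(), 0LL);
    }
    auto end = std::chrono::steady_clock::now();

    volatile long long sink = checksum; // keep the work observable
    (void)sink;
    return std::chrono::duration<double>(end - start).count();
}
```

Subtracting the control time from the test time isolates the sort, which is then repeated for vector, vector&, and span.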
rianquinn commented 7 years ago

@MikeGitb I was actually going to ask if "-O3" was enabled. I also forgot to ask at CppCon if Expects / Ensures sets branch prediction correctly. Once I get a free chance I wanted to look at that, assuming someone doesn't beat me to it 😄. If everything is optimized correctly, it really should not make that big of a difference.

maikel commented 7 years ago

In contrast to these simple benchmarks, I also wondered how the call to linearize in each dereference of a (multi_)span iterator will slow down production code. I am about to use an adjusted version of strided spans for scientific computations, so I will keep an eye on this "issue" and see whether we need to change the internals of the span iterators within our project. But I won't draw any premature conclusions before measuring and comparing it in action.

galik commented 7 years ago

Looking at the span_iterator, dereferencing it involves first dereferencing a span* followed by a subscript operation on the result. It might be faster to store a simple pointer to the target data; however, that would prevent the detailed bounds checking. I think you would need to store three pointers, for the beginning, current, and end positions, to keep the same level of bounds checking. I suppose you could conditionally include the extra bounds-checking pointers only in builds that keep contract checking enabled (i.e. that don't define GSL_UNENFORCED_ON_CONTRACT_VIOLATION).
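The three-pointer layout described here might look like the following sketch (illustrative only, not the GSL's actual span_iterator; assert stands in for the GSL's Expects):

```cpp
#include <cassert>
#include <cstddef>

// A checked iterator holding three raw pointers instead of a span*
// plus an index: dereference needs no indirection through the span,
// at the cost of a larger iterator object.
template <typename T>
class checked_iterator {
public:
    checked_iterator(T* begin, T* current, T* end)
        : begin_(begin), current_(current), end_(end) {}

    T& operator*() const
    {
        assert(current_ >= begin_ && current_ < end_); // bounds check
        return *current_;
    }

    checked_iterator& operator++()
    {
        assert(current_ < end_);
        ++current_;
        return *this;
    }

    bool operator!=(const checked_iterator& rhs) const
    {
        return current_ != rhs.current_;
    }

private:
    T* begin_;   // first valid element
    T* current_; // position this iterator refers to
    T* end_;     // one past the last valid element
};
```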

maikel commented 7 years ago

Sorry for not being accurate enough. I do not have a problem with span_iterator, IMO it is completely fine for one-dimensional spans.

The bounds_iterator for multi-dimensional spans, on the other hand, stores a multi-dimensional index which gets linearized using the proper strides on every dereference, which means some computation each time. On the other hand, the logic to advance an iterator is simpler this way. Imagine iterating through a multi-dimensional grid: this could still be unnecessary overhead compared to other approaches.
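The per-dereference cost being described can be illustrated with a minimal 2-D example (this is not the GSL's bounds_iterator, just the shape of the computation):

```cpp
#include <cassert>
#include <cstddef>

// A bounds_iterator-style iterator stores a multi-dimensional index;
// each dereference must flatten it with the strides, whereas a plain
// pointer would simply be bumped.
struct index2d {
    std::ptrdiff_t row, col;
};

std::ptrdiff_t linearize(index2d idx, std::ptrdiff_t row_stride)
{
    // one multiply and one add, repeated on every dereference
    return idx.row * row_stride + idx.col;
}
```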

Anyway I just wanted to contribute to the issue, because I think IF there is any fixable performance issue at all it is probably with bounds_iterator.

rianquinn commented 7 years ago

@MikeGitb

I was able to verify that your findings are indeed correct. gsl::span is in fact slower. The numbers that I saw with your example (and -O3 enabled):

As you can see, just filling the data didn't seem to affect performance, but sorting added roughly 250ms on my Haswell machine. Furthermore, adding branch prediction (which is not in the GSL right now, but should be) didn't seem to help. I suspect it's because the Expects clauses are actually pretty complicated (they are not simple bounds checks). For example:

constexpr span_iterator& operator+=(difference_type n) noexcept
{
    Expects(span_ && (index_ + n) >= 0 && (index_ + n) <= span_->length());
    index_ += n;
    return *this;
}

As you can see, this one operation introduces 3 checks, 2 of which involve an addition, and one of which involves a function call. Branch prediction helps, but it only optimizes the pipeline for the overall "if" and doesn't remove the remaining additions / function call.

I still think that it's worth it, but I would agree that making the claim that there is no performance hit is misleading at best. The better question is... can we optimize span better so that it provides the same checks, but doesn't have the same performance hit?

galik commented 7 years ago

When I did my tests I compiled with -D GSL_UNENFORCED_ON_CONTRACT_VIOLATION which turns bounds checking off altogether. I only build with bounds checking in non-release builds. If the application misbehaves it should be caught during testing.

rianquinn commented 7 years ago

@galik Actually, I would agree with that, and my tests showed it as well: there was no performance hit with that macro defined. Since Ensures / Expects default to std::terminate, there really isn't a huge benefit to having the checks enabled in release builds. If you are using the "throw" version, it might be a little different. In other words, I don't think that span should be considered a substitute for unit testing. I just wish that C++ had first-class support for mocking so that unit testing wasn't such a PITA.

galik commented 7 years ago

I could probably handle runtime checks during span creation or when assigning a new range, but checking every access seems a bit over the top. Once you check the initial bounds there is no point checking every single memory location between them.

Mind you I had been thinking about a discontiguous_span that would present a contiguous view onto separate regions from one or more buffers. That would be ideal for non-destructive editing. Or reading stream data into various different fields in a structure.

neilmacintosh commented 7 years ago

@rianquinn @galik you are both correct:

  1. Yes, there is no need for the compiler to emit multiple redundant range checks. However, span cannot know in advance which ones are redundant, so that is best left to the optimizer, rather than using configuration macros to turn checks at various points within span on/off. When I get a chance, I'll upload the MSVC implementation changes that demonstrate some of the work we've done there.
  2. Yes, the Expects and Ensures clauses can be complex sometimes, and yes, branch prediction can help (thanks for the PR, I'll take a look soon). However, the approach we have looked at so far is having the optimizer remove redundant checks wherever possible to be the minimum required to establish and maintain the required contracts.

My aim for span (at least with MSVC, where I can more easily have direct involvement with the optimizer) is that people should be able to run without turning runtime checks off, unless they have profiled and found a specific hot-path that they cannot tolerate and where the checks are costing them.

Put more concisely (to quote my own presentation): our aim is to be no slower than hand-written code with equivalent safety-checks.

neilmacintosh commented 7 years ago

@maikel the multi_span::iterator has not had much optimization attention given to it. We did some basic improvements a long time ago, but there is probably more that could be done.

neilmacintosh commented 7 years ago

@MikeGitb to answer the original question: the results we have from internal usage of span are that, in real-world applications, no one has yet seen it have a significant impact on performance. In all our internal usage (Office, Edge, CppCoreCheck...) there are things going on that are just much more expensive, so even if we improved span performance, it would barely show up in the overall application context. So I appreciate the question about real usage contexts, as micro-benchmarks can be a little tricky like that.

Of course, we want span (and all the GSL) to be as efficient as possible, and I'll be committing changes soon that let span better leverage the MSVC optimizer in common situations. However, in the meantime, I would suggest measuring in real-world usage before coming to any severe conclusions. :-)

prog-fh commented 7 years ago

I tried this out with my own version of a span-like type and realised that the main noticeable difference, which could explain the runtime penalty on generate+sort, probably lies in the != iterator operation: it tests lhs.span_ == rhs.span_ && lhs.index_ == rhs.index_, where lhs.index_ == rhs.index_ alone would be sufficient. On straightforward SAXPY-like algorithms the optimizer will see that lhs.span_ == rhs.span_ can be hoisted, but on more complex algorithms (loss of visibility) it won't. Why not use Expects() for this consistency check? By the way, shouldn't GSL_UNENFORCED_ON_CONTRACT_VIOLATION be defined automatically when NDEBUG is? (My two cents.) Thanks for your work.
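This observation can be modelled with a minimal iterator (illustrative only, not the GSL source; assert stands in for Expects):

```cpp
#include <cassert>
#include <cstddef>

// An iterator holding a parent pointer and an index. The "full" form
// compares both fields on every != test; the "contract" form asserts
// same-parent once and compares only the index, leaving the hot loop
// with a single integer comparison.
struct model_iterator {
    const void* span_;     // identity of the parent span
    std::ptrdiff_t index_; // position within the span

    // Form discussed above: both fields participate in the result.
    bool neq_full(const model_iterator& rhs) const
    {
        return !(span_ == rhs.span_ && index_ == rhs.index_);
    }

    // Suggested form: same-span is a precondition, not part of the result.
    bool neq_contract(const model_iterator& rhs) const
    {
        assert(span_ == rhs.span_); // stand-in for Expects(...)
        return index_ != rhs.index_;
    }
};
```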

rianquinn commented 7 years ago

I noticed the same thing: != was incurring the largest penalty for std::sort.

neilmacintosh commented 7 years ago

Thanks @prog-fh!

We encourage people to keep enforcement of contracts turned on if at all possible. So I wouldn't want to disable-by-default, just because NDEBUG happened to be defined. We want it to be an explicit decision for the user.

I'll have a think about moving the "matching span" condition inside an Expects(). However, this sort of optimization presupposes people turning the contract-checking off...which again is something I want to avoid.

Wherever possible, I prefer to leave span as-is, so it acts as a case to put pressure on optimization to do better ;-) Obviously there are some cases where it's just not possible without some source changes, but they are often fewer than people first think. I'm not 100% sure where this one lies, but I'll do some investigation....

MikeGitb commented 7 years ago

Completely disabling contract checking just because a specific part of the program is slowed down is, I think, the wrong approach. I'd much rather provide a couple of wrapper functions for the standard algorithms that internally operate on raw pointers, as suggested above.

neilmacintosh commented 7 years ago

@MikeGitb I think that's an excellent suggestion...and in line with the approach we're already taking with copy where we provide a gsl::copy that looks very similar to std::copy but takes span objects directly.

MikeGitb commented 7 years ago

@neilmacintosh: I ran a few more microbenchmarks (with VS2017 RC) with std::copy and other STL algorithms, and when working on numbers, the performance difference was sometimes not just a few percent but up to a full order of magnitude. My guess (I didn't actually dig through the assembly) is that using gsl::span iterators essentially prevents the generation of vectorized code, either because certain library optimizations don't fire (like turning a copy loop into memmove) and/or because the checks sufficiently confuse the auto-vectorizer.

Aside from the question of whether there should be more wrapper functions for the STL algorithms in the GSL, maybe this is something the STL implementers could look at (especially when implementing the Ranges TS).