Closed MikeGitb closed 7 years ago
I'm not experiencing anything that dramatic. In my testing there is only a slight overhead to using std::span
.
I did three tests, each with a control. The control times how long it takes to shuffle
a container and to checksum its contents. The actual test does a shuffle
followed by a sort
followed by a checksum. I use the time difference between the control and the test as a value representing just the sort
time.
For 1000
iterations of 100000
integers I got the following results (in seconds):
vector<int>
sort cpy: 7.76034
vector<int>&
sort ref: 7.99321
span<int>
sort cpy: 8.3961
Sorting the vector
through a reference is slightly slower that direct sorting, sorting through a span
is slightly slower still. Not by huge amounts though.
Linux: GCC v6.2.1 -O3 -D GSL_UNENFORCED_ON_CONTRACT_VIOLATION
Edit: Just for fun I added a std::array
to the mix which is predictably slightly faster:
std::array<int, N>
sort cpy: [0]: 7.7427
std::vector<int>
sort cpy: [0]: 7.79957
std::vector<int>&
sort ref: [0]: 7.9695
std::span<int>
sort cpy: [0]: 8.40562
@MikeGitb I was actually going to ask if "-O3" was enabled. I also forgot to ask at CppCon if Expects / Ensures sets branch prediction correctly. Once I get a free chance I wanted to look at that, assuming someone doesn't beat me to it 😄. If everything is optimized correctly, it really should not make that big of a difference.
In contrast to these simple benchmarks, I also wondered how the call to linearize
in each deref of a (multi_)span-iterator will slow down production code. I am about to use an adjusted version of strided spans for scientific computations and I will keep a look at this "issue" and see if we personally need to change the internals of the span iterators in our project scope. But I wont draw any premature conclusions before measuring and comparing it in action.
Looking at the span_iterator
, dereferencing it involves first dereferencing a span*
followed by a subscript operation on the result. It may be faster to store a simple pointer to the target data. However that would prevent the detailed bounds checking. I think you would need to store three pointers for the beginning, current and end positions to keep the same level of bounds checking. I suppose you could conditionally include the bounds checking pointers only in builds that enable bounds checking through the GSL_UNENFORCED_ON_CONTRACT_VIOLATION
macro.
Sorry for not being accurate enough. I do not have a problem with span_iterator
, IMO it is completely fine for one-dimensional spans.
The bounds_iterator
for multi-dimensional spans on the other hand stores a multi-dimensional index which gets linearized using the proper strides on every deref, which leads to some computation each time. On the other hand the logic to advance an iterator is simpler this way. Imagine iterating through a multi-dimensional grid, this could still possibly be unnecessary overhead compared to other approaches.
Anyway I just wanted to contribute to the issue, because I think IF there is any fixable performance issue at all it is probably with bounds_iterator
.
@MikeGitb
I was able to verify that your findings are indeed correct. gsl::span
is in fact slower. The numbers that I saw with your example (and -O3
enabled):
As you can see, just filling the data didn't seem to effect performance, but sorting added about ~250ms on my Haswell machine. Furthermore, adding branch prediction (which is not in the GSL right now, and it should be), didn't seem to help. I suspect, it's because the Expect clauses are actually pretty complicated (they are not simple bounds checks). For example:
constexpr span_iterator& operator+=(difference_type n) noexcept
{
Expects(span_ && (index_ + n) >= 0 && (index_ + n) <= span_->length());
index_ += n;
return *this;
}
As you can see here, in this operation, we are introducing 3 checks, 2 of which introduce addition, and one introduces a function call. Branch prediction helps, but it's only optimizing the pipeline for the overall "if", and doesn't help with the remaining addition / function calls.
I still think that it's worth it, but I would agree that making the claim that there is no performance hit is misleading at best. The better question is... can we optimize span better so that it provides the same checks, but doesn't have the same performance hit?
When I did my tests I compiled with -D GSL_UNENFORCED_ON_CONTRACT_VIOLATION
which turns bounds checking off altogether. I only build with bounds checking in non-release builds. If the application misbehaves it should be caught during testing.
@galik Actually, I would agree with that and my tests showed that as well, there was no performance hit with that enabled. Since Ensure / Expects defaults to std::terminate
, there really isn't a huge benefit to having the checks enabled on release builds. If you are using the "throw" version, it might be a little different. In other words, I don't think that span
should be considered as a substitute for unit testing. I just wish that C++ had first class support for mocking so that unit testing wasn't such a PITA.
I could probably handle runtime checks during span
creation or when assigning a new range, but checking every access seems a bit over the top. Once you check the initial bounds there is no point checking every single memory location between them.
Mind you I had been thinking about a discontiguous_span
that would present a contiguous view onto separate regions from one or more buffers. That would be ideal for non-destructive editing. Or reading stream data into various different fields in a structure.
@rianquinn @galik you are both correct:
span
cannot know in advance which ones are redundant, so that is best left to the optimizer, rather than using configuration macros to turn checks at various points within span
on/off. When I get a chance, I'll upload the MSVC implementation changes that demonstrate some of the work we've done there.My aim for span
(at least with MSVC, where I can more easily have direct involvement with the optimizer) is that people should be able to run without turning runtime checks off, unless they have profiled and found a specific hot-path that they cannot tolerate and where the checks are costing them.
Put more concisely (to quote my own presentation): our aim is to be no slower than hand-written code with equivalent safety-checks.
@maikel the multi_span::iterator
has not had much optimization attention given to it. We did some basic improvements a long time ago, but there is probably more that could be done.
@MikeGitb to answer the original question: the results we have from internal usage of span
is that in real-world applications, no one has yet seen it have a significant impact on performance. In all our internal usage (Office, Edge, CppCoreCheck...) there are things going on that are just much more expensive and even if we improved span
performance, it would barely show up in the overall application context. So I appreciate the question about real usage contexts, as micro-benchmarks can be a little tricky like that.
Of course, we want span
(and all the GSL) to be as efficient as possible, and I'll be committing changes soon that let span
better leverage the MSVC optimizer in common situations. However, in the meantime, I would suggest measuring in real-world usage before coming to any severe conclusions. :-)
Tried out with my own version of a span-like type and realised that the main noticeable difference that could explain the runtime penalty on generate+sort probably stands in the != iterator operation that tests lhs.span_ == rhs.span_ && lhs.index_ == rhs.index_
where lhs.index_ == rhs.index_
would be sufficient.
On straightforward SAXPY-like algorithms the optimizer will guess that lhs.span_ == rhs.span_
can be hoisted but on more complex algorithms (loss of visibility) it won't.
Why not use Expects()
for this consistency purpose?
By the way, shouldn't GSL_UNENFORCED_ON_CONTRACT_VIOLATION be automatically defined when NDEBUG is?
(My two cents)
Thanks for your work.
I noticed the same thing, != was getting the largest penalty for std::sort.
Thanks @prog-fh!
We encourage people to keep enforcement of contracts turned on if at all possible. So I wouldn't want to disable-by-default, just because NDEBUG
happened to be defined. We want it to be an explicit decision for the user.
I'll have a think about moving the "matching span" condition inside an Expects(). However, this sort of optimization presupposes people turning the contract-checking off...which again is something I want to avoid.
Wherever possible, I prefer to leave span
as-is, so it acts as a case to put pressure on optimization to do better ;-) Obviously there are some cases where it's just not possible without some source changes, but they are often fewer than people first think. I'm not 100% sure where this one lies, but I'll do some investigation....
Completely disabling contract checking just because a specific part of the program is slowed down is - I think - the wrong approach. I'd much rather provide a couple of wrapper functions for the standard algorithms that
@MikeGitb I think that's an excellent suggestion...and in line with the approach we're already taking with copy
where we provide a gsl::copy
that looks very similar to std::copy
but takes span
objects directly.
@neilmacintosh: I ran a few more microbenchmarks (with VS2017RC) with std::copy and other stl algorithms and when working on numbers, the performance difference was sometimes not just a few % but up to a full order of magnitude. My guess (I didn't actually dig through the assembly code) is that using gsl::span iterators essentially prevent the generation of vectorized code. Either because certain library optimizations don't fire (like turning a copy loop into memmove) and/or the checks sufficiently confuse the auto-vectorizer.
Aside from the question if there should be more wrapper functions for the STL-algorithms in the gsl, maybe this is something the stl impelmenters could have a look at (especially when implementing the ranges TS).
Does anyone have any real-world data on how using gsl::span affects overall application performance? I wrote a few ver simple micro benchmarks similar to the following:
I compiled an ran 4 different Versions with VS2015 update 3 in release mode on a Win10 IvyBridge machine and got the following results:
Assuming there is no systematic error in my benchmark (I don't want to rule that out at all), using
gsl::span
does incur a significant performance penalty when used with standard algorithms. I can't tell if this is because the compiler is unable to eliminate the range checks or because a span iterator is a bigger object.I guess my question is, if that overhead has been a problem for anyone in real applications and if/how this should be addressed or if someone might even already have a solution for this. Results from other platforms are also welcomed.
A partial solution that comes to mind would e.g. be to provide wrappers for the standard algorithms which take
gsl::span
s as parameters but internally pass raw pointers to the stl-algorithms. E.g. forsort
:Of course with all the appropriate template logic.