Open jepers opened 7 years ago
First off, some comments on writing performance critical Swift code.
As a developer trying to understand the performance of your low-level
code, I suggest first using Instruments or your favorite Linux
profiler. Build with '-g' if the correlation between disassembly and
source isn't obvious. If you find yourself trying to understand the
disassembly in detail, a fancier disassembler like Hopper can help.
At the level of optimization that you want, a certain amount of
hand-optimization is to be expected. Most of the problems that you're
running into are the same problems that you would face with an
equivalent amount of abstraction in C++. The difference is that it's
still much harder to control the abstraction in Swift and give hints
to the optimizer. The good news is that it's not a fundamental
limitation of the Swift language and major changes are underway to
make high performance programming viable. The bad news is that it will
take a long time before enough support is in place to make Swift a fun
performance language--it's not just a matter of fixing a few optimizer
bugs.
In this case, manually constant propagating the image bounds into the
critical loop is hugely beneficial. Normally, the division inside
"unsafeRemainder" generates a regular integer divide:
idivq
When the divisor is constant, the LLVM code generator expands the
integer division into an integer multiply (imulq) and a couple of right
shifts. This is close to 10x faster!
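As a minimal illustration of the difference (hypothetical function names, not from the attached project): when the divisor is a literal, LLVM can strength-reduce the `srem` into a multiply and shifts, but when the divisor is an opaque loaded value it must emit an actual divide instruction.

```swift
// Divisor is a compile-time constant: LLVM replaces the remainder
// with an integer multiply and a couple of shifts.
func wrapConstant(_ x: Int) -> Int {
    return x % 123
}

// Divisor is an opaque runtime value: LLVM must emit `idivq`.
func wrapVariable(_ x: Int, bound: Int) -> Int {
    return x % bound
}

print(wrapConstant(400))              // 31
print(wrapVariable(400, bound: 123))  // 31
```

Both return the same result; only the generated machine code differs.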
Now, regarding how the Swift optimizer handles this benchmark... I
debugged it and fixed what I consider to be the first-order problem:
certain cases of closures inside generic protocol methods were not
being inlined: PR 15027 "closure inlining after witness
devirtualization".
To actually force full inlining, I also had to annotate a method in VectorMath.swift:
@inline(__always)
func unsafeRemainder(dividingBy other: Self) -> Self
Sadly, by itself this does not solve the bottleneck that you're seeing.
For reference, here's the LLVM code for the critical loop:
; <label>:386: ; preds = %406, %380
%387 = phi i64 [ %384, %380 ], [ -1, %406 ]
%388 = phi i64 [ %383, %380 ], [ %407, %406 ]
%389 = add i64 %387, %378
%390 = add i64 %388, %379
%391 = srem i64 %389, 123
%392 = srem i64 %390, 234
%393 = add nsw i64 %391, 123
%394 = add nsw i64 %392, 234
%395 = srem i64 %393, 123
%396 = srem i64 %394, 234
%397 = load i64, i64* %.e0._value, align 8
%398 = load i64, i64* %.e1, align 8
%399 = mul i64 %395, %397
%400 = mul i64 %396, %398
%401 = add i64 %400, %399
%402 = load i8*, i8** %._rawValue6, align 8
%403 = getelementptr inbounds i8, i8* %402, i64 %401
%.e065._value = bitcast i8* %403 to float*
%404 = load float, float* %.e065._value, align 4
%405 = fadd float %381, %404
br label %380
; <label>:406: ; preds = %380
%407 = add i64 %383, 1
%408 = icmp slt i64 %407, 2
br i1 %408, label %386, label %409
The only difference in the code with -D MAGIC is the constant operands (123, 234) to the 'srem' instructions:
%395 = srem i64 %393, 123
%396 = srem i64 %394, 234
The relevant SIL instructions are:
%literal = integer_literal $Builtin.Int64, 123, loc "./main.swift":50:26, scope 4
%bounds = struct $Int (%literal : $Builtin.Int64), loc "./main.swift":56:34, scope 4
%e0_adr = struct_element_addr %114 : $*Vector2<Int>, #Vector2.e0, loc "./Table/Table.swift":26:66, scope 268
%e0_val_adr = struct_element_addr %179 : $*Int, #Int._value, loc "./Table/Table.swift":26:66, scope 268
%e0_val = load %e0_val_adr : $*Builtin.Int64, loc "./Table/Table.swift":26:66, scope 355
%e0 = struct $Int (%e0_val : $Builtin.Int64), loc "./Table/Table.swift":26:61, scope 355
...
store %e0 to %division_arg : $*Int
With -D MAGIC, the redundant load at `%e0_val` is manually bypassed at the source level, so that we now directly store the literal in `%bounds` before calling the division builtin.
This is the only significant difference in the optimized SIL:
store %bounds to %division_arg : $*Int
Although it is very difficult for the optimizer to remove loads with
arbitrary calls and writes in between, I think it's worth leaving this
bug open as an opportunity for better alias analysis and redundant
load elimination.
Issue #1: Redundant load elimination during `Table` initialization.
The "-D KNOWABLE_SIZE -D MAGIC -D DISPEL_MAGIC" case is the first to fix. It involves a short code sequence:
let img = Table<FloatLinear2D>(size: sz) // Linear gamma grayscale raster image.
#if DISPEL_MAGIC
img[.zero].e0 = 0 // This magically dispels the magic of magicImgBounds.
#endif
#if MAGIC
let magicImgBounds = img.bounds // See how and why this is used (at only one place) in the timed code below.
#endif
I think that stricter alias analysis will be instrumental, and we do have a plan for strengthening alias analysis. Part of the problem here is likely that `img` is a class type so we're loading and storing different elements in the heap.
Issue #2: Redundant load elimination in the critical loop.
With "-D KNOWABLE_SIZE" (no MAGIC), the optimizer fails to remove
redundant loads in a tight loop that does nothing but call Builtin
arithmetic operations.
for nc in neighborhood.points {
But there is a lot of code between Table initialization and the critical loop. The optimizer needs to be able to evaluate side effects and aliasing.
I think this will minimally involve:
(a) knowing that `img` cannot escape (which isn't realistic in real code), and
(b) analyzing all accesses and calls that involve `img` to prove that size cannot be modified.
Note that, no matter how well the optimizer does on this particular benchmark, the performance of this code will always be very fragile unless the image bounds are actually an internal global constant that is never modified!
Thank you for taking the time to write such an informative answer, it is much appreciated!
If you want to guarantee that your critical loop doesn't perform integer division I can think of a couple ways to do that without relying on the optimizer. Not sure if either is practical:
Process the image in fixed-size blocks, or fixed-size ranges, and use a global let to hold that size.
Process ranges in power-of-two sizes and use a manual mask/shift rather than calling the unsafeRemainder API.
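The second workaround can be sketched as follows (hypothetical names, assuming a power-of-two size): a bitwise AND with size − 1 computes a non-negative remainder without any division instruction.

```swift
// Assumed: the table/block size is a power of two, fixed at compile time.
let size = 256          // power of two
let mask = size - 1     // 0xFF

// Equivalent to x % size for non-negative x when size is a power of two,
// but compiles to a single AND instead of an integer divide.
func wrap(_ x: Int) -> Int {
    return x & mask
}

print(wrap(300))  // 44
print(wrap(255))  // 255
```

This sidesteps the optimizer entirely, at the cost of restricting sizes to powers of two.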
Yes, I am aware of options like that. In the original project, the Vector and Table types are part of a little (internal) library for general data processing. The goal is to make it as fast as possible while at the same time being as general / convenient to use as possible.
The purpose of the little library is to make it quick and easy to develop algorithms in an experimental way, testing out different numbers of dimensions without having to rewrite a lot of boilerplate, etc. If an algorithm needs to be faster than what is possible with this little library, we'll write a hard-coded version of it, perhaps using Metal, Accelerate, simd, etc.
But the library needs to be optimized anyway, because I've concluded that if I'm not actively working to get every part of it as fast as possible (while still being sufficiently general/convenient to use), it quickly falls apart efficiency-wise and becomes so slow that it's useless even for rapid prototyping and experimentation; i.e., I don't want to wait an hour for a result that could take a second if I had cared about optimization.
Note that I'm actually (sometimes) very impressed/satisfied with the optimizer. For example: the Table type is quite a high-level abstraction. It has N-dimensional Vector indices, the Vector types being relatively high-level abstractions themselves, with generic Element and type-level Count, etc. And still, at least in some contexts, quite complex and generic code that builds upon it will be optimized into code that is as fast as a corresponding hard-coded and optimized C version.
Back to the particular issues in this demonstration program:
I assumed that if I created a Table instance with a statically knowable size, the optimizer would (inline small methods according to some heuristic and) make use of the fact that bounds is a constant for this instance.
That is, looking at the code below, I expected the optimizer to see that bounds is a "constant" computed property, since it only depends on the constants .zero and size: .zero is always a statically knowable constant value, and for a Table instance initialized with a size given by integer literals, size is statically knowable too:
final class Table<Data: TableData> {
// ...
/// The N-dimensional size (number of elements in each direction).
let size: Data.Coordinate
/// The N-dimensional byte-stride (number of bytes between
/// elements in a row, rows in a plane, planes in a …)
let stride: Data.Coordinate
var bounds: VectorRange<Data.Coordinate> { return .zero ..<& size }
var coordinates: VectorRangePoints<Data.Coordinate> {
return bounds.unsafePoints
}
// ...
}
It turned out that my expectations were partly met: a static size did indeed result in faster execution times, but (somewhat surprisingly) only if I manually propagated the static bounds constant. And (even more surprisingly) it also turned out to be very easy to accidentally ruin even my manual propagation with something seemingly unrelated, like writing to the data of the table, which I can't see how it would change the fact that magicBounds is a let constant, or, for that matter, the fact that Table's size is a statically knowable let constant and bounds is a "constant" computed property, i.e. computed only from constants.
As I understood your reply, expectations like the ones I just described might possibly be fully met by a future version of the optimizer, but for now, I need to hold its hand quite a bit in some situations, and try to learn and keep up with its sometimes strange and changing ways.
I still don't quite understand how writing to the Table instance's (img's pixel) data can make the optimizer forget that size and bounds are constant. Will read your explanation again.
Regarding the optimizer's failure to recognize size as a constant... I mentioned redundant load elimination and alias analysis. Those are aggressive attempts to prove that a value doesn't change within a region of code, and fairly fragile optimizations. I think it's possible in your case, but the fact that size is stored in a class makes it a little more difficult. Class storage is inherently mutable, so the analysis needs to prove that the class reference hasn't escaped.

Now, in your case, knowledge that the property is a `let` means that a lot of this analysis could be skipped. It seems obvious on the surface, but making all the optimizer pieces do the right thing for `let`s is nontrivial because none of those pieces are aware of when `let` initialization is taking place.
There's a more direct, less fragile approach to solving this problem called LetPropertiesOpt. Currently, that only works for class properties that are statically initialized to a constant (they may as well be static properties). All occurrences are replaced by that constant. I think it would be reasonable to extend that optimization with a flow-sensitive analysis that can handle objects that are initialized with a constant and accessed locally. That's also somewhat fragile because object creation needs to be visible from all of its accesses, but that seems to be one of the main things that you want.
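A hypothetical sketch (names invented here) of the distinction being described: the pattern the current LetPropertiesOpt already handles, versus the per-instance case that would need the proposed flow-sensitive extension.

```swift
// Case the existing optimization handles: every path stores the same
// literal, so `size` might as well be a static constant and every read
// of it can be folded to 256.
final class FixedTable {
    let size = 256
}

// Case needing the flow-sensitive extension: `size` is constant for
// each instance, but provably so only where the object's creation is
// visible from all of its accesses.
final class SizedTable {
    let size: Int
    init(size: Int) { self.size = size }
}

let t = SizedTable(size: 256)      // locally created with a constant
print(FixedTable().size + t.size)  // 512
```

In the second case the analysis is fragile: passing `t` to opaque code hides its creation site and blocks the folding.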
Thanks again for taking the time to explain these things!
As a general workaround (until the optimizer gets better), do you think it would be better to write a type like Table as a struct (instead of a final class), but still with reference semantics? That is:
// Storage would still be a class:
final class TableStorage {
let data: UnsafeMutableRawPointer
// ...
}
// But Table would be a struct with ref semantics:
struct Table<Data: TableData> {
let storage: TableStorage // This is still a possibly shared storage.
/// A pointer to the first byte of the first element.
let baseAddress: UnsafeMutableRawPointer
/// The N-dimensional size (number of elements in each direction).
let size: Data.Coordinate
/// The N-dimensional byte-stride (number of bytes between
/// elements in a row, rows in a plane, planes in a …)
let stride: Data.Coordinate
var bounds: VectorRange<Data.Coordinate> { return .zero ..<& size }
var coordinates: VectorRangePoints<Data.Coordinate> {
return bounds.unsafePoints
}
// ...
}
A Table is (for our purposes) a resource with identity, and it wouldn't make sense for it to be a value type with copy-on-write, at least not in most cases, so it felt like a natural fit for a class type. Writing it as a struct with reference semantics is kind of unorthodox, I guess, but I'm not sure I can see any real downsides if it would mean that the kinds of optimizations we're talking about here would be more likely / predictable?
I would say yes, it's better to use a struct, but if you ever pass that Table<Data> around, or store it, as an existential (protocol type), then it might result in extra heap allocation. The existential buffer is just three words, and you have at least four words, or much more depending on the number of dimensions.
So, I would say it's worth a try but ultimately depends on how the Table is being used.
By the way, we should eventually expose an API for tail-allocated class storage so you don't need the double indirection to get at the table data.
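For reference, the standard library's ManagedBuffer already offers a form of tail-allocated class storage today; a minimal sketch (the Int/Float choice here is illustrative, not from the attached project):

```swift
// A header (here just the element count) plus Float elements allocated
// in the tail of the same heap object: one allocation, one indirection.
let count = 4
let storage = ManagedBuffer<Int, Float>.create(minimumCapacity: count) { _ in
    count
}

// Initialize the tail-allocated elements.
storage.withUnsafeMutablePointerToElements { ptr in
    for i in 0..<count { ptr[i] = Float(i) }
}

// Read them back.
let sum = storage.withUnsafeMutablePointerToElements { ptr in
    (0..<count).reduce(Float(0)) { $0 + ptr[$1] }
}
print(sum)  // 6.0
```

A dedicated Table storage class could subclass ManagedBuffer the same way, avoiding the separate data pointer.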
Great, thanks!
To be clear, your original intuition is generally correct. Make it a class if it has class-like semantics. I was only speaking to the possibility of optimizing away the `size` property as a compile-time constant. In certain circumstances, the optimizer may have an easier time "seeing through" the struct initialization.
Yes, understood.
(btw, just in case you'd be interested in having a look at my latest stumbling block: https://bugs.swift.org/browse/SR-7150 )
Attachment: Download
Environment
Recent development snapshots (at least 2017-10-26 … 29), macOS 10.13.

Additional Detail from JIRA
Votes: 0
Component/s: Compiler
Labels: Bug, Performance
Assignee: None
Priority: Medium
md5: c127c1be23378239d82633a19201e410
relates to:
Issue Description:
EDIT 1: I have attached an updated version of the code so that it will compile without warnings with snapshot 2017-12-03 (and also added a // NOTE about the need to use @inline(__always) in Table.swift).
EDIT 2: Attached another version of the demonstration program updated to compile with dev snapshot 2018-03-03.
The file below is part of the program in the attachment (the updated one) and explains the issue:
main.swift
So the demonstration is meant to compile without errors and produce images like the one contained in the attached zip file. And the expected result (assuming these really are issues and not expected behaviors) is that the compiler should be able to optimize this program (so that time is 2 rather than 8) with -D KNOWABLE_SIZE but without needing -D MAGIC.
(I first encountered these issues within Xcode and I could've attached an Xcode project or Swift package but I think the risk of accidentally introducing unrelated issues might be reduced by only attaching the source files and a simple build.sh to compile them.)