dmcyk opened 7 years ago
@bob-wilson, @ematejska, who do you think should look at this?
aschwaighofer@apple.com (JIRA User) could you look into this?
Today, this performance is expected. The benchmark calls the generic/non-generic code across a framework boundary, so we don't get the benefit of full specialization. Then the following happens:
`IntBitStore.subscript.setter`, the non-generic implementation, is fast. Because we know we are dealing with `Int`s, the code boils down to mutating an `Array<Int>`. The code path is fully inlined (we know we are not dealing with an NSArray backing, so that code can be statically compiled out). There is no retain/release traffic, only a call to the `isUniquelyReferenced` runtime function.
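For illustration, here is a minimal sketch of what the non-generic container might look like (reconstructed from the generic subscript quoted below; names such as `rawBuff` and `_intBitCapacity` follow that code and may not match the attachment exactly):

```swift
// A reconstruction for illustration; mirrors the generic subscript shown
// below, with StoreType fixed to Int.
public struct IntBitStore {
    private var rawBuff: [Int]
    public let capacity: Int
    private let _intBitCapacity = Int.bitWidth

    public init(capacity: Int) {
        self.capacity = capacity
        // Round up so `capacity` bits fit into Int-sized words.
        rawBuff = Array(repeating: 0, count: (capacity + Int.bitWidth - 1) / Int.bitWidth)
    }

    public subscript(index: Int) -> Bool {
        get {
            precondition(index >= 0 && index < capacity)
            let rawIndex = index / _intBitCapacity
            let bit = index - (rawIndex * _intBitCapacity)
            return rawBuff[rawIndex] & (1 << bit) != 0
        }
        set {
            precondition(index >= 0 && index < capacity)
            let rawIndex = index / _intBitCapacity
            let bit = index - (rawIndex * _intBitCapacity)
            // With Int known statically, this is a plain Array<Int> mutation:
            // fully inlinable, no witness calls, no retain/release.
            if newValue {
                rawBuff[rawIndex] |= (1 << bit)
            } else {
                rawBuff[rawIndex] &= ~(1 << bit)
            }
        }
    }
}
```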
`BitStore.subscript.setter`, the generic implementation, is slower for the following reasons:
1. User error: the setter function was not annotated, so we did not call a specialized entry point (not that this would have helped much; see the other points):
```swift
public subscript(index: Int) -> Bool {
    // (getter omitted in the original snippet)
    // Was missing:
    @_specialize(exported: true, where StoreType: _Trivial(32))
    @_specialize(exported: true, where StoreType: _Trivial(64))
    set {
        precondition(index >= 0)
        precondition(index < capacity)
        let rawIndex = index / _intBitCapacity
        let current = index - (rawIndex * _intBitCapacity)
        let _buff = rawBuff[rawIndex]
        if newValue {
            rawBuff[rawIndex] = _buff | (1 << current)
        } else {
            rawBuff[rawIndex] = _buff & ~(1 << current)
        }
    }
}
```
2. After fixing this, performance does not improve even though we call the specialized version for T where T is `_Trivial(64)`:
- We still have our indirect representation of generic types.
- This means there are witness-method calls that the optimizer can't reason about; therefore there is retain/release traffic on `Array<T>`.
- We call unspecialized `Array<T>` entry points (more retain/release traffic).
- The optimizer has not been taught that `_swift_isClassOrObjCExistentialType<T>`, where T is `_Trivial` as specified in this context, is known to be false.
I believe that after moving to an opaque-value SSA representation, specializing the `Array<T>` calls in the implementation, and inlining these specialized calls, we should get performance that is a lot better.
But there are limits:
Getting the bit width will stay a witness-method call, as in this SIL from the specialized setter:
```sil
sil @_T08BitStoreAAV9subscriptSbSicfsSbSiAByxGRlze63_s17FixedWidthIntegerRzlItMyyl_Tp5 : $@convention(method) <τ_0_0 where τ_0_0 : _Trivial(64), τ_0_0 : FixedWidthInteger> (Bool, Int, @inout BitStore<τ_0_0>) -> () {
  %127 = witness_method $τ_0_0, #FixedWidthInteger.bitWidth!getter.1 : <Self where Self : FixedWidthInteger> (Self.Type) -> () -> Int : $@convention(witness_method) <τ_0_0 where τ_0_0 : FixedWidthInteger> (@thick τ_0_0.Type) -> Int // users: %281, %266, %242, %197, %168, %148, %128
  %128 = apply %127<τ_0_0>(%48) : $@convention(witness_method) <τ_0_0 where τ_0_0 : FixedWidthInteger> (@thick τ_0_0.Type) -> Int // user: %129
}
```
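At the source level, the situation is roughly this (a hypothetical reduction, not code from the attachment): `_Trivial(64)` constrains only the layout of the type, not its identity, so the `bitWidth` access still dispatches through the `FixedWidthInteger` witness table:

```swift
// Hypothetical reduction: even in a layout-constrained specialization the
// concrete type of T is unknown, so this is a witness_method dispatch
// rather than a constant the optimizer can fold.
func storeBitWidth<T: FixedWidthInteger>(_: T.Type) -> Int {
    return T.bitWidth
}
```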
`_specialize` isn't something available to users at this point.
Thanks aschwaighofer@apple.com (JIRA User), I wasn't actually aware that a subscript could be annotated like that.
Using `BitStore` from within the same module does indeed give performance similar to `IntBitStore` when built for release.
Having checked that, though, I noticed some odd behaviour: using `IntBitStore` (non-generic) from within one binary is on average ~50% faster than using it from a separate framework.
```swift
var intBinbuff = IntBitStore(capacity: size)
var results: [Double] = []
for _ in 0 ..< 10 {
    let elapsed = Utilities.measureTime {
        for i in 0 ..< size {
            intBinbuff[i] = true
        }
    }
    results.append(elapsed)
}
print(results.reduce(0, +) / 10)
```
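(`Utilities.measureTime` isn't shown in the thread; a plausible implementation, assuming it simply times a closure and returns seconds, might be:)

```swift
import Dispatch

enum Utilities {
    // Assumed helper: runs `body` once and returns elapsed wall-clock seconds.
    static func measureTime(_ body: () -> Void) -> Double {
        let start = DispatchTime.now()
        body()
        let end = DispatchTime.now()
        return Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / 1_000_000_000
    }
}
```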
I increased `size` to 50,000,000 with 10 repeats; in the one-binary case it took on average 0.6997 s per run, while using it from a separate framework took on average 1.093 s.
That's quite a significant difference, isn't it?
Very likely: because we inlined the call to `intBinbuff[i]`, we could hoist the uniqueness check on the array buffer in the implementation of this function out of the inner loop. That would explain a 50% performance difference. If you look at the profile, you should see a call to `isUniquelyReferenced` that is not in the inner loop doing the `intBinbuff[i] = true` assignment.
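As a rough illustration (hypothetical code, not from the attachment): performing the writes through `withUnsafeMutableBufferPointer` does the uniqueness check once up front, which is essentially what the optimizer achieves by hoisting once the setter is inlined:

```swift
let size = 50_000_000
var words = [Int](repeating: 0, count: (size + Int.bitWidth - 1) / Int.bitWidth)

// The buffer is made uniquely referenced once, before the loop, so no
// per-iteration isUniquelyReferenced check remains in the hot path.
words.withUnsafeMutableBufferPointer { buf in
    for i in 0 ..< size {
        buf[i / Int.bitWidth] |= 1 << (i % Int.bitWidth)
    }
}
```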
And as Jordan said: this is an unsupported feature and syntax that might stop working, be removed, or cause some Cat to turn into a Dog at any time.
Ah, makes sense. Kinda tricky for the non-generic `IntBitStore` though; guess I'd better keep everything within one module next time I have a computation-heavy task. Thanks!
Attachment: Download
Environment
macOS Sierra 10.12.5, MacBook Air '13 2014, Swift 4.0 May 17, 2017 toolchain

Additional Detail from JIRA

| Field | Value |
|------------------|-----------------|
| Votes | 0 |
| Component/s | Compiler |
| Labels | Bug, 4.0Regression, Performance |
| Assignee | None |
| Priority | Medium |

md5: 47607273ceb0ee887d1c89ce8dd971a2

Issue Description:
Trying the new improved integer protocols in Swift 4, I wanted to change some code to a generic format; to my surprise, I found a huge performance decrease when using the generic implementation.
In the attachment there's a more or less one-to-one implementation of what I called `BitStore`: a container for binary values encoded on top of an array of integers and their bit values.
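For readers without the attachment, a minimal sketch of the container's shape, inferred from the subscript quoted earlier in this thread (the actual attachment may differ):

```swift
public struct BitStore<StoreType: FixedWidthInteger> {
    private var rawBuff: [StoreType]
    public let capacity: Int
    private var _intBitCapacity: Int { return StoreType.bitWidth }

    public init(capacity: Int) {
        self.capacity = capacity
        // One StoreType word holds `bitWidth` bits; round the word count up.
        let bits = StoreType.bitWidth
        rawBuff = Array(repeating: 0, count: (capacity + bits - 1) / bits)
    }
    // subscript(index:) -> Bool as quoted above
}
```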
In the `BitStorePerf` target there's an example showing the performance difference: the generic implementation using the `Int` type is up to 3 times slower than the plain `Int` implementation in a debug build, and when building for release it can be up to 20 times slower, even though the methods are marked with the `specialize` attribute.
*Not sure if that's really a bug or simply a trade-off of using generics, but it's quite a difference, so I thought I should report it.*