swiftlang / swift

The Swift Programming Language
https://swift.org
Apache License 2.0

[SR-4939] Generics code performance #47516

Open dmcyk opened 7 years ago

dmcyk commented 7 years ago
Previous ID SR-4939
Radar None
Original Reporter @dmcyk
Type Bug

Attachment: Download

Environment macOS Sierra 10.12.5, MacBook Air '13 2014 Swift 4.0 May 17, 2017 toolchain
Additional Detail from JIRA

| | |
|------------------|-----------------|
| Votes | 0 |
| Component/s | Compiler |
| Labels | Bug, 4.0Regression, Performance |
| Assignee | None |
| Priority | Medium |

md5: 47607273ceb0ee887d1c89ce8dd971a2

Issue Description:

While trying the new, improved integer protocols in Swift 4, I wanted to convert some code to a generic form. To my surprise, I found a huge performance decrease when using the generic implementation.

The attachment contains a more or less one-to-one pair of implementations of what I called BitStore: a container for boolean values encoded in the bits of an array of integers.

In the `BitStorePerf` target there's an example showing the performance difference: the generic implementation instantiated with `Int` is up to 3 times slower than the plain `Int` implementation in a debug build, and up to 20 times slower in a release build, even though the methods are marked with the `@_specialize` attribute.

*Not sure if that's really a bug or simply a trade-off of using generics, but it's quite a difference, so I thought I should report it.*
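The attachment itself isn't reproduced in this thread. For context, a minimal sketch of the generic variant being benchmarked might look like the following; the names match the snippets quoted later, but the body is a reconstruction, not the actual attached code:

```swift
// Hypothetical reconstruction of the generic BitStore: a bit set
// stored in the bits of an array of fixed-width integers.
public struct BitStore<StoreType: FixedWidthInteger> {
    private var rawBuff: [StoreType]
    public let capacity: Int

    public init(capacity: Int) {
        self.capacity = capacity
        // Round up to the number of words needed to hold `capacity` bits.
        let words = (capacity + StoreType.bitWidth - 1) / StoreType.bitWidth
        self.rawBuff = Array(repeating: 0, count: words)
    }

    public subscript(index: Int) -> Bool {
        get {
            let word = index / StoreType.bitWidth
            let bit = index % StoreType.bitWidth
            return rawBuff[word] & (1 << bit) != 0
        }
        set {
            let word = index / StoreType.bitWidth
            let bit = index % StoreType.bitWidth
            if newValue {
                rawBuff[word] |= 1 << bit
            } else {
                rawBuff[word] &= ~(1 << bit)
            }
        }
    }
}
```

The non-generic `IntBitStore` would be the same code with `StoreType` replaced by `Int` throughout.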

belkadan commented 7 years ago

@bob-wilson, @ematejska, who do you think should look at this?

bob-wilson commented 7 years ago

aschwaighofer@apple.com (JIRA User) could you look into this?

aschwaighofer commented 7 years ago

Today this performance is expected. The benchmark calls the generic/non-generic code across a framework boundary, so we don't get the benefit of full specialization. The following then happens:

`IntBitStore.subscript.setter` (the non-generic implementation) is fast. Because we know we are dealing with `Int`s, the code boils down to mutating an `Array<Int>`. The code path is fully inlined (we know we are not dealing with an NSArray backing, so that code can be statically compiled out). There is no retain/release traffic, only a call to the `isUniquelyReferenced` runtime function.

`BitStore.subscript.setter` (the generic implementation):

1. User error: the setter was not annotated, so we did not call a specialized entry point (not that this would have helped much; see the other points):

```swift
    public subscript(index: Int) -> Bool {
// Was missing
==>     @_specialize(exported: true, where StoreType: _Trivial(32))
==>     @_specialize(exported: true, where StoreType: _Trivial(64))
        set {
            precondition(index >= 0)
            precondition(index < capacity)
            let rawIndex = index / _intBitCapacity
            let current = index - (rawIndex * _intBitCapacity)
            let _buff = rawBuff[rawIndex]
            if newValue {
                rawBuff[rawIndex] = _buff | (1 << current)
            } else {
                rawBuff[rawIndex] = _buff & ~(1 << current)
            }
        }
    }
```

2. After fixing this, performance does not improve, even though we call the specialized version for `T` where `T` is `_Trivial(64)`.

I believe that after moving to an opaque-value SIL representation, specializing the `Array<T>` calls in the implementation, and inlining those specialized calls, we should get much better performance.

But there are limits:

Getting the `bitWidth` will remain a witness-method call:

```
sil @_T08BitStoreAAV9subscriptSbSicfsSbSiAByxGRlze63_s17FixedWidthIntegerRzlItMyyl_Tp5 : $@convention(method) <τ_0_0 where τ_0_0 : _Trivial(64), τ_0_0 : FixedWidthInteger> (Bool, Int, @inout BitStore<τ_0_0>) -> () {

  %127 = witness_method $τ_0_0, #FixedWidthInteger.bitWidth!getter.1 : <Self where Self : FixedWidthInteger> (Self.Type) -> () -> Int : $@convention(witness_method) <τ_0_0 where τ_0_0 : FixedWidthInteger> (@thick τ_0_0.Type) -> Int // users: %281, %266, %242, %197, %168, %148, %128
  %128 = apply %127<τ_0_0>(%48) : $@convention(witness_method) <τ_0_0 where τ_0_0 : FixedWidthInteger> (@thick τ_0_0.Type) -> Int // user: %129

}
```
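The SIL above shows the cost concretely: every `T.bitWidth` access in a non-specialized generic context is an `apply` of a `witness_method`. One way to limit that cost in a hot loop (a sketch, not something suggested in the thread) is to read the witness once into a local and reuse it:

```swift
// Sketch: cache T.bitWidth in a local so the witness-method call
// happens once, rather than relying on the optimizer to hoist it.
// `setAll` is a hypothetical helper, not part of the attachment.
func setAll<T: FixedWidthInteger>(_ store: inout [T], upTo capacity: Int) {
    let width = T.bitWidth  // one witness call for the whole loop
    for i in 0..<capacity {
        let word = i / width
        let bit = i % width
        store[word] |= 1 << bit
    }
}
```

This doesn't remove the other generic overheads (unspecialized `Array<T>` calls, opaque value handling), but it keeps the per-iteration work free of witness dispatch for `bitWidth`.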
belkadan commented 7 years ago

`@_specialize` isn't something that's available to users at this point.

dmcyk commented 7 years ago

Thanks aschwaighofer@apple.com (JIRA User), I wasn't actually aware that a subscript could be annotated like that.

Using BitStore from within the same module does indeed have performance similar to IntBitStore when built for release.
Having checked that, though, I noticed some odd behaviour: using IntBitStore (the non-generic one) from within one binary is on average ~50% faster than using it from a separate framework.

```swift
var intBinbuff = IntBitStore(capacity: size)

var results: [Double] = []

for _ in 0 ..< 10 {
    let elapsed = Utilities.measureTime {
        for i in 0 ..< size {
            intBinbuff[i] = true
        }
    }

    results.append(elapsed)
}
print(results.reduce(0, +) / 10)
```

I increased the size to 50000000 with 10 repeats; within one binary it took on average 0.6997 sec per run, while using it from a separate framework took on average 1.093 sec.
That's quite a significant difference, isn't it?

aschwaighofer commented 7 years ago

Very likely it's because, having inlined the call to `intBinbuff[i]`, we could hoist the uniqueness check of the array buffer in that function's implementation out of the inner loop. That would explain a 50% performance difference. If you look at a profile, you should see a call to `isUniquelyReferenced` sitting outside the inner loop that does the `intBinbuff[i] = true` assignment.
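The same effect can be approximated by hand: batching the writes through a single call means `Array`'s copy-on-write uniqueness check runs once per batch instead of once per subscript assignment. A sketch, assuming a hypothetical `setBits` helper on `[Int]` rather than anything from the attachment:

```swift
// Sketch: one uniqueness/CoW check for the whole loop.
// withUnsafeMutableBufferPointer ensures unique ownership once,
// then writes go straight to the buffer with no per-write check.
extension Array where Element == Int {
    mutating func setBits(upTo count: Int) {
        let width = Int.bitWidth
        withUnsafeMutableBufferPointer { buf in
            for i in 0..<count {
                buf[i / width] |= 1 << (i % width)
            }
        }
    }
}
```

This is roughly what the optimizer achieves automatically when the subscript is inlined and the check is hoisted; across a framework boundary, the per-call check remains.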

aschwaighofer commented 7 years ago

And as Jordan said: this is an unsupported feature, and the syntax might stop working, be removed, or cause some Cat to turn into a Dog at any time.

dmcyk commented 7 years ago

Ah, makes sense. It's kind of tricky for the non-generic IntBitStore, though; I guess I'd better keep everything within one module next time I have a computation-heavy task. Thanks!