weissi opened this issue 5 years ago
cc @rjmccall, @eeckstein
```
Apple Swift version 5.0-dev (LLVM 082dec2e22, Swift 4e86c8ff52)
Target: x86_64-apple-darwin18.5.0
```
(same for Swift 5.0.1)

Ran:

```
jw-swift-latest swift -version && jw-swift-latest swiftc -O test.swift && sudo ~/devel/swift-nio/dev/malloc-aggregation.d -c ./test
```
A better stack trace for the allocation:
```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x00007fff6532ab69 libsystem_malloc.dylib`malloc
    frame #1: 0x0000000100000e14 test`specialized _NativeDictionary.subscript.modify at <compiler-generated>:0 [opt]
    frame #2: 0x0000000100000cf9 test`repro() [inlined] generic specialization <Swift.Int, Swift.Int> of Swift.Dictionary._Variant.subscript.modify : (A) -> Swift.Optional<B> at <compiler-generated>:0 [opt]
    frame #3: 0x0000000100000ce0 test`repro() [inlined] generic specialization <Swift.Int, Swift.Int> of Swift.Dictionary.subscript.modify : (A) -> Swift.Optional<B> at <compiler-generated>:0 [opt]
    frame #4: 0x0000000100000ce0 test`repro() at test.swift:12 [opt]
    frame #5: 0x0000000100000c49 test`main at test.swift:17:1 [opt]
    frame #6: 0x00007fff651803d5 libdyld.dylib`start + 1
```
The assembly confirms that it allocates unconditionally, pretty much before doing anything else:
```
(lldb) frame select 1
test was compiled with optimization - stepping may behave oddly; variables may not be available.
frame #1: 0x0000000100000e14 test`specialized _NativeDictionary.subscript.modify at <compiler-generated>:0 [opt]
(lldb) dis
test`specialized _NativeDictionary.subscript.modify:
0x100000df0 <+0>: pushq %rbp
0x100000df1 <+1>: movq %rsp, %rbp
0x100000df4 <+4>: pushq %r15
0x100000df6 <+6>: pushq %r14
0x100000df8 <+8>: pushq %r13
0x100000dfa <+10>: pushq %r12
0x100000dfc <+12>: pushq %rbx
0x100000dfd <+13>: subq $0x18, %rsp
0x100000e01 <+17>: movl %edx, -0x3c(%rbp)
0x100000e04 <+20>: movq %rsi, %r15
0x100000e07 <+23>: movq %rdi, %rbx
0x100000e0a <+26>: movl $0x40, %edi
0x100000e0f <+31>: callq 0x100001e82 ; symbol stub for: malloc
-> 0x100000e14 <+36>: movq %rax, %r14
0x100000e17 <+39>: movq %rax, (%rbx)
0x100000e1a <+42>: movq %r13, 0x8(%rax)
0x100000e1e <+46>: movq %r15, (%rax)
0x100000e21 <+49>: movq %r13, -0x38(%rbp)
0x100000e25 <+53>: movq (%r13), %r12
0x100000e29 <+57>: movq 0x28(%r12), %rdi
0x100000e2e <+62>: movq %r15, %rsi
0x100000e31 <+65>: callq 0x100001e70 ; symbol stub for: static Swift.Hasher._hash(seed: Swift.Int, _: Swift.UInt64) -> Swift.Int
[...]
```
The first basic block in the SIL looks like this:
```
// specialized _NativeDictionary.subscript.modify
sil shared [always_inline] @$ss17_NativeDictionaryV_8isUniqueq_Sgx_SbtciMSi_SiTg5 : $@yield_once @convention(method) (Int, Bool, @inout _NativeDictionary<Int, Int>) -> @yields @inout Optional<Int> {
// %0 // users: %173, %183, %3
// %1 // users: %97, %105
// %2 // users: %181, %171, %154, %137, %301, %298, %189, %192, %110, %4
bb0(%0 : $Int, %1 : $Bool, %2 : $*_NativeDictionary<Int, Int>):
%3 = struct_extract %0 : $Int, #Int._value // users: %52, %232, %273, %315, %9
%4 = load %2 : $*_NativeDictionary<Int, Int> // users: %337, %317, %51, %5
%5 = struct_extract %4 : $_NativeDictionary<Int, Int>, #_NativeDictionary._storage // users: %305, %89, %75, %43, %14, %12, %6
%6 = ref_element_addr %5 : $__RawDictionaryStorage, #__RawDictionaryStorage._seed // user: %7
%7 = load %6 : $*Int // user: %11
%8 = metatype $@thin Hasher.Type // users: %196, %11
%9 = struct $UInt64 (%3 : $Builtin.Int64) // users: %196, %11
// function_ref static Hasher._hash(seed:_:)
%10 = function_ref @$ss6HasherV5_hash4seed_S2i_s6UInt64VtFZ : $@convention(method) (Int, UInt64, @thin Hasher.Type) -> Int // users: %196, %11
%11 = apply %10(%7, %9, %8) : $@convention(method) (Int, UInt64, @thin Hasher.Type) -> Int // user: %24
%12 = ref_tail_addr %5 : $__RawDictionaryStorage, $_UnsafeBitset.Word // users: %62, %326, %32
%13 = integer_literal $Builtin.Int64, 1 // users: %78, %68, %55, %248, %235, %332, %319, %290, %277, %218, %204, %203, %38, %22, %20
%14 = ref_element_addr %5 : $__RawDictionaryStorage, #__RawDictionaryStorage._scale // user: %15
%15 = struct_element_addr %14 : $*Int8, #Int8._value // user: %16
%16 = load %15 : $*Builtin.Int8 // user: %17
%17 = builtin "sextOrBitCast_Int8_Int64"(%16 : $Builtin.Int8) : $Builtin.Int64 // user: %19
%18 = integer_literal $Builtin.Int64, 63 // users: %67, %247, %331, %289, %217, %202, %37, %19
%19 = builtin "and_Int64"(%17 : $Builtin.Int64, %18 : $Builtin.Int64) : $Builtin.Int64 // user: %20
%20 = builtin "shl_Int64"(%13 : $Builtin.Int64, %19 : $Builtin.Int64) : $Builtin.Int64 // user: %22
%21 = integer_literal $Builtin.Int1, 0 // users: %78, %55, %235, %319, %277, %204, %95, %22
%22 = builtin "ssub_with_overflow_Int64"(%20 : $Builtin.Int64, %13 : $Builtin.Int64, %21 : $Builtin.Int1) : $(Builtin.Int64, Builtin.Int1) // user: %23
%23 = tuple_extract %22 : $(Builtin.Int64, Builtin.Int1), 0 // users: %57, %321, %25
%24 = struct_extract %11 : $Int, #Int._value // user: %25
%25 = builtin "and_Int64"(%24 : $Builtin.Int64, %23 : $Builtin.Int64) : $Builtin.Int64 // users: %55, %46, %36, %30, %26
%26 = struct $Int (%25 : $Builtin.Int64) // user: %27
%27 = struct $_HashTable.Bucket (%26 : $Int) // users: %54, %42
%28 = integer_literal $Builtin.Int64, 64 // users: %66, %60, %246, %240, %330, %324, %288, %282, %216, %210, %36, %30
%29 = integer_literal $Builtin.Int64, 0 // users: %308, %70, %250, %334, %292, %220, %40
%30 = builtin "udiv_Int64"(%25 : $Builtin.Int64, %28 : $Builtin.Int64) : $Builtin.Int64 // user: %31
%31 = builtin "truncOrBitCast_Int64_Word"(%30 : $Builtin.Int64) : $Builtin.Word // user: %32
%32 = index_addr %12 : $*_UnsafeBitset.Word, %31 : $Builtin.Word // user: %33
%33 = struct_element_addr %32 : $*_UnsafeBitset.Word, #_UnsafeBitset.Word.value // user: %34
%34 = struct_element_addr %33 : $*UInt, #UInt._value // user: %35
%35 = load %34 : $*Builtin.Int64 // user: %39
%36 = builtin "urem_Int64"(%25 : $Builtin.Int64, %28 : $Builtin.Int64) : $Builtin.Int64 // user: %37
%37 = builtin "and_Int64"(%36 : $Builtin.Int64, %18 : $Builtin.Int64) : $Builtin.Int64 // user: %38
%38 = builtin "shl_Int64"(%13 : $Builtin.Int64, %37 : $Builtin.Int64) : $Builtin.Int64 // user: %39
%39 = builtin "and_Int64"(%35 : $Builtin.Int64, %38 : $Builtin.Int64) : $Builtin.Int64 // user: %40
%40 = builtin "cmp_eq_Int64"(%39 : $Builtin.Int64, %29 : $Builtin.Int64) : $Builtin.Int1 // user: %41
cond_br %40, bb1, bb2 // id: %41
```
We know there's work to do to make the coroutine lowering passes more aggressive about not heap-allocating the stack frame if e.g. there are multiple resume points. This SR will serve as a fine anchor for tracking that work.
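For illustration (this sketch is mine, not from the report, and `Wrapper` is a made-up name): every `_modify` accessor is a yield-once coroutine, and each `yield` has two continuations, the normal resume and the abort/unwind path, which is exactly the multiple-resume-point situation described above:

```swift
struct Wrapper {
    private var _value = 0

    var value: Int {
        get { _value }
        // A `_modify` accessor is a yield-once coroutine: code before the
        // `yield` runs when the access begins; the caller then mutates the
        // value in place; code after the `yield` runs when the access ends
        // normally. An error thrown during the access resumes the coroutine
        // on its unwind path instead, so the frame has multiple resume points.
        _modify {
            yield &_value
        }
    }
}

var w = Wrapper()
w.value += 1   // one begin/yield/resume cycle through the coroutine
```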
@swift-ci create
I submitted a stdlib-side workaround:
https://github.com/apple/swift/pull/26549
This can be reverted when the underlying optimization issue is resolved.
Recent testing with Swift 5.3 seems to show allocations back in the dictionary modify accessor, in the exact same place. @lorentey, did we regress this in any way?
Additional Detail from JIRA
| | |
|------------------|-----------------|
| Votes | 1 |
| Component/s | Compiler, Standard Library |
| Labels | Bug, Performance |
| Assignee | None |
| Priority | Medium |

md5: 4ba5c6b107c61899a4524c1db392517b

relates to:
Issue Description:
@Lukasa just showed me some of his HTTP/2 code that allocates a lot. He saw it allocating, presumably here: https://github.com/apple/swift/blob/577e04d35be3dba4954bd99eac5ef2af264c60b3/stdlib/public/core/DictionaryVariant.swift#L97
But the full complexity of the HTTP/2 code isn't actually necessary; even the simplest use of the modify accessor triggers the allocations:
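The original `test.swift` isn't reproduced here, but a minimal sketch consistent with the backtrace above would look something like this (`repro`, the single-key dictionary, and the 10,000-iteration count are inferred from the backtrace and the text below; the rest is an assumption):

```swift
// test.swift -- hedged reconstruction of the reproducer
func repro() {
    var dict = [0: 0]
    for _ in 0..<10_000 {
        // Optional-chained mutation goes through Dictionary's `modify`
        // accessor (a yield-once coroutine) rather than a get/set pair.
        dict[0]? += 1
    }
}

repro()
```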
Run under a dtrace script that traces every `malloc`, this shows that for 10,000 calls to the modify accessor, we allocated 10,000 times :'(.
We can reduce the code even further.
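The reduced snippet itself wasn't preserved in this extract; a hypothetical sketch of such a reduction (a single modify access suffices to observe one allocation):

```swift
var dict = [0: 0]
// One call to the modify accessor => one heap-allocated coroutine frame.
dict[0]? += 1
```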