Open mateusz834 opened 2 years ago
CC @golang/security
This seems like a good use case for a sync.Pool, to solve the performance issue without duplicating APIs. Could you try benchmarking that?
I am now benchmarking several AddUint*LengthPrefixed calls in a single iteration:
a := NewBuilder(buf)
a.AddUint8LengthPrefixed(func(b *Builder) {
b.AddBytes([]byte("123456"))
b.AddUint8LengthPrefixed(func(b *Builder) {
b.AddBytes([]byte("123456"))
})
b.AddUint8LengthPrefixed(func(b *Builder) {
b.AddBytes([]byte("123456"))
})
b.AddUint8LengthPrefixed(func(b *Builder) {
b.AddBytes([]byte("123456"))
})
})
a.AddUint8LengthPrefixed(func(b *Builder) {
b.AddBytes([]byte("123456"))
b.AddUint8LengthPrefixed(func(b *Builder) {
b.AddBytes([]byte("123456"))
})
})
a.AddUint8LengthPrefixed(func(b *Builder) {
b.AddBytes([]byte("123456"))
b.AddUint8LengthPrefixed(func(b *Builder) {
b.AddBytes([]byte("123456"))
})
})
BenchmarkAddLengthPrefixedOld (now uses sync.Pool)
BenchmarkLengthPrefixed-4 7148973 172.2 ns/op 0 B/op 0 allocs/op
BenchmarkAddLengthPrefixedOld-4 643360 2352 ns/op 958 B/op 6 allocs/op
(In this BenchmarkAddLengthPrefixedOld I also removed 2 additional allocations per AddUint*LengthPrefixed call.) The performance is even worse now.
But we could add a new method called (maybe?) NoParentUsageCheck that causes the AddUint*LengthPrefixed methods to pass the original Builder to the callback. We would then lose the parent-builder usage panic, because the original Builder is being reused.
BenchmarkLengthPrefixed-4 6922315 169.9 ns/op 0 B/op 0 allocs/op
BenchmarkAddLengthPrefixedOld-4 4758037 313.0 ns/op 80 B/op 1 allocs/op
Edit: we might also just pass nil as the *Builder argument of the LengthPrefixed callback (after calling EmptyBuilder); this way we can remove the last 80 B allocation. It looks like escape analysis is smart enough that it might remove the alloc.
BenchmarkLengthPrefixed-4 69277897 174.8 ns/op 0 B/op 0 allocs/op
BenchmarkAddLengthPrefixedOld-4 58995422 192.4 ns/op 0 B/op 0 allocs/op
This is how it could be used then:
b := NewBuilder(buf)
b.EmptyBuilder()
b.AddUint8LengthPrefixed(func(_ *Builder) {
b.AddUint8LengthPrefixed(func(_ *Builder) {
b.AddBytes([]byte("123456"))
})
})
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group
It would be good to find some way to do this optimization without doubling the API here.
@rsc As described in https://github.com/golang/go/issues/54059#issuecomment-1196342434, it would only introduce one new method: NoParentUsageCheck or EmptyBuilder.
So, to summarize, we have 2 possibilities: 1) NoParentUsageCheck - child is equal to b; the Builder inside AddUint*LengthPrefixed is just the original Builder.
b := NewBuilder(buf)
b.NoParentUsageCheck()
b.AddUint8LengthPrefixed(func(child *Builder) {
child.AddBytes([]byte("123456"))
})
2) EmptyBuilder - the Builder inside AddUint*LengthPrefixed is nil.
b := NewBuilder(buf)
b.EmptyBuilder()
b.AddUint8LengthPrefixed(func(_ *Builder) {
b.AddBytes([]byte("123456"))
})
NoParentUsageCheck 313.0 ns/op 80 B/op 1 allocs/op
EmptyBuilder 192.4 ns/op 0 B/op 0 allocs/op
(it benchmarks code as in: https://github.com/golang/go/issues/54059#issuecomment-1196334746)
I feel like the EmptyBuilder way is more explicit about what we are doing. NoParentUsageCheck might invite mistakes that would be really confusing (the code below works; we cannot panic on parent usage because we are reusing the *Builder):
b := NewBuilder(buf)
b.NoParentUsageCheck()
b.AddUint8LengthPrefixed(func(c *Builder) {
c.AddBytes([]byte("123456"))
b.AddBytes([]byte("123456"))
c.AddUint8LengthPrefixed(func(cc *Builder) {
cc.AddBytes([]byte("123456"))
c.AddBytes([]byte("123456"))
})
})
It will still work as expected: each AddBytes() call is written to the LengthPrefixed part in which it was executed.
/cc @golang/security
The b.NoParentUsageCheck() seems nice, but at that point why not make that the default, so that existing code all runs faster? I can't think of any case where it's valid to use c after the callback returns, and we know the callback never uses b (or cryptobyte would panic). So code should just run faster, with no API additions at all. Thoughts?
I thought that we shouldn't get rid of this panic; that was indeed the reason why I proposed the new API. But as I think about it again, it should be fine to do so. Overall I am fine with your idea; it will also simplify the implementation.
@golang/security any thoughts about https://github.com/golang/go/issues/54059#issuecomment-1218310609? That is, just start passing the parent Builder as the child Builder and let all code get faster.
I am trying to figure out a Chesterton's Fence explanation for why the callbacks take a Builder in the first place, and why the panic is implemented and tested. If it's just because it's more explicit than requiring closures and to allow passing a func(*Builder), then it feels reasonable to sacrifice the panic to reduce allocations.
Developers are unlikely to use the parent Builder in the callback expecting it to do anything other than write to the length-prefixed span, and Builders are not to be used concurrently.
👍 from me.
Seems reasonable. This will require a not-insignificant change to the length prefixed logic to use a single builder rather than nested builders, but I think it makes for a slightly less confusing (and faster) API overall.
@rsc does it have to be a proposal then? It will not introduce any new API.
Change https://go.dev/cl/428475 mentions this issue: cryptobyte: AddUint*LengthPrefixed API perfomance optimization
I prototyped an optimization in the CL (above).
Re Chesterton, I think once you have child Builders vs parent Builders you have to have the panic to catch misuse due to confusion. But now what was formerly confused misuse would become correct code.
Even though there are no new API functions, it's a significant enough semantic change to be worth continuing the proposal process to a resolution.
Also, I think we should add some strict rules to the docs of BuilderContinuation, like: "inside func(child *Builder) you can only use the Builder supplied as the argument; there are no guarantees about what happens when using the parent builders".
Also, it is worth noting that this will affect the AddASN1 method as well (not only the LengthPrefixed methods).
Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group
No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group
I may not have followed the discussion closely enough, but why is the sync.Pool not an option (as log does)?
@andig With sync.Pool, the performance was even worse than the original implementation.
@mateusz834 please forgive me, responding as a learning exercise for me. Please don't consider me rude. I gave it a quick try as I happened to look into log's usage of sync.Pool some time ago. Here's what I did:
var pool = sync.Pool{
New: func() interface{} {
return new(Builder)
},
}
In addLengthPrefixed:
// b.child = &Builder{
// result: b.result,
// fixedSize: b.fixedSize,
// offset: offset,
// pendingLenLen: lenLen,
// pendingIsASN1: isASN1,
// inContinuation: b.inContinuation,
// }
bb := pool.Get().(*Builder)
// don't need to init child; it must be nil to be put into the pool
bb.err = nil
bb.result = b.result
bb.fixedSize = b.fixedSize
bb.offset = offset
bb.pendingLenLen = lenLen
bb.pendingIsASN1 = isASN1
bb.inContinuation = b.inContinuation
b.child = bb
plus pool.Put(bb) at the end. This is the benchmark difference:
func BenchmarkLengthPrefixed(tb *testing.B) {
b := NewBuilder(make([]byte, 0, 512))
for i := 0; i < tb.N; i++ {
b.AddUint8LengthPrefixed(func(b *Builder) {
b.AddBytes([]byte("123456"))
})
b.AddUint8LengthPrefixed(func(b *Builder) {
b.AddBytes([]byte("abcdef"))
})
b.AddUint8LengthPrefixed(func(b *Builder) {
b.AddBytes([]byte("qwerty"))
})
}
}
Goes from
BenchmarkLengthPrefixed-8 4758802 224.2 ns/op 420 B/op 6 allocs/op
to
BenchmarkLengthPrefixed-8 11953297 94.57 ns/op 128 B/op 3 allocs/op
All tests pass except TestASN1(U)Int64, which I couldn't get to compile for some weird reason.
Maybe it's of any use...
@andig Oh, it seems that I made a mistake in the benchmark with sync.Pool. I improved your code and was able to get to 0 allocs/op.
benchmark old ns/op new ns/op delta
BenchmarkLengthPrefixed-4 1660 371 -77.63%
benchmark old allocs new allocs delta
BenchmarkLengthPrefixed-4 17 0 -100.00%
benchmark old bytes new bytes delta
BenchmarkLengthPrefixed-4 777 0 -100.00%
@andig Thanks for pointing this out.
Not sure which method we should choose now. The proposed approach (reusing *Builder) is a bit faster, but not significantly. For BenchmarkLengthPrefixed the difference is about 100 ns/op.
Apologies for the mistake.
Change https://go.dev/cl/433503 mentions this issue: cryptobyte: AddUint*LengthPrefixed API perfomance optimization with sync.Pool
Not sure which method we should choose now.
IMHO one potential downside of the pool approach is that the Builder.result will remain in the pool until the Builder is retrieved. Not sure if that could be a potential security issue?
(Builder).AddUint*LengthPrefixed allocates a new Builder on every call. This allocation causes ~17% of all heap allocations in a crypto/tls handshake.
Proposal:
Add a new non-allocating API, Uint*LengthPrefixed:
The new methods will not pass a Builder to the func(), as the original API does. Instead, they will require the use of the original Builder. All writes to the original Builder inside the Uint*LengthPrefixed methods will be written to the LengthPrefixed part in which they were executed.
Implementation example: commit
Benchmarks: