x/crypto/cryptobyte: non allocating AddUint*LengthPrefixed and AddASN1 API

mateusz834 commented 2 years ago

The (Builder).AddUint*LengthPrefixed allocates on every call a new Builder. This allocation cause ~17% of all heap allocations of a crypto/tls tls handshake.

Proposal:

Add new non allocating API Uint*LengthPrefixed:

p := NewBuilder(make([]byte,0,512))
p.Uint8LengthPrefixed(func () {
    p.Uint8LengthPrefixed(func () {
        p.AddUint8(42)
    })
    p.AddUint8(11)
})

The new methods will not return a Builder in the func(), as the original API does. Instedad it will require the usage of the original Builder. All writes to the original Builder, inside Uint\LengthPrexied methods will write them accordingly to the LengthPrefixed part in which they were executed.

Implementation example: commit

Benchmarks:

BenchmarkLengthPrefixed-4            15930202                70.18 ns/op            0 B/op          0 allocs/op
BenchmarkAddLengthPrefixedOld-4          2160042               762.2 ns/op           292 B/op          7 allocs/op

b := NewBuilder(buf)
b.Uint8LengthPrefixed(func() {
    b.AddBytes([]byte("123456"))
})
b.Uint8LengthPrefixed(func() {
    b.AddBytes([]byte("abcdef"))
})
b.Uint8LengthPrefixed(func() {
        b.AddBytes([]byte("qwerty"))
})

ianlancetaylor commented 2 years ago

CC @golang/security

FiloSottile commented 2 years ago

This seems like a good use case for a sync.Pool, to solve the performance issue without duplicating APIs. Could you try benchmarking that?

mateusz834 commented 2 years ago

I am benchmarking now more AddUintLengthPrefixeds in single iteration.

a := NewBuilder(buf)
a.AddUint8LengthPrefixed(func(b *Builder) {
    b.AddBytes([]byte("123456"))
    b.AddUint8LengthPrefixed(func(b *Builder) {
        b.AddBytes([]byte("123456"))
    })
    b.AddUint8LengthPrefixed(func(b *Builder) {
        b.AddBytes([]byte("123456"))
    })
    b.AddUint8LengthPrefixed(func(b *Builder) {
        b.AddBytes([]byte("123456"))
    })
})
a.AddUint8LengthPrefixed(func(b *Builder) {
    b.AddBytes([]byte("123456"))
    b.AddUint8LengthPrefixed(func(b *Builder) {
        b.AddBytes([]byte("123456"))
    })
})
a.AddUint8LengthPrefixed(func(b *Builder) {
    b.AddBytes([]byte("123456"))
    b.AddUint8LengthPrefixed(func(b *Builder) {
        b.AddBytes([]byte("123456"))
    })
})

BenchmarkAddLengthPrefixedOld (now uses sync.Pool)

BenchmarkLengthPrefixed-4       7148973             172.2 ns/op             0 B/op          0 allocs/op
BenchmarkAddLengthPrefixedOld-4           643360              2352 ns/op             958 B/op          6 allocs/op

(in this BenchmarkAddLengthPrefixedOld I also removed 2 additional allocations per AddUintLengthPrefixed) The performance is even worse now.

mateusz834 commented 2 years ago

But we can add a new method called (maybe?) NoParentUsageCheck that will cause the AddUintLengthPrefixed methods to return the original Builder, but then we would not get the parent builder usage panic (because we are reusing the original Builder).

BenchmarkLengthPrefixed-4             6922315               169.9 ns/op             0 B/op       0 allocs/op
BenchmarkAddLengthPrefixedOld-4          4758037               313.0 ns/op            80 B/op       1 allocs/op

Edit: we might also just past nil as the *Builder in LengthPrefixed arg (after calling EmptyBuilder), this way we can remove this last 80B allocation. It looks like the heap analysis is smart, and it might remove the alloc.

BenchmarkLengthPrefixed-4            69277897               174.8 ns/op             0 B/op            0 allocs/op
BenchmarkAddLengthPrefixedOld-4         58995422               192.4 ns/op             0 B/op            0 allocs/op

This is how it could be used then:

b := NewBuilder(buf)
b.EmptyBuilder()
b.AddUint8LengthPrefixed(func(_ *Builder) {
    b.AddUint8LengthPrefixed(func(_ *Builder) {
        b.AddBytes([]byte("123456"))
    })
})

rsc commented 2 years ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

rsc commented 2 years ago

It would be good to find some way to do this optimization without doubling the API here.

mateusz834 commented 2 years ago

@rsc https://github.com/golang/go/issues/54059#issuecomment-1196342434 it will only introduce new method NoParentUsageCheck or EmptyBuilder.

mateusz834 commented 2 years ago

So to summary that, we have 2 possibilities: 1) NoParentUsageCheck - child is equal to b, the Builder inside AddUint\LengthPrefixed is just the original Builder.

b := NewBuilder(buf)
b.NoParentUsageCheck()
b.AddUint8LengthPrefixed(func(child *Builder) { {
    child.AddBytes([]byte("123456"))
})

2) EmptyBuilder - Builder inside AddUint\LengthPrefixed is nil.

b := NewBuilder(buf)
b.EmptyBuilder()
b.AddUint8LengthPrefixed(func(_ *Builder) { {
    b.AddBytes([]byte("123456"))
})

NoParentUsageCheck 313.0 ns/op 80 B/op 1 allocs/op EmptyBuilder 192.4 ns/op 0 B/op 0 allocs/op (it benchmarks code as in: https://github.com/golang/go/issues/54059#issuecomment-1196334746)

I feel like the EmptyBuilder way is more explicit what we are doing. NoParentUsageCheck might cause mistakes, which would be really confusing. (it will work, we cannot panic on parent usage, we are reusing *Builder):

b := NewBuilder(buf)
b.NoParentUsageCheck()
b.AddUint8LengthPrefixed(func(c *Builder) { {
    c.AddBytes([]byte("123456"))
    b.AddBytes([]byte("123456"))
        c.AddUint8LengthPrefixed(func(cc *Builder) { {
            cc.AddBytes([]byte("123456"))
            c.AddBytes([]byte("123456"))
        })
})

It will still work as expected, AddBytes() will be written accordingly to the LengthPrefixed part in which in was executed.

rsc commented 2 years ago

/cc @golang/security

rsc commented 2 years ago

The b.NoParentUsageCheck() seems nice, but at that point why not make that the default, so that existing code all runs faster? I can't think of any case where it's valid to use c after the callback returns, and we know the callback never uses b (or cryptobyte would panic). So code should just run faster, with no API additions at all. Thoughts?

mateusz834 commented 2 years ago

I thought that we shouldn't get rid of this panic, that was indeed the reason why i proposed the new api. but as I think about it again it should be fine to do so. Overall I am fine with your idea, it will also simplify the implementation.

rsc commented 2 years ago

@golang/security any thoughts about https://github.com/golang/go/issues/54059#issuecomment-1218310609? That is, just start passing the parent Builder as the child Builder and let all code get faster.

FiloSottile commented 2 years ago

I am trying to figure out a Chesterton's Fence explanation for why the callbacks take a Builder in the first place, and why the panic is implemented and tested. If it's just because it's more explicit than requiring closures and to allow passing a func(*Builder), then it feels reasonable to sacrifice the panic to reduce allocations.

Trying to use the parent Builder in the callback is not something developers might do with an expectation that it would do something else than write to the length-prefixed span, and Builders are not to be used concurrently.

👍 from me.

rolandshoemaker commented 2 years ago

Seems reasonable. This will require a not-insignificant change to the length prefixed logic to use a single builder rather than nested builders, but I think it makes for a slightly less confusing (and faster) API overall.

mateusz834 commented 2 years ago

@rsc does it have to be a proposal then? It will not introduce any new API.

gopherbot commented 2 years ago

Change https://go.dev/cl/428475 mentions this issue: cryptobyte: AddUint*LengthPrefixed API perfomance optimization

mateusz834 commented 2 years ago

I prototyped an optimization in the CL (above).

rsc commented 2 years ago

Re Chesterton, I think once you have child Builders vs parent Builders you have to have the panic to catch misuse due to confusion. But now what was formerly confused misuse would become correct code.

Even though there are no new API functions, it's a significant enough semantic change to be worth continuing the proposal process to a resolution.

mateusz834 commented 2 years ago

Also I think that we should also add some strict rules to the docs of BuilderContinuation. Like: "inside func(child *Builder) you can only use the Builder supplied from the argument, there are no guarantees what would happen while using the parent builders".

mateusz834 commented 2 years ago

Also it is worth noting that this will also affect AddASN1 method (not only the LengthPrefixed methods)

rsc commented 1 year ago

Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group

rsc commented 1 year ago

No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group

andig commented 1 year ago

I may not have followed the discussion closely enough, but why is the sync.Pool not an option (as log does)?

mateusz834 commented 1 year ago

@andig using sync.Pool the performance was even worse than the original implementation.

andig commented 1 year ago

@mateusz834 please forgive me, responding as a learning exercise for me. Please don't consider me rude. I gave it a quick try as I happened to look into log's usage of sync.Pool time ago. Here's what I did:

var pool = sync.Pool{
    New: func() interface{} {
        return new(Builder)
    },
}

In addLengthPrefixed:

// b.child = &Builder{
//  result:         b.result,
//  fixedSize:      b.fixedSize,
//  offset:         offset,
//  pendingLenLen:  lenLen,
//  pendingIsASN1:  isASN1,
//  inContinuation: b.inContinuation,
// }

bb := pool.Get().(*Builder)

// don't need to init child; it must be nil to be put into the pool
bb.err = nil
bb.result = b.result
bb.fixedSize = b.fixedSize
bb.offset = offset
bb.pendingLenLen = lenLen
bb.pendingIsASN1 = isASN1
bb.inContinuation = b.inContinuation

b.child = bb

plus

pool.Put(bb)

in the end. This is the benchmark difference:

func BenchmarkLengthPrefixed(tb *testing.B) {
    b := NewBuilder(make([]byte, 0, 512))

    for i := 0; i < tb.N; i++ {
        b.AddUint8LengthPrefixed(func(b *Builder) {
            b.AddBytes([]byte("123456"))
        })
        b.AddUint8LengthPrefixed(func(b *Builder) {
            b.AddBytes([]byte("abcdef"))
        })
        b.AddUint8LengthPrefixed(func(b *Builder) {
            b.AddBytes([]byte("qwerty"))
        })
    }
}

Goes from

BenchmarkLengthPrefixed-8        4758802           224.2 ns/op       420 B/op          6 allocs/op

to

BenchmarkLengthPrefixed-8       11953297            94.57 ns/op      128 B/op          3 allocs/op

All tests pass except TestASN1(U)Int64 which I couldn't get to compile for some weird reason.

Maybe it's of any use...

mateusz834 commented 1 year ago

@andig Oh, it seems that I've made a mistake in benchmark with sync.Pool. I improved your code and I was able to get to 0 allocs/op.

benchmark                     old ns/op     new ns/op     delta
BenchmarkLengthPrefixed-4     1660          371           -77.63%

benchmark                     old allocs     new allocs     delta
BenchmarkLengthPrefixed-4     17             0              -100.00%

benchmark                     old bytes     new bytes     delta
BenchmarkLengthPrefixed-4     777           0             -100.00%

@andig Thanks for pointing this out.

Not sure which method we should choose now. The proposed approach (reusing *Builder) is a bit faster, but not significantly. For BenchmarkLengthPrefixed it is about ~100ns/op.

Apologies for the mistake.

gopherbot commented 1 year ago

Change https://go.dev/cl/433503 mentions this issue: cryptobyte: AddUint*LengthPrefixed API perfomance optimization with sync.Pool

andig commented 1 year ago

Not sure which method we should choose now.

Imho one potential downside of the pool approach is that the Builder.result will remain in the pool until the Builder is retrieved. Not sure if that could be a potential security issue?

golang / go

x/crypto/cryptobyte: non allocating AddUint*LengthPrefixed and AddASN1 API #54059