ipld / go-ipld-prime

Golang interfaces for the IPLD Data Model, with core Codecs included, IPLD Schemas support, and some handy functional transforms tools.

Findings on performance #83

Open hannahhoward opened 4 years ago

hannahhoward commented 4 years ago

These are findings based on running a benchmark in go-graphsync. The benchmark simulates a single node getting a 10K file split into approximately 1K chunks from 20 different nodes; since it's a single process, the mem profile covers both the single requesting node and the 20 nodes responding. The thing I'm mucking around with is the metadata sent in the response. Every block covered by the selector traversal needs to have an IPLD structure like this sent with it:

{
   Link: ipld.Link,
   BlockPresent: bool
}

An array of these is sent for each response over the wire.
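For concreteness, here's a rough sketch of the plain Go struct form of that entry, which is what the existing code (and the cbor-gen comparison below) works with directly. The names are hypothetical, and cid.Cid stands in for the link since that's what a plain struct would typically hold:

```go
package message // hypothetical package name

import (
	"github.com/ipfs/go-cid"
)

// GraphSyncLinkMetadatum is a guess at the plain-Go-struct form of the metadata
// entry above; go-graphsync's real names may differ.
type GraphSyncLinkMetadatum struct {
	Link         cid.Cid
	BlockPresent bool
}

// GraphSyncMetadata is the array of entries sent with each response.
type GraphSyncMetadata []GraphSyncLinkMetadatum
```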

So first, the baseline results (ipld-prime 0.5.1):

BenchmarkRoundtripSuccess/test-20-10000-16        240     4675299 ns/op  3504042 B/op     58405 allocs/op

I'll share the mem profiles in a sec. Importantly, because Go benchmarks run a variable number of times, the allocs/op is probably the most relevant number for "best overall performance"; the mem profile itself provides the why rather than the what. OK, so now here is master using cbor-gen:

BenchmarkRoundtripSuccess/test-20-10000-16               262       5102731 ns/op     2979568 B/op      50628 allocs/op

As you can see, we're behind by about 8000 allocs/op.
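For context, here's roughly what the cbor-gen side involves: a small generator program that emits MarshalCBOR/UnmarshalCBOR methods for the plain Go struct, so encoding operates on the struct directly with no intermediate node. The file, package, and import paths here are hypothetical, and it assumes the struct sketched earlier:

```go
package main

import (
	cbg "github.com/whyrusleeping/cbor-gen"

	"example.com/yourmodule/message" // hypothetical home of GraphSyncLinkMetadatum
)

func main() {
	// Emit MarshalCBOR/UnmarshalCBOR for the metadata struct. (Whether graphsync
	// actually uses a map or tuple representation on the wire isn't asserted here.)
	if err := cbg.WriteMapEncodersToFile("metadata_cbor_gen.go", "message",
		message.GraphSyncLinkMetadatum{},
	); err != nil {
		panic(err)
	}
}
```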

Now, first of all, ipld-prime is at an immediate disadvantage here, because currently the code works directly with a Go struct, and then when we encode we copy it into a new IPLD data structure and encode that. So that's not a fair comparison at all, because cbor-gen just encodes the Go data structure directly. What if we change the code to work directly with the IPLD data structure from the beginning:

BenchmarkRoundtripSuccess/test-20-10000-16               255       4662330 ns/op     3409686 B/op      56307 allocs/op

Alright, this is probably the actual fair baseline comparison, and we've shaved off 2K of the 8K overhead.
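To be clear about what that change means: instead of filling a Go struct and copying it into a node at encode time, the metadata entry gets assembled as an ipld.Node from the start. A minimal sketch using basicnode and the fluent helpers (current API names, which may differ from the version benchmarked here):

```go
package message

import (
	ipld "github.com/ipld/go-ipld-prime"
	"github.com/ipld/go-ipld-prime/fluent"
	"github.com/ipld/go-ipld-prime/node/basicnode"
)

// buildMetadatum assembles the {Link, BlockPresent} entry directly as an
// ipld.Node, skipping the intermediate Go struct entirely.
func buildMetadatum(lnk ipld.Link, present bool) ipld.Node {
	return fluent.MustBuildMap(basicnode.Prototype.Map, 2, func(ma fluent.MapAssembler) {
		ma.AssembleEntry("Link").AssignLink(lnk)
		ma.AssembleEntry("BlockPresent").AssignBool(present)
	})
}
```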

Now, what if we use a code-gen'd node? (I did this; it was surprisingly easy.)

BenchmarkRoundtripSuccess/test-20-10000-16               256       4776305 ns/op     3154372 B/op      52351 allocs/op

Woohoo, now we're only 2K allocs/op off cbor-gen performance.
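For anyone curious what "a code-gen'd node" entails: the generated node is driven by a schema, built in Go and fed to the gengo package, roughly like the sketch below. The API names are from a current go-ipld-prime and have shifted between versions, and the type/field names are guesses, so treat this as an illustration rather than the exact code used here:

```go
package main

import (
	"github.com/ipld/go-ipld-prime/schema"
	gengo "github.com/ipld/go-ipld-prime/schema/gen/go"
)

func main() {
	ts := schema.TypeSystem{}
	ts.Init()
	// Prelude types referenced by the struct.
	ts.Accumulate(schema.SpawnBool("Bool"))
	ts.Accumulate(schema.SpawnLink("Link"))
	// The metadata entry and the list sent per response (names are guesses).
	ts.Accumulate(schema.SpawnStruct("Metadatum",
		[]schema.StructField{
			schema.SpawnStructField("link", "Link", false, false),
			schema.SpawnStructField("blockPresent", "Bool", false, false),
		},
		schema.SpawnStructRepresentationMap(nil),
	))
	ts.Accumulate(schema.SpawnList("Metadata", "Metadatum", false))
	// Writes the generated node types into the current directory.
	gengo.Generate(".", "metadata", ts, &gengo.AdjunctCfg{})
}
```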

Finally, what if we write a fast-path CBOR encode/decode (I had to hand-code this based on the cbor-gen'd code, but I suspect it could be code-gen'd):

BenchmarkRoundtripSuccess/test-20-10000-16               262       4738194 ns/op     2949574 B/op      51653 allocs/op

(actually this was the worst run — usually it comes in around 51000 allocs/op)

Basically, we're down to cbor-gen performance, off by maybe 500-1000 allocs/op. Note also that the benchmark runs faster as we go down the list; the total number of times run is the same as with cbor-gen. I assume the tokenization with refmt is still a small price that gets paid.
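To make "fast path" concrete: because the shape of the entry is fixed, the encoder can emit the DAG-CBOR bytes directly instead of driving a generic tokenizer. A toy sketch of the idea (not go-graphsync's actual code; the link value is taken as pre-encoded bytes, since DAG-CBOR links are tag 42 wrapping the CID and that's more detail than the sketch needs):

```go
package message

// encodeMetadatumFast appends the DAG-CBOR encoding of
// {"Link": <link>, "BlockPresent": <bool>} to buf. The link value is taken as
// pre-encoded bytes to keep the sketch short; only the small fixed keys are handled.
func encodeMetadatumFast(buf []byte, encodedLink []byte, present bool) []byte {
	buf = append(buf, 0xa2) // map header: major type 5, 2 entries
	buf = appendShortText(buf, "Link")
	buf = append(buf, encodedLink...)
	buf = appendShortText(buf, "BlockPresent")
	if present {
		buf = append(buf, 0xf5) // CBOR true
	} else {
		buf = append(buf, 0xf4) // CBOR false
	}
	return buf
}

// appendShortText appends a CBOR text string (major type 3) whose length is
// under 24 bytes, which covers the two fixed keys above.
func appendShortText(buf []byte, s string) []byte {
	buf = append(buf, 0x60|byte(len(s)))
	return append(buf, s...)
}
```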

Memory profiles: Profiles.zip

- ipld-prime-blank = ipld-prime, no changes
- cbor-gen = cbor-gen version
- remove-copy = use basicnode, but remove intermediate go data structure
- custom-node = use codegen'd node
- custom-node-custom-cbor-fast-path = use codegen'd node with custom cbor serializer

hannahhoward commented 4 years ago

Also, some general feedback:

More code-gen methods: one of the hurdles to usability when you're used to working directly with Go data structures is working within the confines of the ipld.Node interface. The code gen removes some of this challenge with the FieldXXX methods, but that only works for individual fields; once you get to arrays (or structs in structs) I find myself type-casting to get back to the code-gen'd version a lot. The more codegen shortcut methods you can provide, the more pleasant the code becomes to work with and the less boilerplate wrapper code you have to write. (This is if you're coming from the cbor-gen or ipfs world of just working with Go structs directly.)
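To illustrate the boilerplate being described, here's roughly what reading an entry back through the generic ipld.Node interface looks like (current method names; older releases spell them slightly differently). This is the sort of wrapper code that more generated shortcut methods would remove:

```go
package message

import (
	ipld "github.com/ipld/go-ipld-prime"
)

// readMetadatum pulls the two fields out of a generic ipld.Node. With a
// codegen'd node you'd reach for typed accessors instead, but once you're a
// level deep (lists of structs, structs in structs) you end up back at this
// interface unless you type-assert to the generated concrete type.
func readMetadatum(n ipld.Node) (ipld.Link, bool, error) {
	linkNode, err := n.LookupByString("Link")
	if err != nil {
		return nil, false, err
	}
	lnk, err := linkNode.AsLink()
	if err != nil {
		return nil, false, err
	}
	presentNode, err := n.LookupByString("BlockPresent")
	if err != nil {
		return nil, false, err
	}
	present, err := presentNode.AsBool()
	if err != nil {
		return nil, false, err
	}
	return lnk, present, nil
}
```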

Wow: the current codegen is getting really impressive in terms of all the things it does for you — really quite an achievement! The fact that with code-gen + a fast path encode/decode you’ve basically gotten down to the performance of native go struct + fast path encode/decode is… very impressive.

Something to consider: it's perhaps getting close to the point where you could code-gen selectors as a custom IPLD node, with special (hand-coded) methods to meet the selector interface. The ExploreRecursive/ExploreRecursiveEdge constraint (ExploreRecursiveEdge is not valid unless it's underneath an ExploreRecursive) is probably not possible to express in the type schema, but presumably you could do it with a ValidateSelector method or something. Anyway, it'd be pretty neat if this were possible; having separate data structures for "when it's a node" and "when it's an executable selector" creates some points of confusion when working with selectors.

Generally: go-ipld-prime is getting to be a really cool, robust library!