ipld / js-dag-cbor

dag-cbor codec for IPLD

feat!: switch CBOR library to cborg #13

Closed · rvagg closed this 3 years ago

rvagg commented 3 years ago

BREAKING CHANGE!

See cborg for additional details.

I will follow up with a comment about performance characteristics.

We may want to hold off on merging this PR until https://github.com/ipld/specs/issues/342 is resolved. If we decide to go the strict route and reject the IEEE754 specials, it would be good to bake that into this codec now, since that will be a breaking change as well. cborg already has flags to handle this on decode, and the encode side can be handled here.
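
For the decode side, I'd expect this to end up as flags passed through to cborg, roughly like this (a sketch only; exact option names and defaults depend on how the spec discussion lands):

```js
import { decode } from 'cborg'

// Sketch of decode-side strictness: reject the IEEE754 specials outright.
// Option names are as cborg has them today; treat this as illustrative.
const decodeOptions = {
  allowNaN: false,      // reject NaN
  allowInfinity: false, // reject Infinity and -Infinity
  allowIndefinite: false,
  strict: true
}

export const strictDecode = (bytes) => decode(bytes, decodeOptions)
```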

rvagg commented 3 years ago

Performance

It's been very hard to pin down a consistent performance story. Because borc and cborg take such different approaches to both encode and decode, the results depend heavily on object shapes and sizes, and even on the way the codec is used.

The majority of the performance differences come down to byte array handling in JavaScript and Node.js: allocations, copies, and UTF8 handling. borc's main performance advantage is being able to lean on Node.js' Buffer. This shows up mostly in decode: borc does an up-front Node.js Buffer heap allocation, memcpys the incoming byte array into it, and decodes from there, which gives it all the performance benefits of Buffer and its various utilities even when you feed it a Uint8Array. So UTF8 decoding, number reading (DataView is the browser alternative and it sucks) and slicing all come into play here.
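
To make that concrete, here's the rough shape of the two styles of primitive reads where Buffer pulls ahead of a plain Uint8Array (an illustrative sketch, not the actual borc or cborg code):

```js
// Reading a float and a utf8 string: the Buffer branch gets optimised native
// implementations, the Uint8Array branch falls back to DataView and TextDecoder.
const utf8Decoder = new TextDecoder()

function readFloat64 (bytes, offset) {
  if (typeof Buffer !== 'undefined' && Buffer.isBuffer(bytes)) {
    return bytes.readDoubleBE(offset) // fast native read
  }
  return new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength).getFloat64(offset)
}

function readUtf8 (bytes, start, end) {
  if (typeof Buffer !== 'undefined' && Buffer.isBuffer(bytes)) {
    return bytes.toString('utf8', start, end) // fast native utf8 decode
  }
  return utf8Decoder.decode(bytes.subarray(start, end))
}
```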

But borc's approach comes at a cost, because the "heap" size needs to be known up-front. That's where some of our headaches with borc have come from, including configureEncoder() and the auto-grow feature we added a couple of years ago to handle larger objects. The heap persists for as long as the instance of this codec does: decode a very large object and a large-enough heap will sit in memory afterward. Growing the heap has costs too, because you discard the old Buffer and allocate a new one.
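
Growing the heap is roughly this (a simplified sketch, not borc's actual code), which is why it's something you want to do rarely:

```js
// Discard the old heap, allocate a bigger one, memcpy the existing contents
// across; the old Buffer is left for the garbage collector.
function growHeap (heap, minSize) {
  let newSize = heap.length
  while (newSize < minSize) newSize *= 2
  const newHeap = Buffer.allocUnsafe(newSize)
  heap.copy(newHeap)
  return newHeap
}
```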

cborg doesn't use a pre-allocated heap; it just assumes a Uint8Array and parses through it. It will lean on Buffer utilities where they're available and where it can get away with it, but not in a way that forces them into the browser: it only tries when !process.browser, and only via global.Buffer, to avoid triggering auto-bundling. Even then it can only make limited use of Buffer, because it has to assume the incoming bytes are simply a Uint8Array.
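
The guard is something like this (simplified; the real code in cborg differs in detail):

```js
// Only reach for Buffer when we're clearly in Node.js, and only via the global
// so bundlers don't auto-include a Buffer shim for the browser.
const useBuffer = typeof process !== 'undefined' && !process.browser &&
  typeof globalThis.Buffer !== 'undefined'

function utf8Slice (bytes, start, end) {
  if (useBuffer) {
    // zero-copy Buffer view over the same memory, then the fast native utf8 path
    return globalThis.Buffer.from(bytes.buffer, bytes.byteOffset, bytes.byteLength)
      .toString('utf8', start, end)
  }
  return new TextDecoder().decode(bytes.subarray(start, end))
}
```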

For encode the approach is totally different. borc allocates a plain array, puts bytes into it, and converts it to a Uint8Array at the end. cborg allocates small chunks, fills them up, slices and dices, manages a list of chunks, and then concatenates them at the end (mostly incurring memcpys, but occasionally just a slice). It's like a write version of bl, and pretty handy for this type of process.
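
Stripped right down, the chunk-list idea looks something like this (a minimal, single-use sketch of the concept, not cborg's actual implementation):

```js
// Fill fixed-size chunks, keep a list, concatenate once at the end.
class ByteList {
  constructor (chunkSize = 256) {
    this.chunkSize = chunkSize
    this.chunks = []
    this.current = new Uint8Array(chunkSize)
    this.offset = 0 // write position within `current`
  }

  push (bytes) {
    if (this.offset + bytes.length > this.current.length) {
      this.chunks.push(this.current.subarray(0, this.offset)) // no copy, just a view
      this.current = new Uint8Array(Math.max(this.chunkSize, bytes.length))
      this.offset = 0
    }
    this.current.set(bytes, this.offset)
    this.offset += bytes.length
  }

  toBytes () {
    this.chunks.push(this.current.subarray(0, this.offset))
    const total = this.chunks.reduce((acc, chunk) => acc + chunk.length, 0)
    const out = new Uint8Array(total)
    let pos = 0
    for (const chunk of this.chunks) { // one concatenating memcpy per chunk
      out.set(chunk, pos)
      pos += chunk.length
    }
    return out
  }
}
```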

In narrow benchmark situations, configured with enough heap, borc is difficult to beat head-to-head for most object types. I've built a suite of benchmarks that test against randomised IPLD-style data, and the results vary a lot. For one-off uses, borc is very slow: its up-front needs are large, and its "fast" mode (which we use here) requires persisting a Decoder instance. I wasted a lot of time chasing minor improvements; that's basically what I've spent the last month doing with cborg.

But when consumed through this codec, borc becomes a lot slower. The need to deal strictly with Uint8Arrays here means some costly conversions along the way, for both encode and decode. For encode we also do circular-reference checking and replace CIDs with Tagged objects, so we end up walking an entire object twice before we even hand it to borc. cborg does circular-reference checking itself, and I added hooks in cborg to handle everything we need to do to prepare and check objects, so there is a single object walk for the whole encode process. Auto-growing the heap also imposes a cost.
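
For reference, the pre-walk that the borc path needs is roughly this shape (a simplified sketch of the concept, not the exact code in this repo):

```js
import { CID } from 'multiformats/cid'
import borc from 'borc'

// Walk the whole object before encoding: swap CIDs for tag-42 placeholders and
// check for circular references. cborg does this inside its own encode walk.
function prepare (obj, seen = new Set()) {
  const cid = CID.asCID(obj)
  if (cid) {
    // dag-cbor encodes a CID as tag 42 over a byte string with a leading 0x00
    const bytes = new Uint8Array(cid.bytes.length + 1)
    bytes.set(cid.bytes, 1)
    return new borc.Tagged(42, bytes)
  }
  if (obj && typeof obj === 'object') {
    if (seen.has(obj)) throw new Error('object contains circular references')
    seen.add(obj)
    const out = Array.isArray(obj) ? [] : {}
    for (const [key, value] of Object.entries(obj)) out[key] = prepare(value, seen)
    seen.delete(obj)
    return out
  }
  return obj
}
```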

The attempt to remove Buffer from borc and make it friendly for browser bundling resulted in more than halving both decode and encode speed by borc's own benchmarks. If we ran similar benchmarks at the @ipld/dag-cbor level, I think the differences would be smaller, because this codec forces Uint8Array from the outside; some of my benchmarking against the branch in that PR seemed to indicate this.

These are some benchmarks consuming @ipld/dag-cbor with borc vs cborg, run over various sizes of the kinds of random objects I'm interested in (generated with ipld-garbage):

rnd-100 @ 1,000
        encode (avg 42 b): cborg @ 145,438 op/s / borc @ 115,065 op/s = 126.4 %
        decode (avg 42 b): cborg @ 227,360 op/s / borc @ 206,573 op/s = 110.1 %
rnd-300 @ 1,000
        encode (avg 106 b): cborg @ 122,396 op/s / borc @ 82,172 op/s = 149 %
        decode (avg 106 b): cborg @ 146,816 op/s / borc @ 160,064 op/s = 91.7 %
rnd-nomap-300 @ 1,000
        encode (avg 84 b): cborg @ 202,189 op/s / borc @ 113,858 op/s = 177.6 %
        decode (avg 84 b): cborg @ 214,355 op/s / borc @ 201,394 op/s = 106.4 %
rnd-nofloat-300 @ 1,000
        encode (avg 117 b): cborg @ 132,985 op/s / borc @ 85,362 op/s = 155.8 %
        decode (avg 117 b): cborg @ 158,400 op/s / borc @ 151,470 op/s = 104.6 %
rnd-nostr-300 @ 1,000
        encode (avg 93 b): cborg @ 63,444 op/s / borc @ 38,413 op/s = 165.2 %
        decode (avg 93 b): cborg @ 71,598 op/s / borc @ 72,840 op/s = 98.3 %
rnd-nostrbyts-300 @ 1,000
        encode (avg 99 b): cborg @ 109,446 op/s / borc @ 55,426 op/s = 197.5 %
        decode (avg 99 b): cborg @ 111,956 op/s / borc @ 117,498 op/s = 95.3 %
rnd-1000 @ 1,000
        encode (avg 296 b): cborg @ 105,936 op/s / borc @ 54,780 op/s = 193.4 %
        decode (avg 296 b): cborg @ 110,225 op/s / borc @ 116,299 op/s = 94.8 %
rnd-2000 @ 1,000
        encode (avg 572 b): cborg @ 99,403 op/s / borc @ 41,749 op/s = 238.1 %
        decode (avg 572 b): cborg @ 92,907 op/s / borc @ 93,719 op/s = 99.1 %
rnd-fil-100 @ 1,000
        encode (avg 39 b): cborg @ 237,360 op/s / borc @ 156,106 op/s = 152.1 %
        decode (avg 39 b): cborg @ 274,942 op/s / borc @ 258,343 op/s = 106.4 %
rnd-fil-300 @ 1,000
        encode (avg 91 b): cborg @ 187,871 op/s / borc @ 106,255 op/s = 176.8 %
        decode (avg 91 b): cborg @ 199,600 op/s / borc @ 186,876 op/s = 106.8 %
rnd-fil-500 @ 1,000
        encode (avg 144 b): cborg @ 169,976 op/s / borc @ 92,259 op/s = 184.2 %
        decode (avg 144 b): cborg @ 163,685 op/s / borc @ 159,083 op/s = 102.9 %
rnd-fil-1000 @ 1,000
        encode (avg 251 b): cborg @ 165,503 op/s / borc @ 75,471 op/s = 219.3 %
        decode (avg 251 b): cborg @ 140,718 op/s / borc @ 135,864 op/s = 103.6 %
rnd-fil-2000 @ 1,000
        encode (avg 459 b): cborg @ 157,214 op/s / borc @ 57,143 op/s = 275.1 %
        decode (avg 459 b): cborg @ 111,776 op/s / borc @ 109,453 op/s = 102.1 %

So you can see the wide variability there. I really wanted decode to be much faster than that, but it's difficult to squeeze out any more without improvements in JS TypedArray performance. There are some possibilities for flattening the decode process a little further, but they would cost in other ways, so I'm avoiding that for now.
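
For reference, the harness behind those numbers is roughly this shape (simplified, and assuming ipld-garbage's garbage(size) call; the real runs also vary the weights to produce the nomap / nofloat / nostr variants and compare against the borc build):

```js
import { garbage } from 'ipld-garbage'
import * as dagCbor from '@ipld/dag-cbor' // this branch, backed by cborg

// Generate `count` random IPLD-style objects around `targetSize` bytes and time
// a full encode pass and a full decode pass over them.
function bench (name, count, targetSize) {
  const objects = []
  for (let i = 0; i < count; i++) objects.push(garbage(targetSize))

  let start = process.hrtime.bigint()
  const encoded = objects.map((obj) => dagCbor.encode(obj))
  const encodeNs = process.hrtime.bigint() - start

  start = process.hrtime.bigint()
  for (const bytes of encoded) dagCbor.decode(bytes)
  const decodeNs = process.hrtime.bigint() - start

  const opsPerSec = (ns) => Math.round(count / (Number(ns) / 1e9))
  console.log(`${name}: encode ${opsPerSec(encodeNs)} op/s, decode ${opsPerSec(decodeNs)} op/s`)
}

bench('rnd-300', 1000, 300)
```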

We also have real-world data to test against. I have a 3.5GB CAR file containing all of the blocks that form a single Filecoin headers graph. With it I can compare streaming read/decode times for the borc and cborg versions of @ipld/dag-cbor, as well as total times for a read/decode + encode/write round trip. There's a lot of I/O, async and CAR-file overhead in these numbers, but the size of the file helps drown that out.
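
The read-only pass is roughly this shape (a simplified sketch; the real run compares both codec builds and also exercises the encode/write path):

```js
import fs from 'fs'
import { CarBlockIterator } from '@ipld/car'
import * as dagCbor from '@ipld/dag-cbor'

// Stream a CAR file, decode every dag-cbor block, discard the result, report time.
async function readOnly (path) {
  const start = Date.now()
  let blocks = 0
  const car = await CarBlockIterator.fromIterable(fs.createReadStream(path))
  for await (const { cid, bytes } of car) {
    if (cid.code === dagCbor.code) dagCbor.decode(bytes)
    blocks++
  }
  console.log(`decoded ${blocks} blocks in ${(Date.now() - start) / 1000} s`)
}

readOnly('filecoin-headers.car').catch(console.error) // hypothetical file name
```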

On my machine I get numbers like this:

read-only
borc took 698.5 seconds
cborg took 494.9 seconds
difference: 70.8% / 1.41 x

read/write
borc took 4,297.6 seconds
cborg took 894.5 seconds
difference: 20.8% / 4.8 x

They produce the same output CAR file, which helps verify consistent encoding. Unfortunately it's not the same CAR file as the input! That's because Filecoin encodes non-UTF8 data into CBOR strings, which JavaScript can't deal with. But that's a separate problem to be solved; I have some thoughts about optionally reaching in behind the decoder when you know the block you're decoding has invalid UTF8 strings, so you can at least access the data.

rvagg commented 3 years ago

I've added a commit that disallows NaN, Infinity and -Infinity; this is pending acceptance of https://github.com/ipld/specs/pull/344
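
On the encode side this ends up as a cborg type encoder, roughly like this (a sketch; the actual commit also routes CIDs and undefined through the same typeEncoders mechanism):

```js
// Encode options handed to cborg: always use 64-bit floats per the dag-cbor spec,
// and reject the IEEE754 specials before they reach the encoder.
const encodeOptions = {
  float64: true,
  typeEncoders: {
    number: (num) => {
      if (Number.isNaN(num) || num === Infinity || num === -Infinity) {
        throw new Error(`${num} is not supported by the IPLD Data Model and cannot be encoded`)
      }
      return null // fall through to cborg's default number encoding
    }
  }
}
```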