[Optimization] Replace `serde_json` with `speedy` for `ContractContext` and `ContractAnalysis` (de)serialization

serde_json is notoriously slow at serialization when compared to other popular binary serialization implementations.

The speedy crate is a very high-performance, light-weight and simple crate which according to these benches would greatly impact contract storage and retrieval if used on the large/complex ContractContext and ContractAnalysis type serialization.

NOTE: Depends on #4448 for binary storage, and is best done together with #4450 to further reduce on-disk size and #4442 for more efficient storage API's.

I suggest speedy based partly on benchmarking found here, partly on its ease-of-use, and partly on its activity & userbase. Since this is only node-local serialization it looks like a good choice -- it should not be used for consensus-critical serialization.

While speedy does make use of unsafe code, it is virtually impossible for a user to invoke any potential attacks as all input is: 1) Valid Clarity code, 2) The node's self-constructed ContractContext, or 3) The node's self-constructed ContractAnalysis.

The version should also be pinned as a precaution to ensure that no upgrades are made without testing version compatibility -- the risk otherwise being a node panic if deserializations would suddenly fail, however unlikely.

Replying to https://github.com/stacks-network/stacks-core/pull/4437#issuecomment-1969711639 here to keep the serialization discussion together with the ticket. @jbencin

There are some nice benchmarks of Rust serialization frameworks here. Both are similar in performance, and either one would be much faster than JSON, so the choice should depend on other aspects

Yeah, this is where I've looked as a source for encoders in Rust.

I specifically ruled out frameworks which are schema-based -- while that would be great for consensus-critical messaging, it's way too much overhead for node-local data, especially given the polymorphic nature of TypeSignatures, SymbolicExpressions and Values, all which are heavily used and can recurse within eachother.

Realistically I would probably choose MessagePack over anything else for a good balance between speed and rigidity, largely because I have very positive experiences with it in a production setting. Another contender is CBOR, which is largely based on MessagePack, but the encoders I tried were closer to JSON speeds when benchmarked with the above types. BSON is another widely-used format, but it's basically just a binary form of JSON and is neither known for its speed nor compactness.

it doesn't look like there was really any thought into forwards-/backwards-compatibility in the existing [Stacks] code, but then again field names are serialized

By this I was referring more to the fact that business objects are being directly serialized instead of storage-specific models, which is dangerous for the exact reasons you've pointed out.

On the other hand, many of these fast binary formats get their speed by omitting the extra metadata. Instead, they make the assumption that the application knows exactly how the data should be structured. This limits the validation you can do, and the portability between languages, CPU architectures, and even different versions of the application. Sometimes, the trade-off worth it, depending on the use-case.

For this particular case we're dealing with node-local data, i.e. data the node itself generates during the processing of blocks, so short of the Hiro Archive, portability isn't a large issue. I would personally rather add CLI commands to import/export chainstate in a well-defined format (where performance isn't an issue) which could work across all future node implementations -- i.e. a node written in C# or Java would most likely choose the encoding, backing database and data structures that works best within their ecosystems for local storage.

Is there any chance of silent data corruption, like blindly memcpying binary data into the wrong variables?

Speedy actually had a pretty OK safeguard solution for this with their #[speedy(constant_prefix = $expr)] attribute, where we could assign each field a short identifier, and deserialization will fail with an "expected constant xxx" error if the field it's trying to deserialize into's prefix doesn't match the value's prefix. But frankly, this is a potential risk in any sort of encoding, even hand-written, so the best thing we can do is ensure good test coverage+fuzzing regardless which solution we end up with.

In the event of an explicit failure, do we get meaningful error messages, like which struct/field it failed at?

If using the above attribute, the "expected constant xxx" gives you a hint of which field has failed.

Would this be more brittle than the current JSON encoding?

Absolutely -- but that will be the story with any performant binary encoding. Ensuring good test coverage of (de)serialization will be important to ensure no changes accidentally break the encoding/decoding (this is where fake is great as it makes it easy to do lightweight fuzzing in unit tests).

And to be fair, even JSON has its limitations. It's just as easy to rename a field in one of the above structs without realizing it's used in serialization.

Would we have to rebuild the chainstate for changes that would be backwards compatible if we were using JSON?

The risk of needing to re-encode (migrate) values is much higher, yes -- but that also implies changes to the above structures. The entire chainstate would not have to be re-built, i.e. the node doesn't need to re-process blocks; it would be a data migration of specific values in the database which should take minutes, not hours. This is the price we'll pay for moving to a more performant binary encoding -- changes to these types are extremely rare, however; the last change was during 2.1's hard fork iirc.

Speedy actually had a pretty OK safeguard solution for this with their #[speedy(constant_prefix = $expr)] attribute

Would be really nice if they had a feature flag to automatically derive a prefix for every field

But frankly, this is a potential risk in any sort of encoding, even hand-written, so the best thing we can do is ensure good test coverage+fuzzing regardless which solution we end up with

The big risk with non-self-describing, schema-less binary serialization is silent corruption. If you re-order fields in a struct, or swap out one field with different one of the same size, it will load invalid data into those variables and not complain, because the overall size is still the same. Sure, you can manually version it, but it's reasonably likely you'll forget to increment the version number. Testing this also is difficult, because to cover every possible migration combination you'll need O(n^2) tests, where n is the number of format versions you've published

Realistically I would probably choose MessagePack over anything else for a good balance between speed and rigidity

It's worth seeing if either speedy or rkyv had any optional features or extensions that would make them reasonably safe from silent corruption. Otherwise, I'd say that MessagePack or CBOR would be better here. We'd get some gains in speed and disk usage, without sacrificing safety

And to be fair, even JSON has its limitations. It's just as easy to rename a field in one of the above structs without realizing it's used in serialization

But at least you get an easy-to-understand and easy-to-fix error when that happens. It's like the difference between using C and Rust. You'll still get bugs in Rust, but not the nasty, hard-to-debug ones like memory leaks, buffer overruns, uninitialized variables, typecasting errors, race conditions, etc.

I managed to get rkyv working, using the following "safety" features:

validation: Enables validation support through bytecheck.
archive_le: Forces archives into a little-endian format. This guarantees cross-endian compatibility optimized for little-endian architectures.
strict: Guarantees that types will have the same representations across platforms and compilations. This is already the case in practice, but this feature provides a guarantee along with C type compatibility.

(strict mode solves any concerns with the Hiro Archive :pray: )

And the results look as follows:

Plain serialization/deserialization of Contract vs ContractAnalysis

	`json`	`msgpack`	`speedy`	`rkyv`
`ast`	`284.71 us` (✅ 1.00x)	`194.76 us` (✅ 1.46x faster)	`79.68 us` (🚀 3.57x faster)	`35.14 us` (🚀 8.10x faster)
`analysis`	`52.33 us` (✅ 1.00x)	`33.21 us` (✅ 1.58x faster)	`17.62 us` (🚀 2.97x faster)	`9.84 us` (🚀 5.32x faster)

Simulated insert/select patterns for Contract vs ContractAnalysis (including compression)

	`next`	`optimized`
`insert`	`680.21 us` (✅ 1.00x)	`281.31 us` (🚀 2.42x faster)
`select`	`306.99 us` (✅ 1.00x)	`45.14 us` (🚀 6.80x faster)

I'm going to play around with it now and see how many ways I can break it and check the quality of error messages as well.

I'd be interested to see how it handles the following cases. Take the following struct:

struct Person {
    age: u32,
    name: String,
    address: String,
}

What happens if you serialize this struct, then try to deserialize after doing the following?

Swap order of name and address (same type)
Swap order of name and age (different type)
Change name of any field
Remove any field (should fail, do we get a useful error message?)
Add a new field (should fail, do we get a useful error message?)

I think these tests would address all of my concerns

Also a test to see how it handles data alignment. Serialize as this struct:

struct Person {
    age: u8,
    salary: u32,
}

And deserialize as the following:

struct Person {
    age: u32,
    salary: u32,
}

struct Person {
    age: u8,
    height: u8,
    salary: u32,
}

Does it fail? Does it load garbage data into age or height?

Another one, serialize this struct with non-zero values:

struct Person {
    age: u16,
    salary: u16,
}

And deserialize as:

struct Person {
    age: u32,
}

Does it deserialize both values to a single u32?

Curious to see if the wire format matches the #[repr(C)]/#[repr(rust)] or #[repr(packed)]layout in memory

Here's the results from the first batch of tests:

Swap order of `name` and `address` (same type)
- :x: This case results in a silent fail (i.e. corrupted data).
Swap order of `name` and `age` (different type)
- :x: This case results in a silent fail (i.e. corrupted data).
Change name of any field
- :ok: This case works.
Remove any field (should fail, do we get a useful error message?)
- :warning: Removing the first field age works, but it could also be a coincidence - would need to test this more thoroughly.
- :x: Removing the second field name results in a silent fail (i.e. corruption).
- :x: Removing the third field address results in a silent fail (i.e. corruption).
Add a new field (should fail, do we get a useful error message?)
- :ok: This case results in a proper "out of bounds" error.

Which indicates that we would need to be very strict about the changing of serialized types. One thing which does make that slightly easier to spot is the numerous attributes that rkyv needs on serialized types, so if there are any changes in types with these attributes it's a sign that extra testing is warranted.

I'll give the remaining tests a shot tomorrow :)

stacks-network / stacks-core

[Optimization] Replace `serde_json` with `speedy` for `ContractContext` and `ContractAnalysis` (de)serialization #4449

Replying to https://github.com/stacks-network/stacks-core/pull/4437#issuecomment-1969711639 here to keep the serialization discussion together with the ticket. @jbencin