kanidm / webauthn-rs

An implementation of webauthn components for Rustlang servers
Mozilla Public License 2.0

Bincode serialization #352

Closed smessmer closed 4 months ago

smessmer commented 11 months ago

I did this

I tried to serialize and then deserialize a passkey using bincode

I expected the following

Getting the original passkey back

What actually happened

A runtime error from bincode, DeserializeAnyNotSupported

Version (and git commit)

0.5.0-dev, 840a6f5cc7f65bd3c585ab00293addf76f70d973

Operating System / Version

Linux

Any other comments

This happens because Base64UrlSafeData uses Deserializer::deserialize_any in its Deserialize implementation, which bincode doesn't support. I see that the serializer uses serialize_str. Is it possible to just use deserialize_str instead of deserialize_any? Or, maybe even better, use {serialize,deserialize}_bytes?

Bincode is an important serialization format because it has the lowest memory footprint of all formats I know. If you want to store a lot of passkeys for a large user base, the memory footprint matters.

This issue applies not just to bincode, but to all serialization formats that don't serialize type information, e.g. postcard and others. deserialize_any only works for self-describing serialization formats that store the type; see the documentation. That doesn't seem necessary for something like Base64UrlSafeData?
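
To illustrate the difference, here's a minimal sketch of a hand-written Deserialize impl (the wrapper type is a made-up stand-in for Base64UrlSafeData, not the actual code in the library):

```rust
use serde::de::{self, Deserializer, Visitor};
use serde::Deserialize;
use std::fmt;

// Hypothetical stand-in for Base64UrlSafeData, just to show the serde calls.
struct B64Bytes(Vec<u8>);

impl<'de> Deserialize<'de> for B64Bytes {
    fn deserialize<D: Deserializer<'de>>(d: D) -> Result<Self, D::Error> {
        struct V;
        impl<'de> Visitor<'de> for V {
            type Value = B64Bytes;
            fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
                f.write_str("a base64url-encoded string")
            }
            fn visit_str<E: de::Error>(self, s: &str) -> Result<B64Bytes, E> {
                // real code would base64url-decode `s` here
                Ok(B64Bytes(s.as_bytes().to_vec()))
            }
        }
        // `d.deserialize_any(V)` asks the format what type comes next; bincode
        // and postcard can't answer that and fail with DeserializeAnyNotSupported.
        // Giving the format an explicit hint works everywhere a string does:
        d.deserialize_str(V)
    }
}
```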

smessmer commented 11 months ago

Or actually, Base64UrlSafeData might not even be the right type here. Concretely, it is currently used to store key data in COSEKey. Base64 is necessary for sending data to/from the client but Passkeys usually stay server side only, so just encoding the key as a plain Vec<u8> might be better.

smessmer commented 11 months ago

Ok, it's not only in the key, but also in the credential ID object. I get that the credential ID sometimes needs to be encoded as base64 (e.g. when sent to the client), but I don't think it should be when stored on the server. Besides being incompatible with low-memory-footprint libraries like bincode, it also introduces storage overhead from the base64 encoding.

Firstyear commented 11 months ago

We store them as base64 on the server side, so that they are more compact in json dumps. We could probably change it from deserialize_any though to improve this - I think we were doing this to handle both vec and string inputs.

So I think this is an area we can improve base64urlsafe so that it works with other formats outside of cbor/json.

smessmer commented 11 months ago

Yes, I agree that when sending things to the client, serializing as json and wrapping things in base64 makes sense. It does introduce a lot of overhead, though, if you want to store things on the server. On our side, I was able to reduce the memory footprint of a passkey by 2.6x (from 370 bytes to 140 bytes per passkey) by copying the implementation of the Credential class (and all the classes it depends on) into our own codebase and replacing Base64UrlSafeData with Vec<u8> everywhere. A 2.6x reduction in storage cost is huge for us. Part of the savings comes from not using base64, and part comes from being able to use bincode (we used cbor before that).

We were able to do this for Passkey instances because 0.5.0-dev allows access to the internals of Credential, so we could create our own version of it and copy data back and forth between our instances and the instances in the webauthn-rs library. However, for PasskeyAuthentication and PasskeyRegistration, the library doesn't allow access to the internals, so we wouldn't be able to do it for those unless we forked the whole webauthn-rs library into our code base and changed it. Those objects are also stored on the server side, and their serialized size matters to us as well. Also, while we now have a solution for Passkey, having to maintain our own versions of the struct and its dependencies isn't great. It would be much better if we didn't have to do that.

Is there a way we can change the webauthn-rs library so it only uses base64 when necessary, e.g. when sending things to the client? Anything serialized on the server would be better off not using base64.
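
For reference, a rough sketch of the mirror-struct approach (the field names here are just illustrative, not the library's actual Credential layout):

```rust
use serde::{Deserialize, Serialize};

// Illustrative mirror of a credential, with raw bytes instead of
// Base64UrlSafeData. The real Credential/Passkey has more fields; with access
// to the Credential internals (0.5.0-dev), data can be copied between the
// library type and a struct like this.
#[derive(Serialize, Deserialize)]
struct StoredCredential {
    cred_id: Vec<u8>,
    public_key_cbor: Vec<u8>,
    counter: u32,
    backup_eligible: bool,
    backup_state: bool,
}

fn main() {
    let cred = StoredCredential {
        cred_id: vec![0u8; 32],
        public_key_cbor: vec![0u8; 77],
        counter: 3,
        backup_eligible: false,
        backup_state: false,
    };
    // bincode writes length-prefixed raw bytes: no base64, no field names.
    let blob = bincode::serialize(&cred).expect("bincode serialization");
    println!("{} bytes on the wire", blob.len());
}
```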

Firstyear commented 11 months ago

There actually is a way to access the internals: there is a feature called "danger passkey internals", I think. But I don't recommend it for obvious reasons. Messing with the credential internals can break things in amazing ways, and it may prevent future credential upgrades from working between versions.

I'll need to think about that some more though. For us the opposite is true - we have a number of applications that all store data in json, and for them base64 is a huge space saving over vec due to how json represents u8 arrays.

I'm also curious about what you are doing where you are so constrained on byte sizes / memory availability. Are you using webauthn-rs on some embedded system or similar?

Perhaps the way to approach this is a feature flag or similar in webauthn-rs-core that changes how we store the data or similar. I'm not sure how it'd work cleanly though, it feels like it could be an invasive change.

smessmer commented 11 months ago

We're using danger-credential-internals already for what I mentioned above, but it only gives access to the passkey internals, not to the internals of PasskeyAuthentication or PasskeyRegistration.

Let me check what I can share about our use case. For now, let's just assume that we have a huge number of users, need to store passkeys in replicated in-memory databases, and storage is costly. Json itself has a lot of overhead as well and our database is not json based but can store binary blobs.

micolous commented 11 months ago

Base64UrlSafeData is essentially a serialiser annotation, and does not accurately describe its in-memory data format. It's still essentially a Vec<u8>, and you're not wasting much (or possibly any) memory at run-time there.

I believe your 2.6x comparison is misleading: you're comparing an optimally-packed, TLV-style format with field IDs against JSON and Base64. Using Base64 encoding in a similar TLV-style format should result in around a 1.33x size increase – and that's the number we should be discussing. 😄

For better or worse, JSON is the primary serialisation format of the library right now, and the difference in using Base64 there is stark: storing a Vec<u8> as an Array<Number> is on average 3.55 bytes per byte (assuming no whitespace), plus a constant 2 byte overhead ([]). By comparison, storing it as web-safe Base64 is about 1.33 bytes per byte (we don't use padding), plus a constant 2 byte overhead ("").
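
A quick way to see the difference, assuming the serde_json and base64 crates:

```rust
use base64::Engine;

fn main() {
    // 256 bytes covering every value, as a rough average case.
    let data: Vec<u8> = (0u8..=255).collect();

    // serde_json writes a Vec<u8> as an array of numbers: [0,1,2,...]
    let as_array = serde_json::to_string(&data).unwrap();

    // URL-safe base64 without padding, as used for the JSON representation.
    let as_b64 = base64::engine::general_purpose::URL_SAFE_NO_PAD.encode(&data);

    // Roughly 3.5x vs 1.33x the raw size, plus the constant []/"" overhead.
    println!("array: {} bytes, base64: {} bytes", as_array.len(), as_b64.len());
}
```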

Unfortunately, serde's API doesn't allow us to special case JSON (or any other format) for those fields, so we will need rather invasive modifications on our side to accommodate both, lest we cause a (best-case!) 2.7x storage cost increase for JSON.

The other challenge with allowing non-Base64 serialisation is going to be migration – all existing users are going to be using the Base64 format, and so we'll need to at least support both for deserialisation for some time (though, I suspect we already do to a degree). A user of the library would also need to have everything migrated to a "multi-format" reader before they can change over any writer.

I would posit that there are an extremely small set of organisations with credential databases large enough that any of this has a meaningful impact. While I agree it's not as efficient as it could be, it's sometimes cheaper for an organisation to accept an increased resource cost (in this case, storage and network) than paying (a) software engineer(s) to build a more efficient solution.

Longer term, I think we should replace Base64UrlSafeData with Vec<u8> and #[serde(with=...)] field annotations, because the naming is misleading, and makes all other code needlessly verbose. However, this won't actually save us a whole lot of memory or storage.
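
Something like the following, sketched with a hypothetical helper module (nothing like this exists in webauthn-rs today):

```rust
use base64::Engine;
use serde::{Deserialize, Deserializer, Serialize, Serializer};

// Hypothetical helper module for #[serde(with = ...)]; illustration only.
mod b64 {
    use super::*;

    pub fn serialize<S: Serializer>(v: &[u8], s: S) -> Result<S::Ok, S::Error> {
        s.serialize_str(&base64::engine::general_purpose::URL_SAFE_NO_PAD.encode(v))
    }

    pub fn deserialize<'de, D: Deserializer<'de>>(d: D) -> Result<Vec<u8>, D::Error> {
        let s = String::deserialize(d)?;
        base64::engine::general_purpose::URL_SAFE_NO_PAD
            .decode(s)
            .map_err(serde::de::Error::custom)
    }
}

#[derive(Serialize, Deserialize)]
struct CredentialLike {
    // Plain Vec<u8> in memory; base64 only exists at the serde boundary.
    #[serde(with = "b64")]
    cred_id: Vec<u8>,
    // The catch: fields like Option<Vec<u8>> or Vec<Vec<u8>> would each need
    // their own helper functions, which is where this approach gets verbose.
}
```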

smessmer commented 11 months ago

> Base64UrlSafeData is essentially a serialiser annotation, and does not accurately describe its in-memory data format. It's still essentially a Vec<u8>, and you're not wasting much (or possibly any) memory at run-time there.

Correct. My concern was around serialized at-rest storage (which we happen to use an in-memory database for). But we do need to serialize it into a database (including PasskeyRegistration and PasskeyAuthentication) instead of just keeping them in-memory in the server process, because a user might hit different computing nodes on our end, even during the same session.

> I believe your 2.6x comparison is misleading: you're comparing an optimally-packed, TLV-style format with field IDs against JSON and Base64. Using Base64 encoding in a similar TLV-style format should result in around a 1.33x size increase – and that's the number we should be discussing. 😄

The 2.6x is comparing the best of what is possible with today's implementation of the webauthn-rs library (which is cbor with base64) against what I was able to achieve by duplicating the data structures into our code base (which is bincode without base64). But you're correct that those are two independent optimizations.

> Unfortunately, serde's API doesn't allow us to special case JSON (or any other format) for those fields, so we will need rather invasive modifications on our side to accommodate both, lest we cause a (best-case!) 2.7x storage cost increase for JSON.

Yes this is the heart of the issue I believe.

> The other challenge with allowing non-Base64 serialisation is going to be migration – all existing users are going to be using the Base64 format, and so we'll need to at least support both for deserialisation for some time (though, I suspect we already do to a degree). A user of the library would also need to have everything migrated to a "multi-format" reader before they can change over any writer.

Also correct. We will be running into this issue as well since we already have a few beta users with existing passkeys using the old serialization, and we don't want to break the passkeys, PasskeyAuthentication or PasskeyRegistration objects we have already stored for them. Enabling a feature changing serialization behavior would be painful if there is no way to get the old behavior to read those existing ones.

> I would posit that there are an extremely small set of organisations with credential databases large enough that any of this has a meaningful impact. While I agree it's not as efficient as it could be, it's sometimes cheaper for an organisation to accept an increased resource cost (in this case, storage and network) than paying (a) software engineer(s) to build a more efficient solution.

True. We are one of those few organizations.

The PR attached by @Firstyear, doing it by using a feature, seems fragile. As @Firstyear correctly pointed out, features taint: if application A uses webauthn-rs and depends on its default serialization, and then a library B that A depends on enables the feature on webauthn-rs, things break for A in unexpected ways. Or maybe they work today, but B later adds the feature and breaks things for A. It doesn't sound great. This is why features are usually meant to be purely additive, not changes to existing behavior.

Is it possible to instead have it in the API somehow? e.g. wrap Passkey into a SerializeAsBinary newtype, and that newtype changes how things are serialized? Maybe it doesn't use serde then but has a custom serialization implementation. Or it does something similar to what we're doing today where it .into()'s the whole Passkey struct into a duplicate struct that uses Vec<u8> instead of Base64UrlSafeData everywhere, and then it serializes that duplicate struct with serde. Using such an approach, applications could specify in their code how to serialize without having to rely on features for behavior changes. See here for some code on how we're doing it

smessmer commented 11 months ago

Hm another potential approach could be this. We could use Serializer::is_human_readable so that it serializes using base64 to json but binary to bincode.
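
A minimal sketch of the idea, using a made-up wrapper type:

```rust
use base64::Engine;
use serde::{Serialize, Serializer};

// Hypothetical wrapper just to illustrate the idea.
struct BinaryData(Vec<u8>);

impl Serialize for BinaryData {
    fn serialize<S: Serializer>(&self, s: S) -> Result<S::Ok, S::Error> {
        if s.is_human_readable() {
            // serde_json reports true: keep the compact base64 string there.
            let b64 = base64::engine::general_purpose::URL_SAFE_NO_PAD.encode(&self.0);
            s.serialize_str(&b64)
        } else {
            // bincode reports false: write raw, length-prefixed bytes.
            s.serialize_bytes(&self.0)
        }
    }
}
```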

Firstyear commented 11 months ago

@smessmer @micolous and I had a big talk about it yesterday evening and we do agree we should move away from base64urlsafe throughout the codebase, but it will take some time to achieve. There are lots of possible ways to make this happen, but as we mentioned, the priority for us is making sure we can upgrade for users and making sure that we do this in a really reliable and safe way. I think he is working on a few possible ideas today that could help here.

micolous commented 11 months ago

> But we do need to serialize it into a database (including PasskeyRegistration and PasskeyAuthentication) instead of just keeping them in-memory in the server process, because a user might hit different computing nodes on our end, even during the same session.

There are some tricks which can be done here on a routing level, like pinning that authentication "journey" to a single serving task. While this makes things less reliable if a task goes away (as the challenge would be unusable by other tasks), it avoids the need to serialise quite so much data during an authentication – but all of this is a trade-off.

Once the authentication is done, I presume the Passkey stuff doesn't matter at all – you've issued a session cookie, and the Webauthn stuff is no longer in the serving path of all authenticated requests.

> The 2.6x is comparing the best of what is possible with today's implementation of the webauthn-rs library (which is cbor with base64) against what I was able to achieve by duplicating the data structures into our code base (which is bincode without base64). But you're correct that those are two independent optimizations.

Note that serde's default serialization mode for CBOR for structs is to store field names as strings, not numeric IDs.

webauthn-authenticator-rs has to do a lot of this messing about too, because the CTAP-level authenticator communication uses numeric field IDs, and it has to push those into webauthn-rs-* structures.

> True. We are one of those few organizations.

Doing some back-of-the-envelope maths here: assume you have 10^10 Credential records; the bincode + raw bytes serialisation (140 bytes each) comes out at 1.27 TiB for that dataset (excluding indexes etc.).

With a 230 byte increase per record (using your comparison of CBOR-with-string-field-names + Base64 vs. bincode + raw bytes), that makes the dataset 2.09 TiB larger.

I suspect the actual difference with bincode + base64 is around 50 bytes more per record (rounded up to allow for an increase in a length field's integer size), which makes the dataset 0.45 TiB larger.
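
For reference, the arithmetic behind those figures (assuming 10^10 records and the per-record sizes above):

```rust
fn main() {
    let records = 1e10_f64;
    let tib = (1u64 << 40) as f64;

    println!("bincode + raw bytes, 140 B/record:    {:.2} TiB", records * 140.0 / tib); // ~1.27
    println!("extra for CBOR + Base64, 230 B/rec:   {:.2} TiB", records * 230.0 / tib); // ~2.09
    println!("extra for bincode + Base64, 50 B/rec: {:.2} TiB", records * 50.0 / tib);  // ~0.45
}
```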

While registrations are low-churn, authentications can update the counter and/or backup state flags. Within the limits of the serde API, I don't think there are any fancy things we could do to avoid entirely rewriting that serialised record, unless we allow clients to consider the Credential serialisation format as not opaque.

This isn't the true amount of disk and network usage because of redundancy and replication requirements, but at the scale of 10^10 records I don't think it makes a whole lot of difference when the total size is still less than 5 TiB. 😄

> The PR attached by @Firstyear, doing it by using a feature, seems fragile.

Yup, I agree.

Yesterday (before knowing about is_human_readable), I tried switching over to #[serde(with = ...)], but it doesn't handle things like Vec<T> or Option<T> for you – you need to provide separate serialize and deserialize functions for all of that. So as much as I'd like to move away from the Base64UrlSafeData type, it would make some of the implementation more complicated.

> Hm another potential approach could be this. We could use Serializer::is_human_readable so that it serializes using base64 to json but binary to bincode.

That API looks like exactly what we want. It still has change-over risks, but there may be ways to manage that.

We could also make Base64UrlSafeData (or a replacement which uses is_human_readable) more transparent, like providing ::new methods to other structs which take generic parameters (impl Into<Base64UrlSafeData>), and more things that make it look like Vec<u8>. That should at least alleviate some of the complexity inside webauthn-rs itself.

Firstyear commented 11 months ago

> The PR attached by @Firstyear, doing it by using a feature, seems fragile.

> Yup, I agree.

But at least now we know what not to do :)

> That API looks like exactly what we want. It still has change-over risks, but there may be ways to manage that.

As mentioned externally - I think we can "fix" base64urlsafe data to accept deserialising both vec and base64 so that "any representation" works, and then for serialisation we choose based on the is_human_readable flag. That way we maintain compatibility with anything that already exists, but new or re-serialised credentials will get the correct output for that type.
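
Roughly like this – a sketch only, with a made-up wrapper type rather than the library's real code:

```rust
use base64::Engine;
use serde::de::{self, Deserializer, SeqAccess, Visitor};
use serde::Deserialize;
use std::fmt;

// Hypothetical wrapper, illustration only.
struct BinaryData(Vec<u8>);

impl<'de> Deserialize<'de> for BinaryData {
    fn deserialize<D: Deserializer<'de>>(d: D) -> Result<Self, D::Error> {
        struct V;
        impl<'de> Visitor<'de> for V {
            type Value = BinaryData;
            fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
                f.write_str("a base64url string, raw bytes, or a sequence of integers")
            }
            // Existing JSON data: a base64url string.
            fn visit_str<E: de::Error>(self, s: &str) -> Result<BinaryData, E> {
                base64::engine::general_purpose::URL_SAFE_NO_PAD
                    .decode(s)
                    .map(BinaryData)
                    .map_err(E::custom)
            }
            // Formats that emit byte strings (e.g. CBOR bytes types).
            fn visit_bytes<E: de::Error>(self, b: &[u8]) -> Result<BinaryData, E> {
                Ok(BinaryData(b.to_vec()))
            }
            // Older data written out as an array of integers.
            fn visit_seq<A: SeqAccess<'de>>(self, mut seq: A) -> Result<BinaryData, A::Error> {
                let mut out = Vec::new();
                while let Some(byte) = seq.next_element::<u8>()? {
                    out.push(byte);
                }
                Ok(BinaryData(out))
            }
        }
        // Accepting "any representation" needs deserialize_any, which only
        // works for self-describing formats (JSON, CBOR) – the bincode
        // limitation discussed further down.
        d.deserialize_any(V)
    }
}
```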

Separately, we likely should have base64urlsafe "removed" from places around the codebase that aren't serialised. We also could consider renaming base64urlsafe to "SerialisableBinaryData" to better show that it's just a wrapper for serialising/deserialising.

All of this probably needs to be done in some smaller steps to make sure we don't muck anything up in the process.

micolous commented 11 months ago

https://github.com/kanidm/webauthn-rs/pull/354 is a first step towards making this better from a serialisation perspective – it'll use Base64 encoding when is_human_readable() == true, or bytes otherwise.

Deserialisation will still accept multiple input types, but this still relies on using a self-describing format, so it still won't help for formats like bincode.

micolous commented 11 months ago

Also, it looks like Base64UrlSafeData didn't accept bytes types, only a sequence of integers – so this will need a staged migration for non-JSON formats.

smessmer commented 11 months ago

@micolous Why do you believe this won't work with bincode? Afaik, is_human_readable is a property of the serialization crate you're using, i.e. serde_json would always return true and bincode always false. It shouldn't be something read from the serialized data itself, so it shouldn't require the format to be self-describing?

micolous commented 11 months ago

> @micolous Why do you believe this won't work with bincode? Afaik, is_human_readable is a property of the serialization crate you're using, i.e. serde_json would always return true and bincode always false. It shouldn't be something read from the serialized data itself, so it shouldn't require the format to be self-describing?

The problem is deserialisation: if we never attempted string/Base64 deserialisation for is_human_readable == false, then there would be no migration path for existing data stored as a Base64-encoded string.

smessmer commented 11 months ago

Oh I see. For our use case, we have a version number in the serialization format, so we could tell the library which format to deserialize. Other library users would likely know as well whether they're deserializing an old-format or new-format passkey, but it would mean they have to go through an explicit migration process, which we're willing to do but not every library user might be happy about. Also, I'm not sure what the best way would be to pass that configuration to the deserializer.

smessmer commented 11 months ago

There's potentially an additional use case to consider that came up in one of my side projects (a much smaller project just for fun, not what I talked about above). There, I'm experimenting with the leptos web framework, which allows building full-stack apps using Rust on both the server and the client side with WASM. In WASM, I tried to serialize the webauthn-rs challenge objects with serde_json directly into the JavaScript-side WebAuthn browser API, and then deserialize the client responses back into webauthn-rs objects. It mostly worked, but the base64 serialization broke it because the browser expected an ArrayBuffer with the challenge data instead of a base64 string.

micolous commented 11 months ago

Compatibility is one of the considerations when changing over to HumanBinaryData – before #354, Base64UrlSafeData couldn't deserialise CBOR bytes types, which HumanBinaryData uses.

I feel like versioning is something webauthn-rs should do itself, and also move towards an opaque binary serialisation format. That will cause some short-term pain for everyone to migrate their existing data storage away from whatever they're using, but it'll mean that we could throw a fairly-compact binary format at library users and they don't have to care about what their (de)serialiser can and can't do – if it can store a Vec<u8>, then that's fine.
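
Purely as an illustration of the idea (no such API exists today), a version-tagged envelope could look something like:

```rust
use serde::{Deserialize, Serialize};

// Illustrative: an opaque, versioned envelope the library could hand to
// callers, so they only ever need to store a blob.
#[derive(Serialize, Deserialize)]
enum StoredPasskey {
    // Existing base64/JSON representation, kept so old records still load.
    V1(String),
    // A future compact binary representation.
    V2(Vec<u8>),
}
```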

As for browser APIs, from memory, some of the types are different because the browser API specs don't agree with the CTAP specs. But that sounds like a separate issue that'd need further investigation.