quininer / cbor4ii

CBOR: Concise Binary Object Representation
MIT License
54 stars 5 forks source link

Support for RFC-7049 Canonical CBOR key ordering #13

Closed vmx closed 2 years ago

vmx commented 2 years ago

This library explicitly specifies RF-8949, so this request might be out of scope.

In the project I want to use cbor4ii for I'm stuck with RFC-7049 Canonical CBOR key ordering. This means that keys are sorted by their length first. I wonder if that could perhaps be added behind a feature flag. Here is an implementation the seems to work. I didn't create a PR as this clearly needs more discussion first.

    fn collect_map<K, V, I>(self, iter: I) -> Result<(), Self::Error>
    where
        K: ser::Serialize,
        V: ser::Serialize,
        I: IntoIterator<Item = (K, V)>,
    {
        #[cfg(not(feature = "use_std"))]
        use crate::alloc::vec::Vec;
        use serde::ser::SerializeMap;

        // TODO vmx 2022-04-04: This could perhaps be upstreamed, or the
        // `cbor4ii::serde::buf_writer::BufWriter` could be made public.
        impl enc::Write for Vec<u8> {
            type Error = crate::alloc::collections::TryReserveError;

            fn push(&mut self, input: &[u8]) -> Result<(), Self::Error> {
                self.try_reserve(input.len())?;
                self.extend_from_slice(input);
                Ok(())
            }
        }

        // CBOR RFC-7049 specifies a canonical sort order, where keys are sorted by length first.
        // This was later revised with RFC-8949, but we need to stick to the original order to stay
        // compatible with existing data.
        // We first serialize each map entry into a buffer and then sort those buffers. Byte-wise
        // comparison gives us the right order as keys in DAG-CBOR are always strings and prefixed
        // with the length. Once sorted they are written to the actual output.
        let mut buffer: Vec<u8> = Vec::new();
        let mut mem_serializer = Serializer::new(&mut buffer);
        let mut serializer = Collect {
            bounded: true,
            ser: &mut mem_serializer,
        };
        let mut entries = Vec::new();
        for (key, value) in iter {
            serializer.serialize_entry(&key, &value)
               .map_err(|_| enc::Error::Msg("Map entry cannot be serialized.".into()))?;
            entries.push(serializer.ser.writer.clone());
            serializer.ser.writer.clear();
        }

        TypeNum::new(major::MAP << 5, entries.len() as u64).encode(&mut self.writer)?;
        entries.sort_unstable();
        for entry in entries {
            self.writer.push(&entry)?;
        }

        Ok(())
    }

I'd also like to note that I need even more changes for my use case (it's a subset of CBOR), for which I will need to fork this library. Nonetheless I think it would be a useful addition and I'd also prefer if the fork would be as minimal as possible. I thought I bring it up, to make clear that it won't be a showstopper if this change wouldn't be accepted.

quininer commented 2 years ago

I think that's a bit beyond the scope of this crate.

Can this be done outside of this crate? we could have a serde-cbor4ii-canonical crate that wraps the Serializer and overrides the collect_map method.

like

struct CanonicalSerializer<W>(Cbor4iiSerializer<W>);

impl<W: Write> Serializer for &mut CanonicalSerializer<W> {
    type SerializeSeq = <Cbor4iiSerializer<W> as Serializer>::SerializeSeq;

    // forward to Cbor4iiSerializer

    fn collect_map<K, V, I>(self, iter: I) -> Result<(), Self::Error> {
        // your code
    }
}
vmx commented 2 years ago

I think that's a bit beyond the scope of this crate.

That's fine. But it brings me to the next point (let me know if I should open a separate issue). As you can see on that implementation, I need access to some things (like TypeNum) which isn't public yet. What I think would be great if cbor4ii could be split into two crates, one core one and one for Serde (similar to what ciborium is doing. This way I could depend on the core crate, but create my own custom Serde implementation. That would help me a lot.

quininer commented 2 years ago

cbor4ii exposes MapStartBounded, you don't have to access TypeNum, You can notice that TypeNum is also not accessed in the serde mod. also if you don't turn on the serde1 feature then it's a core crate and it doesn't need to split out another crate.

The thing I can think of that prevents CanonicalSerializer from being impl externally is that Serializer might need to expose an as_mut to allow CanonicalSerializer to bypass the serde api for writes.

vmx commented 2 years ago

cbor4ii exposes MapStartBounded, you don't have to access TypeNum

Thanks! I missed that one, that is exactly what I needed.

also if you don't turn on the serde1 feature then it's a core crate and it doesn't need to split out another crate.

Oh right, I should've thought of this, I'll just do that.

The thing I can think of that prevents CanonicalSerializer from being impl externally is that Serializer might need to expose an as_mut to allow CanonicalSerializer to bypass the serde api for writes.

As I mentioned, I need a few other changes that are not really general purpose (like this sort order), hence don't make sense to be upstreamed/published in a generic way. So I thought I just fork the whole Serde part and copy it over. Though the idea of wrapping the serializer sounds interesting. I'll have a look and come back to you in case I need additional things exposed.


One more thing in regards to the original issue. I find it quite useful if you're able to serialize things into memory. In the above code I've implemented impl enc::Write for Vec<u8>. I've also seen that some tests of this library are using a BufWriter. It would be cool to have this kind of functionality (in whichever way) in the public API of the core library, so that it can be re-used.

quininer commented 2 years ago

I didn't initially think about how to expose them, so I chose not to. Currently you can implement enc::Write for your own types.

I would consider adding a mod that provides these helper type.

vmx commented 2 years ago

I would consider adding a mod that provides these helper type.

I find the impl enc::Write for Vec<u8> quite convenient to use. Should I create a PR implementing this? Or should I create a PR exposing BufWriter or should I just wait?

quininer commented 2 years ago

I would like to expose a BufWriter type, it would be nice if you were willing to open a PR.

vmx commented 2 years ago

Thanks a lot for all the help. I think I can now solely rely on the core features (without serde1) and do my own Serde implementation (I still want to experiment with wrapping yours, but that's not related to this issue.

For those who need this kind of collation, this is the version updated for cbor4ii 0.2.20:

fn collect_map<K, V, I>(self, iter: I) -> Result<(), Self::Error>
where
    K: ser::Serialize,
    V: ser::Serialize,
    I: IntoIterator<Item = (K, V)>,
{
    use serde::ser::SerializeMap;
    #[cfg(not(feature = "use_std"))]
    use crate::alloc::vec::Vec;
    use crate::core::utils::BufWriter;

    // CBOR RFC-7049 specifies a canonical sort order, where keys are sorted by length first.
    // This was later revised with RFC-8949, but we need to stick to the original order to stay
    // compatible with existing data.
    // We first serialize each map entry into a buffer and then sort those buffers. Byte-wise
    // comparison gives us the right order as keys in DAG-CBOR are always strings and prefixed
    // with the length. Once sorted they are written to the actual output.
    let mut buffer = BufWriter::new(Vec::new());
    let mut mem_serializer = Serializer::new(&mut buffer);
    let mut serializer = Collect {
        bounded: true,
        ser: &mut mem_serializer,
    };
    let mut entries = Vec::new();
    for (key, value) in iter {
        serializer.serialize_entry(&key, &value)
           .map_err(|_| enc::Error::Msg("Map entry cannot be serialized.".into()))?;
        entries.push(serializer.ser.writer.buffer().to_vec());
        serializer.ser.writer.clear();
    }

    enc::MapStartBounded(entries.len()).encode(&mut self.writer)?;
    entries.sort_unstable();
    for entry in entries {
        self.writer.push(&entry)?;
    }

    Ok(())
}
vmx commented 2 years ago

I re-open this issue as I'm coming back to your proposal from https://github.com/quininer/cbor4ii/issues/13#issuecomment-1088934557, where I try to wrap your serializer.

It almost works. The problem is the serializer to serialize the map. In the code from the comment above, I use:

let mut serializer = Collect {
    bounded: true,
    ser: &mut mem_serializer.0,
};

But currently Collect isn't public (with good reason).

I then tried it with:

let mut serializer = Self::SerializeMap {
    bounded: true,
    ser: &mut mem_serializer.0,
};

But that errors with:

error[E0308]: mismatched types
   --> src/ser.rs:290:16
    |
78  | impl<'a, W: enc::Write> ser::Serializer for &'a mut Serializer<W> {
    |          - this type parameter
...
290 |           ser: &mut mem_serializer.0,
    |                ^^^^^^^^^^^^^^^^^^^^^ expected type parameter `W`, found `&mut cbor4ii::core::utils::BufWriter`
    |
    = note: expected mutable reference `&mut cbor4ii::serde::Serializer<W>`
               found mutable reference `&mut cbor4ii::serde::Serializer<&mut cbor4ii::core::utils::BufWriter>`

Next try (thanks to the help from a colleague) was:

let mut serializer = <&mut Serializer<_> as serde::Serializer>::SerializeMap {
    bounded: true,
    ser: &mut mem_serializer.0,
};

That would work, but sadly only on nightly due to https://github.com/rust-lang/rust/issues/86935.

So what to do now? Is making Collect public an option (it would also need to have constructor then)?

quininer commented 2 years ago

We should be able to implement a separate Collect for this with just a few lines of code.

or we just need to serialize these two objects into the same buffer, it doesn't necessarily need SerializeMap.

like

    let mut buffer = BufWriter::new(Vec::new());
    let mut entries = Vec::new();
    for (key, value) in iter {
        let mut mem_serializer = Serializer::new(&mut buffer);
        key.serialize(&mut mem_serializer)?;
        value.serialize(&mut mem_serializer)?;
        entries.push(buffer.buffer().to_vec());
        buffer.clear();
    }
vmx commented 2 years ago

or we just need to serialize these two objects into the same buffer, it doesn't necessarily need SerializeMap.

Wow, that's even better. I've been spending so much time on Serde, but I'm still not good at it. Thanks a lot this works! I just need to look into some error handling issues, but that should be easy.

vmx commented 2 years ago

Here comes the full source of the updated version:

fn collect_map<K, V, I>(self, iter: I) -> Result<(), Self::Error>
where
    K: ser::Serialize,
    V: ser::Serialize,
    I: IntoIterator<Item = (K, V)>,
{
    // CBOR RFC-7049 specifies a canonical sort order, where keys are sorted by length first.
    // This was later revised with RFC-8949, but we need to stick to the original order to stay
    // compatible with existing data.
    // We first serialize each map entry into a buffer and then sort those buffers. Byte-wise
    // comparison gives us the right order as keys in DAG-CBOR are always strings and prefixed
    // with the length. Once sorted they are written to the actual output.
    let mut buffer = BufWriter::new(Vec::new());
    let mut entries = Vec::new();
    for (key, value) in iter {
        let mut mem_serializer = Serializer::new(&mut buffer);
        key.serialize(&mut mem_serializer)
            .map_err(|_| enc::Error::Msg("Map key cannot be serialized.".into()))?;
        value
            .serialize(&mut mem_serializer)
            .map_err(|_| enc::Error::Msg("Map key cannot be serialized.".into()))?;
        entries.push(buffer.buffer().to_vec());
        buffer.clear();
    }

    enc::MapStartBounded(entries.len()).encode(&mut self.0.writer())?;
    entries.sort_unstable();
    for entry in entries {
        self.0.writer().push(&entry)?;
    }

    Ok(())
}

My Serializer looks like this:

pub struct Serializer<W>(cbor4ii::serde::Serializer<W>);

impl<W> Serializer<W> {
    pub fn new(writer: W) -> Serializer<W> {
        Serializer(Cbor4iiSerializer::new(writer))
    }
}