nlohmann / json

JSON for Modern C++
https://json.nlohmann.me
MIT License

CBOR byte string support #1129

Closed daviddesberg closed 5 years ago

daviddesberg commented 6 years ago

Currently, the CBOR feature of the library only supports writing strings out as UTF-8 and doesn't support byte arrays as described in the CBOR standard. I'd like to implement COSE on top of this library, but in order to do that, byte string support is needed.

I'm happy to submit a pull request for this, but I wanted to check in advance if this is something you'd be open to

nlohmann commented 6 years ago

How would you store the data coming from CBOR?

nlohmann commented 6 years ago

Related: #757, #587.

daviddesberg commented 6 years ago

my thought was creation of some byte string type, leaving existing treatment of strings alone. i'll have time this week to show in more detail

nlohmann commented 6 years ago

So you mean the byte string's bytes should just be parsed and stored in a JSON string? What about encodings? Please also have a look at the issues I referenced above.

nyovaya commented 6 years ago

@nlohmann How about adding the suffix _BIN or _BINARY to the key string and the string then contains the byte string as a base64 converted string?

nlohmann commented 6 years ago

What about an array of binary strings? What about an array of values of which some are binary strings? How would roundtripping work?

nlohmann commented 6 years ago

Any news on this?

nlohmann commented 6 years ago

@Lusitanian @nifker Any input on this?

nyovaya commented 6 years ago

My approach would still be to use a prefix/suffix for the key, and base32768 or base64 for the data itself - but I don't know what performance base32768 has or how usable it is. An example would look like this:

BINARY_binarystring: "媒腻㐤┖ꈳ埳"

binarystring_BINARY: "XmFFzQtc"

The prefix/suffix is trimmed from the key string and the resulting string is used as key in CBOR. The value is just decoded from base64/base32768 back into a binary string and will be the value in CBOR.

nlohmann commented 6 years ago

I do not like this approach, as it relies on the convention to use and respect such prefixes/suffixes.

nyovaya commented 6 years ago

I can't imagine how else you would detect that.

nlohmann commented 6 years ago

There are two aspects:

  1. Parsing CBOR to JSON (deserialization). There is currently no support for binary strings. It could be added, e.g. by either implementing encoders for base64 and storing the result in a string or by extending the SAX interface to pass std::vector<uint8_t> to the caller so the user can decide how to interpret the received bytes. The first approach has the benefit of being close to CBOR. The second approach has the benefit of being very generic and shifting the decoding work to the user, allowing the library to stay slim in this point.
  2. Creating CBOR from JSON (serialization). We follow the JSON types right now and map all strings to CBOR's UTF-8 strings. In order to tell the serializer to encode specific strings differently, one could think of an additional parameter to the serializer that contains a mapping from JSON Pointers to desired CBOR types; that is, allowing to explicitly override the default types where desired. This would be a lot of work, but doable in principle.

It would be nice to hear more alternatives to these two aspects. But since the library needs to be maintained and support is very time-consuming, I am against an approach that relies on non-standard assumptions or conventions that need documentation and may often lead to surprises.

I'm setting this issue to discussion, and I hope to see some ideas. I am currently not planning to work on this topic, so pull requests are, as always, more than welcome! As an additional challenge, we should not forget the library also implements MessagePack and UBJSON, so ideas that also have those binary formats in mind are appreciated.

nlohmann commented 6 years ago

Any opinion on this?

alexkarpenko commented 6 years ago

How about adding a data type for binary blobs?

std::vector<uint8_t> myData{0,1,2,3,4,5,6};
json j = {
  {"binary", json::bin(myData.begin(), myData.end())}
};

auto v_msgpack = json::to_msgpack(j);
json j2 = json::from_msgpack(v_msgpack);

assert(j2["binary"].is_binary() == true);
auto dat = j2["binary"].get<std::string>(); // OK. Returns a string that holds the bytes (json::bin can just be a tagged struct with a std::string member)

auto str = j.dump(4); // base64 encodes binary fields
auto j3 = json::parse(str);
assert(j3["binary"].is_binary() == false); // binary data serialized to json will not survive round trip. instead it's now a base64 encoded string

nlohmann commented 6 years ago

Hm. Nice idea. Any opinions on this?

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

arrtchiu commented 5 years ago

As someone who wants to use MessagePack with binary blobs, I quite like @alexkarpenko's suggestion of a separate type. It's nice and explicit and doesn't rely on developers sticking to conventions.

One point I'm not completely sold on is automatically converting it to base64 string when serialising to JSON. I personally consider adding binary to an object and then trying to serialise it into JSON a developer error that should probably be an exception (if you want to use binary you should serialise to a format that supports it), but I don't hold that opinion too strongly either.

Another potential reason against base64-encoding is.. well.. does this library already have a base64 encoder? We probably don't want to add one just for that! :)

I don't have the time or expertise to complete this myself but I'm open to the possibility of contributing financially to see this done.

alexkarpenko commented 5 years ago

I don't feel strongly about what to do when serializing binary blobs to JSON. Throwing an exception seems reasonable to me.

mdenna-nviso commented 5 years ago

We also would really like to have CBOR byte string support to be able to import large binary BLOBs without additional overhead. The suggestion proposed by @alexkarpenko seems simple and clean and would be ideal for me. Regarding the export to JSON, an exception is the best way to go in my opinion. Binary blobs are not a native JSON type anyway, and this would avoid complicated workarounds...

ayounes-nviso commented 5 years ago

Is it possible to reopen this ticket please?

epvbergen commented 5 years ago

I think @alexkarpenko's approach is very good. A separate tag value is an elegant way to make the distinction between a binary string and a JSON string in the node, to steer the behaviour of serializers that preserve the distinction, and to throw an exception in serializers for formats that have no way to preserve the distinction in a round trip (e.g. JSON itself).

I would really like to see this feature and am willing to assist in testing, if needed. Right now I'm already using string nodes in a similar way, but I have to keep the flag elsewhere and I don't have the safety that would come from having it in the library.

I also agree with his suggestion to continue having the data in a std::string instead of overlaying a std::vector<char> in the union. The latter's interface offers little if anything over std::string, and std::basic_string<char> of course guarantees it can store any sequence of any value of 'char'.

(As everyone knows except people on StackOverflow, std::string does not make any assumptions about encodings. To std::string, an unknown number of glyphs represented as a concatenation of variable-length UTF-8 sequences is no more or less weird than the same glyphs represented as a series of JPEG images, unless you call c_str() or std::string(const char*), which are the only things that interpret NULs in the string, and to which data() and std::string(const char*, size_t) are perfectly good alternatives.)

If deemed necessary, an alternative for std::string as the underlying value for this tag could be to parameterize basic_json's with a type that has iterators to char, begin(), end() and insert(iterator, iterator, iterator) (if the latter is sufficient for the SAX parser). Then the user can choose between string or vector and the JSON node does not necessarily need to own the data, which can prevent an unnecessary extra copy when encoding to CBOR.

nlohmann commented 5 years ago

Let me write down some thoughts on this:

Conceptually, a basic_json value has two members: a type and a value. The type is this enum:

enum class value_t : std::uint8_t
{
    null,             ///< null value
    object,           ///< object (unordered set of name/value pairs)
    array,            ///< array (ordered collection of values)
    string,           ///< string value
    boolean,          ///< boolean value
    number_integer,   ///< number value (signed integer)
    number_unsigned,  ///< number value (unsigned integer)
    number_float,     ///< number value (floating-point)
    discarded         ///< discarded by the parser callback function
};

Adding binary here would be the least of all problems.

The value is a union:

union json_value
{
    /// object (stored with pointer to save storage)
    object_t* object;
    /// array (stored with pointer to save storage)
    array_t* array;
    /// string (stored with pointer to save storage)
    string_t* string;
    /// boolean
    boolean_t boolean;
    /// number (integer)
    number_integer_t number_integer;
    /// number (unsigned integer)
    number_unsigned_t number_unsigned;
    /// number (floating-point)
    number_float_t number_float;

    // (some constructors)
};

We could reuse string_t* string, but I think it would be better to add binary_t* binary to separate the actual string type from the string we want to use to store binary values. Furthermore, we would break some invariants if string != nullptr, but type != value_t::string.

Any thoughts on this?

ayounes-nviso commented 5 years ago

A new binary type makes a lot of sense semantically speaking.

IMHO using vector as the underlying type would be nice because that's what a lot of software stacks do (ours included :-) ), and this would minimize data copying between containers.

ayounes-nviso commented 5 years ago

BTW, in case of JSON export, adding the capability to optionally export the binary data as a hexadecimal string would be very nice for debugging purposes. But throwing an exception by default is fine as well.

epvbergen commented 5 years ago

@nlohmann Sounds excellent. Do you intend to have binary_t be a template parameter to basic_json? Then the vector, string, and smart pointer folks can all have it their way.

nlohmann commented 5 years ago

Yes. However, we need to find a common interface for the different types, like an iterator range or pointer+length.

grigorig commented 5 years ago

I also would like to have proper support for binary data. The separate data type sounds fine to me. I don't think it is a good idea to implicitly convert to base64 for JSON or anything the like, but it would be great to have some utility functions to do just that.

nlohmann commented 5 years ago

This issue should not just be about CBOR, but about the byte formats of all implemented binary formats.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mdenna-nviso commented 5 years ago

We're still very interested in this feature as discussed in the posts here above.

epvbergen commented 5 years ago

I am too, and @nlohmann's proposal from Mar. 17 would be great. As to the input type, either iterator + length or 2 iterators are fine by me.

nlohmann commented 5 years ago

I tried adding a binary_t type to json_value, but there is a lot of SFINAE stuff going on, and when using std::string for binary_t, compilation fails. This will need some more time which I do not have at the moment. PRs welcome.

OmnipotentEntity commented 5 years ago

JFYI, I am working on this. No promises. I haven't thought about how to tackle serialization to text JSON yet (extension? error?). I haven't gotten to the point where SFINAE chokes yet, but I'm sure that will be fun. You'll see a PR if I finish it.

OmnipotentEntity commented 5 years ago

I'm slowly making my way through this, and I had a question about one of the function prototypes.

virtual bool number_float(number_float_t val, const string_t& s) = 0;

All calling sites use "" as the string, and none of the implementations I could locate actually use the string for anything. I want to understand why this one is different just to ensure I don't make a basic mistake in my implementation for the binary type.

Is this for legacy reasons? If so, why is it not marked as deprecated?

nlohmann commented 5 years ago

It’s for the case when you actually parse JSON and want to pass the string representation of the floating-point number to the consumer of the SAX events. This makes no sense for binary formats, so we pass the empty string there.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mdenna-nviso commented 5 years ago

Hi, any update on this topic? @OmnipotentEntity, how is your implementation going? I'm afraid I don't have time at the moment to understand the internals of nlohmann/json, I could maybe give a hand if there are some very specific parts to implement..

OmnipotentEntity commented 5 years ago

Sorry, most of the last few weeks have been eaten by moving house. Once I get my computer set up again, I can upload what I have to my GitHub, if you want to take a crack at it. I'm still intending on finishing this, I'm just slow due to real life reasons.

mdenna-nviso commented 5 years ago

Great, thanks for feedback! No hurry on my side, when you have something I will be pleased to be a beta-tester.. :)

OmnipotentEntity commented 5 years ago

Hi, I'm back, I finally got my office set up. Please be patient with me while I begin a new job.

I had a question about design. How do you want to handle serialization of a binary value to text JSON? There are four possibilities to my mind:

  1. Refuse to serialize
  2. Serialize to array of numbers or base64 string or something.
  3. Extend JSON with a binary type
  4. Do any combination of the above and allow the user to choose.

These have the following pros and cons to my mind:

Refuse to serialize. Pros: Logically makes sense. There is no valid JSON serialization that is reversible. Cons: Now we have a new throwing path that wasn't there. It may seem surprising that we can serialize to binary formats but not text. And this library was originally about serializing JSON.

Serialize to array of numbers. Pros: This is what some binary formats will require anyway. A good compromise. Cons: Not reversible.

Serialize to base64 string. Pros: Same as previous. And can devise a tag of some kind to automatically change format back to binary. Cons: Small chance that normal data matches the tagging system with surprising consequences.

Extend JSON. Pros: Reversible and unambiguous. Allowed to accept non-JSON spec conforming data per RFC. Cons: This extension would need to be essentially invented (as I cannot find any literature on this topic), and no one else will support it in the short term, and very few at best will support it in the medium to long term. Not allowed to generate non-JSON spec conforming data per RFC.

All of the above. Pros: Obviously if we do everything the user desires nothing can possibly be bad, right? Cons: Lots of work to implement. Need to decide default behavior. Documentation might become confusing, and raises the required knowledge for usage of the library.

I'm reaching the point where I'd have to make a decision on this. And while discussion on this topic has occurred in this thread, no real consensus was reached, so I'm looking for direction.

znark commented 5 years ago

> I had a question about design. How do you want to handle serialization of a binary value to text JSON? There are four possibilities to my mind:
>
> 1. Refuse to serialize

Refuse to serialize (throw an exception?) by default, but add a hook where you can plug in your own, custom adapter which will produce JSON compatible datatypes out of the binary datatype.

Since not all binary data blobs are the same, it could be useful if the binary data container could, by itself, carry some meta information, such as a pointer to a specific callback function for converting that specific data blob to a JSON-compatible type, or a type hint as a string (relayed to the hook callback function as a parameter), or something to that effect.

Also, if the data is to be serialized as a named field in a JSON object, maybe the callback could be allowed to modify the name of the field in such cases. (Could be used e.g. for adding a suffix denoting the type as suggested by someone else above. Would allow for many different schemes.)

If you can think of a method of adding similar custom hooks for deserializing from plain JSON data to binary blobs (allowing round trips when serializing and deserializing binary data into a custom JSON-compatible representation), that would be most excellent as well, but maybe not as important at this stage.

znark commented 5 years ago

More food for thought: maybe you could use std::any for storing any kind of object in the in-memory JSON variant, and have a protocol (define conversion functions etc.) for converting that to the generic bin datatype when applicable, with a fallback for converting the type to a standard JSON representation when doing plain JSON serialization.

mdenna-nviso commented 5 years ago

Hi OE, welcome back! Hmmmm.... since the main focus of this library is JSON and binary data is not a native JSON data type, option 1 (exception) looks good enough IMHO at this stage. Users wanting to serialize a JSON tree containing binary nodes can just replace them with whatever encoding they like before calling the serialization method. Callback hooks are an elegant way to achieve the same automatically and without data duplication, but if this requires additional complexity in the library, maybe it can be postponed until there is a real need.

OmnipotentEntity commented 5 years ago

@znark

"such as a pointer to a specific callback function for converting that specific data blob to a JSON-compatible type"

We absolutely cannot store a raw pointer in serialized data. The pointer to that function isn't guaranteed to be the same from invocation to invocation on the same computer, let alone different computers with different operating systems and different processor architectures. A function identifier might be used, but as this is a library, not an application, and this would need to be something provided by the application anyway, I fail to see what we would gain exactly.

Python's builtin json type does have this functionality (callback hooks); however, their json type is also read-only, and Python is itself a weakly typed language, so a polymorphic return type will not cause extra boilerplate.

Putting the ability to store arbitrary types into json in this library would be quite a bit of work and require answering a bunch of difficult questions:

  1. Should we allow only POD types? (Obviously not, but then...)
  2. What if the type is not copyable?
  3. What if the type is only moveable?
  4. How do we keep from having the unknown type screw with the noexcept status of many different functions?
  5. How do we report multiple different unknown types back to the user? The user will not always know a priori the proper type of the item, it could be one of several, and C++ has bad type introspection.

@mdenna-nviso my only real reservation about this approach is that it's a new throwing path. Meaning that current users of the library may run into an issue where code that had been working perfectly now throws. Currently the serializer only throws in two cases, both related to invalid UTF-8 strings, so it's safe to use without a catch if you either do not have strings, or you know the strings are valid UTF-8 (such as if you generated them yourself.) It's probably not a huge deal, but I didn't see you address it.

nlohmann commented 5 years ago

For parsing, the first step could be to extend the SAX interface with a binary function that could take a std::vector<std::uint8_t> or std::string or anything that can hold arbitrary binary data. Then the binary parsers (CBOR, MessagePack, ...) need to be adjusted to call this binary function with the binary data. With a default implementation of binary to throw a parse error, this would not break existing implementations.

A user who wants to use binary data would then need to implement the SAX interface including the binary function and decide what to do with the data. Maybe it would be helpful to also pass input_format_t so you'd know more about the origin of the binary data.

For writing binary data, the actual JSON value would need to be extended. It would never be filled by the standard JSON parser, and we may decide to throw if binary data is tried to be dumped. Once this exists, the binary writers (CBOR, MessagePack, ...) would need to write the respective encodings and we would be ready to roundtrip. We could even use the mechanism for arbitrary type conversion to let users define their own mappings from types to binary formats.

But extending the JSON value is a lot of work, because it is used everywhere. It would not be impossible, just a long way...

What do you think?

OmnipotentEntity commented 5 years ago

I was in the middle of extending the JSON value itself to have a binary type. I'd say I'm about 40-50% complete or so. I'd prefer to do it this way rather than use an intermediate representation, because typically an application would want to parse binary data into a format external to JSON. (My particular use case is neural network weights, for instance.)

nlohmann commented 5 years ago

Cool! What does the extension look like? Do you use a templated type?

OmnipotentEntity commented 5 years ago

Yeah, default type is std::vector<std::uint8_t>.

jpetkau commented 5 years ago

Throwing when serializing binary to json seems like a painful choice; it means losing the ability to easily switch formats, if you want to write binary data in formats that support it.

As an alternate suggestion, how about picking a serialization format (e.g. base64) for binary to json conversion, doing nothing special on load (so it doesn't roundtrip), but making it easy to parse base64 to binary in user-defined serializers?

More specifically,

So far I think this is the same as agreed above and @OmnipotentEntity is implementing. Then:

So nlohmann::json objects wouldn't roundtrip through json text (but that's expected since json doesn't support binary). They would roundtrip through CBOR and msgpack, and user-defined types could easily be made to roundtrip through all formats, and there would be no heuristics anywhere; the only conversion would be base64-encoding binary for json, and that would only happen when explicitly requested from C++ code.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.