Request: binary serialization/deserialization

CyberBrainX commented 7 years ago

Wouldn’t it be useful to have the possibility to serialize the json object to a more compact form? This would make sense for storing the container or sending it over an internet connection.

The MessagePack (http://msgpack.org/) format or some proprietary format directly derived from your internal representation would be fine. Or could you probably just provide some generic interface for the binary serialization/deserialization?

I think this approach could be more useable than just gzipping the whole JSON data.

As an example this library did implement some alternative representations of a json container: http://greatpanic.com/progdocs/UniversalContainer.html (But this library doesn’t work properly and doesn’t use C++ 11)

nlohmann commented 7 years ago

I also had a look at MessagePack, and it seems to be a reasonable approach. As it can express more than JSON, I hope that users would understand that the library would not be able to parse arbitrary MessagePack input.

nlohmann commented 7 years ago

I started to implement MessagePack. Commit 543745a contains a basic serializer/deserializer for JSON values as static functions. It works with a std::vector<uint8_t> and currently has no diagnosis information in case of a parse error.

Feedback would be greatly appreciated:

naming of the functions
static function vs. member function vs. free function
move implementation outside the main header?
organization of the functions (I am thinking of reorganizing the large if in the parser, maybe using re2c)
improvements of the low-level shifting/conversion stuff
byte vector vs. containers vs. ???

CyberBrainX commented 7 years ago

Wow, I couldn’t believe how fast you did implement a first version. And it seems to be working! I think it’s a great benefit being able to exchange data with the MessagePack world. Except binary data, but binary data and JSON is a bad match anyway (base 64 encoding).

I did test the space savings in some of my use cases and got the following results. In simple cases (small data set) with almost no numeric data the saving was 12% over the compacted JSON file. Zipping alone showed a 15% reduction. First converting to msgpack and zipping the result gave a 19% reduction. As expected, with more numeric data the compression rate did grow constantly. Nice!

The naming to_msgpack and from_msgpack makes sense and is future save if you like to add more serializers/deserializers.
I prefer the static function. It’s the right context and can be still used in a flexible way.
At the moment it doesn’t hurt to have just one header. It’s just more convenient.
I’am not sure if re2c would be necessary in the parsing part. A hand crafted parser could be faster (the same is true for the json part - rapid json). Using a switch statement (for the fixed values) or a jump table could possibly speed up parsing of the MessagePack data(needs some tests to be verified).
The low-level shifting/conversion stuff could be improved but the code will get more complicated (dealing with endianness).
A byte vector is fine. It has some nice properties. A lot of frameworks have chosen byte vectors for handling serialized data.

Thank you for your great JSON implementation!

user1095108 commented 7 years ago

I beg to differ. MsgPack is not standardized, whereas CBOR, for example, is. There are a myriad of binary data formats out there, but, for better or worse, CBOR is standardized in RFC 7049, whereas MsgPack is not.

marton78 commented 7 years ago

+1 for CBOR!

nlohmann commented 7 years ago

I shall have a look at CBOR.

CyberBrainX commented 7 years ago

Ok, standards are great, I like them. CBOR is a PROPOSED standard and could still get changed. The proposal has some nice properties and it seems to be an improved MessagePack protocol. Unfortunately it’s not widely adopted and there are a lot of poor implementations. There’s no modern C++ 11 implementation and the old style C++ implementation is looking more like C code.

user1095108 commented 7 years ago

Oh, you wanted to call a library, I thought you wanted to write everything yourselves. But anyway, even it is "only" a proposed standard, this is still better than MSGPACK. Should it change, we will know about it, but we might not find out if MSGPACK changes.

nlohmann commented 7 years ago

I started to also implement CBOR. As the implementation tends to be rather verbose, I am thinking of moving the binary serialization/deserialization into a new header. In any case, I need to add more test cases.

marton78 commented 7 years ago

Hi @nlohmann, I've had a look at your code. The implementation could be less verbose by

Exploiting the orthogonality between length and content type. You could write a decode_length helper function that is called for all content types that can have a length. The function would take the length from the first byte if first_byte & 0x1F < 0x18 and call decode_uint8 / 16 / 32 / 64 otherwise. Then it would return a pointer to the first content byte. Similar for encoding.
Writing a jump table with 256 entries. This way you would also be sure that you don't forget anything.

nlohmann commented 7 years ago

@marton78 Good points. There may be even more possibilities to generalize between the MessagePack and CBOR sections given the two formats are so similar.

nlohmann commented 7 years ago

@marton78 I started to refactor the code and indeed it got much more concise, see a820d68. However, I think even after more refactoring, the whole CBOR/MessagePack business needs some 1000 lines of code...

nlohmann commented 7 years ago

I really want to move the code for MessagePack and CBOR out of the main header (it's over 1000 lines of code...) and I need some ideas how to best do this.

My thoughts:

C++ does not allow to extend the basic_json class in an additional file, so I can't just move the static functions (to_msgpack, from_msgpack, etc.) to a new header.
I could move these functions in the nlohmann namespace. So far, I think I do not rely on private functions of the basic_json class, so this could be possible.

The syntax would then be:

// assuming vec is a std::vector<uint8_t>
json j = nlohmann::from_msgpack(vec);
auto vec2 = nlohmann::to_msgpack(j);

I would be happy about opinions, pointers to best practices, etc.

Furthermore, any feedback on the interfaces, in particular using std::vector<uint8_t> as input/ouput would be more than welcome.

gregmarr commented 7 years ago

You could keep it in the class but in another file through a #include in the middle of the class definition, but it's really ugly.

If you did need access to private functions/data, you could forward declare a msgpack class in the nlohmann namespace and make it a friend of basic_json.

Probably the easiest is to just put the declarations of the long statics in the class definition, and then put the full definitions in another header.

user1095108 commented 7 years ago

Just write a separate library. You are outgrowing the bounds of this one. Slap in some RPC while writing the new one :)

nlohmann commented 7 years ago

If I would include the other header (say, json_binary.hpp), I would end json.hpp's independence, right? I would like to have the binary stuff separate from the rest, so you would only need to include the other header if you need MessagePack/CBOR support.

gregmarr commented 7 years ago

Yes, the second header should only be needed for callers of those functions.

user1095108 commented 7 years ago

I think the best thing to do would be to make a new repository entirely.

TurpentineDistillery commented 7 years ago

Or keep the functionality in the same file, ifdefed-out by default. Whoever needs it can enable it with #define NLOHMANN_JSON_ENABLE_CBOR In the source, or via a compiler switch.

nlohmann commented 7 years ago

Any idea how to move the code into a second file without also needing a preprocessor macro?

TurpentineDistillery commented 7 years ago

Put declarations in the main file, and definitions in another. Such files are typically named *.inl

TurpentineDistillery commented 7 years ago

I.e what @gregmarr said

nlohmann commented 7 years ago

Or another idea (just thinking): What about splitting the code into several files for development and have them merge by a script into a single (large) header?

nlohmann commented 7 years ago

(Maybe my judgement of a large header is wrong, but I fear that letting it grow (a) increases compilation times, (b) makes understanding the code harder, and (c) pushes a lot of functionality into code that may only need a fraction.)

TurpentineDistillery commented 7 years ago

A big benefit of having one file with declarations and one-or-many files with definitions is that it would allow for explicit template instantiation, so that the compilation overhead related to json.hpp can be factored out into a separate translation unit.

nlohmann commented 7 years ago

I am currently cleaning up and bringing the test coverage to 100%. I hope to release CBOR/MessagePack by next week.

nlohmann commented 7 years ago

Merged e906933.

marton78 commented 7 years ago

Hi @nlohmann, great job! I have looked at your code, some remarks:

When converting single / double, you have reimplemented memcpy. Why don't you just use it? (or std::copy, would be actually even nicer).
you could #define all those magic numbers (or put them in an enum)
msgpack / cbor are so similar that a lot of code is duplicated. probably you could template those conversions based on serialization format (so, two structs: msgpack and cbor, both of which contain all magic numbers, limits for the integers to fit in the first byte and the conversion functions from small integer to one byte)

All in all very nice job, thanks!

nlohmann commented 7 years ago

@marton78 Thanks for the feedback.

Regarding memcpy: I need to have another look.
Regarding the magic numbers / reduce code duplication: There are in fact a lot of similarities between CBOR and MessagePack, but I doubt that creating a configurable super type from which both formats inherit makes the code much more readable. I may play around with it, but right now I'd rather focus on writing down the limitations of the implementations (i.e., which types are not supported).

nlohmann commented 7 years ago

@marton78 Sorry, but I can't find the spot where I can use memcpy/std::copy.

TurpentineDistillery commented 7 years ago

Maybe it's this part?

            case value_t::number_float:
            {
                // Double-Precision Float
                v.push_back(0xfb);
                const uint8_t* helper = reinterpret_cast<const uint8_t*>(&(j.m_value.number_float));
                for (size_t i = 0; i < 8; ++i)
                {
                    v.push_back(helper[7 - i]);
                }
                break;
            }

Also, if this part of code expects the number_float_t to be double, a static_assert would be appropriate?

marton78 commented 7 years ago

Yes, it's this part, std::copy_backward could be used here instead of memcpy though, since copying is done in reverse.

TurpentineDistillery commented 7 years ago

This looks like the endianness handling business? If so, it can't be assumed that the host platform is little-endian. The ntoh and hton family functions are there to take the host endianness into account and reverse the bytes efficiently as appropriate, but these are not portable. It looks like boost endian library also does this and compiles to efficient byte-swapping instructions too (not saying we should add boost dependency).

user1095108 commented 7 years ago

Lol definetly not, look at my swap file to get some ideas:

https://github.com/user1095108/generic/blob/master/swap.hpp

nlohmann commented 7 years ago

FYI: I asked to advertize the latest release on the CBOR (https://github.com/cbor/cbor.github.io/issues/25) and MessagePack (https://github.com/msgpack/msgpack/issues/221) website.

nlohmann / json

Request: binary serialization/deserialization #358