stephenberry / glaze

Extremely fast, in memory, JSON and interface library for modern C++
MIT License

Thoughts on untagged binary format #687

Closed kalradivyanshu closed 10 months ago

kalradivyanshu commented 10 months ago

I tried the untagged binary format, and it just works, which is amazing. I have a few questions, though. First, the output size really depends on how you define the data:

struct A {
    uint64_t a;
};
struct A1 {
    uint8_t a;
};
A a;
a.a = 20;
A1 a1;
a1.a = 20;

In this example, a takes 8 bytes and a1 takes 1 byte, even though both hold the same value. I think this is because BEVE treats uint8_t as 1 byte and uint64_t as 8 bytes, no matter what the content is. I honestly think a much better way would be to write all unsigned integers as compressed integers, and signed integers as compressed integers with one extra byte to indicate the sign.
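A minimal sketch of what such a compressed integer could look like, assuming a LEB128-style encoding (7 payload bits per byte). The scheme and the names `write_varint` and `read_varint` are assumptions for illustration; they are not part of Glaze or BEVE.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: append `value` as a LEB128-style varint
// (7 payload bits per byte, high bit set while more bytes follow).
inline void write_varint(std::vector<std::uint8_t>& out, std::uint64_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>(value | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Hypothetical helper: decode one varint starting at `offset` and advance it.
// No bounds checking: assumes the buffer holds a complete, valid varint.
inline std::uint64_t read_varint(const std::vector<std::uint8_t>& in, std::size_t& offset) {
    std::uint64_t value = 0;
    int shift = 0;
    while (true) {
        const std::uint8_t byte = in[offset++];
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) break;
        shift += 7;
    }
    return value;
}
```

With this scheme the value 20 occupies one byte whether the field is declared uint8_t or uint64_t. The trade-off is a loop per value, and a full-range uint64_t needs up to 10 bytes instead of 8.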

Also, I don't know if headers are needed. I avoid them in my own code, which uses refl-cpp, by just having function overloads, so I don't need to know what the data represents: every part of the structure reads its own part, increments the offset for the next part, and so on:

void write(const std::string& data);
void write(uint64_t data);
void write(bool data);
...
void read(std::string& data);
void read(uint64_t& data);
void read(bool& data);

And then just iterate over reflections and call write or read:

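template <typename T>
void serialize(T& value) {
    // Visit every reflected member in declaration order.
    for_each(refl::reflect(value).members, [&](auto member) {
        // Only readable members are written; the rest are skipped at compile time.
        if constexpr (is_readable(member)) {
            write(member(value)); // or, to read: read(member(value));
        }
    });
}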
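The same idea can be sketched with only the standard library (C++17). This is an illustration, not the code from the linked repository: the `Writer` and `Reader` types and the `members()` function are hypothetical stand-ins for reflection, and integers are written at fixed width for brevity.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical headerless writer: each overload appends only the payload.
struct Writer {
    std::vector<std::uint8_t> buf;
    void write(std::uint64_t v) {
        std::uint8_t raw[sizeof v];
        std::memcpy(raw, &v, sizeof v);  // native byte order, fixed width
        buf.insert(buf.end(), raw, raw + sizeof v);
    }
    void write(bool v) { buf.push_back(v ? 1 : 0); }
    void write(const std::string& s) {
        write(static_cast<std::uint64_t>(s.size()));  // length prefix
        buf.insert(buf.end(), s.begin(), s.end());
    }
};

// Hypothetical reader: mirrors Writer and advances `offset` after each field.
// No bounds checking: assumes the buffer was produced by the matching Writer.
struct Reader {
    const std::vector<std::uint8_t>& buf;
    std::size_t offset = 0;
    void read(std::uint64_t& v) {
        std::memcpy(&v, buf.data() + offset, sizeof v);
        offset += sizeof v;
    }
    void read(bool& v) { v = buf[offset++] != 0; }
    void read(std::string& s) {
        std::uint64_t n = 0;
        read(n);
        s.assign(reinterpret_cast<const char*>(buf.data() + offset),
                 static_cast<std::size_t>(n));
        offset += static_cast<std::size_t>(n);
    }
};

struct Msg {
    std::uint64_t id;
    bool ok;
    std::string name;
    // Stand-in for reflection: expose the members in declaration order.
    auto members() { return std::tie(id, ok, name); }
};

template <typename T>
void serialize(Writer& w, T& value) {
    std::apply([&](auto&... m) { (w.write(m), ...); }, value.members());
}

template <typename T>
void deserialize(Reader& r, T& value) {
    std::apply([&](auto&... m) { (r.read(m), ...); }, value.members());
}
```

Because nothing in the buffer describes the fields, reader and writer must agree on the struct definition ahead of time; that shared definition acts as the schema.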

I wrote a small dummy writer using refl-cpp (I am not very fluent with C++20 concepts) and added it to https://github.com/kalradivyanshu/glaze_v2_issue (clone and run ./run_example.sh).

For this struct:

struct SD {
    uint8_t t;
    std::string sid;
    uint8_t sn;
    uint8_t ln;
    uint8_t sln;
    bool k;
    uint64_t sn_;
    uint8_t ffs;
    uint64_t fls;
    uint64_t fsn;
    uint64_t fgn;
    uint64_t ct;
    uint64_t t_;
    uint64_t fn_;
    uint64_t d_s;
    uint64_t pt;
    uint64_t fst;
    size_t dl;
};

Glaze untagged is 121 bytes and my writer is 38 bytes. That is a big difference, especially since untagged is meant to be optimized for space. Would love to hear your thoughts (on my code quality too, since I am mid at C++ lol).

Thanks for all your hard work!

stephenberry commented 10 months ago

Great thoughts. However, BEVE is highly concerned with performance. If you were to write all integers as compressed integers, you would see a 10x or greater performance loss for large arrays (because they could no longer be written with a simple memcpy). Also, if you simply run a compression algorithm on your BEVE data, you gain most of the compression benefits, and compression becomes entirely opt-in.
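The memcpy point can be sketched as follows. Assuming fixed-width elements, a whole array is appended with one bulk copy; the name `write_fixed_array` is hypothetical and this is not Glaze's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: append a contiguous array of fixed-width integers
// with a single memcpy (native byte order, no per-element branching).
inline void write_fixed_array(std::vector<std::uint8_t>& out,
                              const std::vector<std::uint64_t>& values) {
    const std::size_t bytes = values.size() * sizeof(std::uint64_t);
    const std::size_t old_size = out.size();
    out.resize(old_size + bytes);
    if (bytes != 0) {
        std::memcpy(out.data() + old_size, values.data(), bytes);
    }
}
```

A compressed-integer encoder cannot take this shortcut: it has to visit every element and branch on its magnitude, which is where the slowdown for large arrays would come from.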

BEVE is designed to be easily compressed: a value of 20 in a uint64_t means that you have 7 consecutive zero bytes. If you often have numbers like this, a compression algorithm will handle them easily.
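To see why runs of zero bytes are cheap to compress, here is a toy run-length encoder for zeros. It is purely illustrative: it is not LZ4 or any algorithm Glaze uses, and `rle_zeros` and its output scheme are assumptions made for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy scheme (assumption for illustration): a run of zero bytes is emitted
// as the pair {0x00, run_length}; non-zero bytes are copied through unchanged.
inline std::vector<std::uint8_t> rle_zeros(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] != 0) {
            out.push_back(in[i++]);
            continue;
        }
        std::uint8_t run = 0;
        while (i < in.size() && in[i] == 0 && run < 255) {
            ++run;
            ++i;
        }
        out.push_back(0);
        out.push_back(run);
    }
    return out;
}
```

Fed the little-endian bytes of 20 stored in a uint64_t (`{20, 0, 0, 0, 0, 0, 0, 0}`), this toy encoder produces three bytes. General-purpose compressors exploit the same redundancy with more sophisticated matching.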

When it comes to headers, they are necessary if the data is to be written to a file and loaded by another program without a schema. I much prefer schema-less formats, as they are much easier to debug and allow files to be archived long term without needing to save matching schema documents. BEVE is also designed to convert directly to/from JSON, which is another reason headers are required.
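A small sketch of why a header makes a value self-describing. The one-byte tags below are hypothetical and chosen only for illustration; the real BEVE header layout is different.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <variant>
#include <vector>

// Hypothetical tags for illustration; real BEVE headers are laid out differently.
enum class Tag : std::uint8_t { U64 = 1, Str = 2 };

using Value = std::variant<std::uint64_t, std::string>;

inline void write_tagged(std::vector<std::uint8_t>& out, const Value& v) {
    if (const auto* n = std::get_if<std::uint64_t>(&v)) {
        out.push_back(static_cast<std::uint8_t>(Tag::U64));
        std::uint8_t raw[sizeof *n];
        std::memcpy(raw, n, sizeof *n);
        out.insert(out.end(), raw, raw + sizeof *n);
    } else {
        const auto& s = std::get<std::string>(v);
        out.push_back(static_cast<std::uint8_t>(Tag::Str));
        out.push_back(static_cast<std::uint8_t>(s.size()));  // toy: length < 256
        out.insert(out.end(), s.begin(), s.end());
    }
}

// The reader needs no schema: the tag says what follows.
// No bounds checking: assumes a buffer produced by write_tagged.
inline Value read_tagged(const std::vector<std::uint8_t>& in, std::size_t& offset) {
    const auto tag = static_cast<Tag>(in[offset++]);
    if (tag == Tag::U64) {
        std::uint64_t n = 0;
        std::memcpy(&n, in.data() + offset, sizeof n);
        offset += sizeof n;
        return n;
    }
    const std::size_t len = in[offset++];
    std::string s(reinterpret_cast<const char*>(in.data() + offset), len);
    offset += len;
    return s;
}
```

The extra byte per value is the cost of being able to decode the data, or convert it to JSON, without knowing the original struct.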

I do like the idea of a header-less binary format that focuses on minimizing memory. I think it would be a good addition to Glaze. If you wanted to add this raw binary format to Glaze, I would be happy to merge it in. But, you may find that simply using a compression algorithm would solve your issues.

kalradivyanshu commented 10 months ago

Couldn't a compromise be compressed integers for non-array values and fixed-size integers for arrays?

I do get the compression argument. I just worry about the performance impact of compressing (my use case is sending a lot of data over UDP, so compressing 1.5 KB packets over and over is not very efficient). I do agree that it can work for storing data in files.

Regarding the header-less format, if I were to implement it in Glaze, where do I start? Can you give me some pointers?

stephenberry commented 10 months ago

> Couldn't a compromise be compressed integers for non-array values and fixed-size integers for arrays?

It actually isn't a good compression mechanism for integers:

> I do get the compression argument. I just worry about the performance impact of compressing...

I'll note that another argument for compression is that if you have strings and care about size (and network performance), then you should probably be compressing your data anyway, because compressing strings saves significant space and therefore transfer time.

High-speed compression algorithms run faster than 500 MB/s, and sending less data over UDP will also improve performance. So you will likely gain back the compression time by needing to transfer less data. I think LZ4 is probably an excellent choice for your use case.
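As a back-of-the-envelope check using only the figures mentioned in this thread (1.5 KB packets, taken here as 1500 bytes, and 500 MB/s), the per-packet compression time works out to about 3 microseconds. These are the thread's numbers, not measurements, and `compress_seconds` is a hypothetical helper.

```cpp
// Rough estimate: seconds needed to compress one packet at a given throughput.
// Ignores per-call overhead, which can matter for very small inputs.
constexpr double compress_seconds(double packet_bytes, double bytes_per_second) {
    return packet_bytes / bytes_per_second;
}

// 1500 bytes at 500 MB/s is roughly 3e-6 s, i.e. about 3 microseconds.
constexpr double per_packet = compress_seconds(1500.0, 500e6);
```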

I would like to add some compression helpers to Glaze, to make it easier to work with BEVE and compression, and at my work I actually have the need for high speed compression as well. So, I'll be working on this in the near future. One thing to note is that if your system would allow two cores for serializing data, then we can actually run the compression algorithm in parallel with the BEVE serialization. This would mean that there would be almost zero overhead to compression, but it would use another thread. I'll write up an issue for this, because it is a feature I would like to have.
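The two-core idea can be sketched as a small producer/consumer pipeline. This is only an assumption about how such a helper might be structured, not an existing Glaze feature: `pipeline` is a hypothetical name, and the `serialize` and `compress` callables stand in for BEVE serialization and a real compressor such as LZ4.

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

using Buffer = std::vector<std::uint8_t>;

// Hypothetical two-stage pipeline: the calling thread serializes while a
// second thread compresses buffers that are already finished.
// Sketch only: no exception handling if `serialize` throws.
inline std::vector<Buffer> pipeline(std::size_t count,
                                    const std::function<Buffer(std::size_t)>& serialize,
                                    const std::function<Buffer(const Buffer&)>& compress) {
    std::queue<Buffer> pending;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    std::vector<Buffer> results;

    std::thread worker([&] {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return done || !pending.empty(); });
            if (pending.empty()) return;  // producer finished and queue drained
            Buffer next = std::move(pending.front());
            pending.pop();
            lock.unlock();                      // compress outside the lock
            results.push_back(compress(next));  // only the worker touches results
        }
    });

    for (std::size_t i = 0; i < count; ++i) {
        Buffer b = serialize(i);  // overlaps with compression of earlier buffers
        {
            std::lock_guard<std::mutex> lock(m);
            pending.push(std::move(b));
        }
        cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_one();
    worker.join();
    return results;  // safe to read once the worker has joined
}
```

With a single consumer the results keep the order in which the buffers were serialized. The overlap only pays off when each stage does enough work per buffer to outweigh the locking.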

> Regarding the header-less format, if I were to implement it in Glaze, where do I start? Can you give me some pointers?

I think the BEVE format works for everything you want, except for headers within structs and tuple-like arrays.

Thanks for getting me to consider this more, because I'm now thinking we don't need to implement a completely new format. Rather, I think we can add BEVE extensions for raw-byte objects and arrays. These wouldn't be schema-less, but would be great for where size is critical. And, I think adding them to a format that is generally schema-less and allows tags is a benefit, because the user can decide how much introspection they want versus message size.

I'll make a performance note as well: if a C++ struct is_standard_layout (holds trivial types like ints, bools, and floats), then we don't have to iterate over the elements of the struct and can simply memcpy the entire struct. This will provide a significant performance improvement for these kinds of structs and is extra motivation to support this header-less format.
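A minimal sketch of the whole-struct copy (C++17). One deliberate addition: besides `std::is_standard_layout`, the sketch also checks `std::is_trivially_copyable`, which is the trait the C++ standard ties to memcpy being well defined. That extra check and the names `write_raw`, `read_raw`, and `Point` are assumptions of this sketch, not something stated in the thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical helper: append a whole struct with one memcpy.
template <typename T>
void write_raw(std::vector<std::uint8_t>& out, const T& value) {
    static_assert(std::is_standard_layout_v<T> && std::is_trivially_copyable_v<T>,
                  "raw copy requires a standard-layout, trivially copyable type");
    const std::size_t old_size = out.size();
    out.resize(old_size + sizeof(T));
    std::memcpy(out.data() + old_size, &value, sizeof(T));
}

// Hypothetical helper: read a struct back from `offset`.
// No bounds checking: assumes at least sizeof(T) bytes are available.
template <typename T>
T read_raw(const std::vector<std::uint8_t>& in, std::size_t offset = 0) {
    static_assert(std::is_standard_layout_v<T> && std::is_trivially_copyable_v<T>,
                  "raw copy requires a standard-layout, trivially copyable type");
    T value;
    std::memcpy(&value, in.data() + offset, sizeof(T));
    return value;
}

// Example struct holding only trivial members.
struct Point {
    std::uint64_t id;
    float x;
    bool visible;
};
```

Note that the copy includes any padding bytes, so the output is `sizeof(T)` bytes rather than the sum of the member sizes, and both sides must share the same layout and byte order.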

In conclusion, hold off on implementing a header-less format until I've figured out how best to add it to BEVE. In the meantime, I would recommend experimenting with LZ4 and see if it helps you.

kalradivyanshu commented 10 months ago

Thank you so much for such a detailed response. I hadn't thought about the integer encoding. I will definitely look into LZ4! Thanks!