Thoughts on untagged binary format

I tried the untagged binary format, and it just works, which is amazing. I have a few questions though. First, the output size really depends on how you define the data:

struct A {
    uint64_t a;
};
struct A1 {
    uint8_t a;
};
A a;
a.a = 20;
A1 a1;
a1.a = 20;

In this example, a takes 8 bytes and a1 takes 1 byte, even though both hold the same value. I think this is because BEVE writes uint8_t as 1 byte and uint64_t as 8 bytes regardless of the value stored. I honestly think a much better way would be to write all unsigned integers as compressed integers, and signed integers as compressed integers with one extra byte to indicate the sign.
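To make the size difference concrete, here is a minimal sketch of one common "compressed integer" scheme, unsigned LEB128 (7 payload bits per byte). This is an assumption for illustration: the post does not say which scheme it has in mind, and `write_varint` / `read_varint` are hypothetical helper names, not part of Glaze or BEVE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: append `value` as an unsigned LEB128 varint.
// Each byte carries 7 payload bits; the high bit is set while more bytes follow.
inline void write_varint(std::vector<std::uint8_t>& out, std::uint64_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>(value | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Hypothetical helper: read a varint starting at `offset` and advance
// `offset` past the bytes that were consumed.
inline std::uint64_t read_varint(const std::vector<std::uint8_t>& in, std::size_t& offset) {
    std::uint64_t value = 0;
    for (int shift = 0; shift < 64; shift += 7) {
        assert(offset < in.size() && "truncated varint");
        const std::uint8_t byte = in[offset++];
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) break;  // last byte of this integer
    }
    return value;
}
```

With this scheme the value 20 occupies one byte whether it came from a uint8_t or a uint64_t. The trade-off is that the largest uint64_t values need 10 bytes instead of 8, and every integer has to be decoded individually.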

Also, I don't know whether headers are needed. I avoid them in my own code, which uses refl-cpp, by just having function overloads. The code doesn't need to know what the data represents: every part of the structure reads its own part, increments the offset for the next part, and so on:

void write(const std::string& data);
void write(uint64_t data);
void write(bool data);
...
void read(std::string& data);
void read(uint64_t& data);
void read(bool& data);

And then just iterate over reflections and call write or read:

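template <typename T>
void serialize(T& value) {
    // Visit every reflected member in declaration order.
    for_each(refl::reflect(value).members, [&](auto member) {
        // Only instantiate the write call for members that can be read.
        if constexpr (is_readable(member)) {
            write(member(value)); // or to read: read(member(value));
        }
    });
}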
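For readers without refl-cpp, the same idea can be sketched with only the standard library. Everything here is a hypothetical illustration, not the code from the linked repository: `Writer`, `Reader`, `Packet`, and the `members()` function (which stands in for reflection by returning a `std::tie` of the fields) are made-up names, and integers are written at fixed width in host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical header-less writer: each overload appends only its payload.
struct Writer {
    std::vector<std::uint8_t> buf;

    void write(std::uint64_t v) {
        std::uint8_t raw[sizeof v];
        std::memcpy(raw, &v, sizeof v);  // host byte order, fixed 8 bytes
        buf.insert(buf.end(), raw, raw + sizeof v);
    }
    void write(bool v) { buf.push_back(v ? 1 : 0); }
    void write(const std::string& s) {
        write(static_cast<std::uint64_t>(s.size()));  // length prefix
        buf.insert(buf.end(), s.begin(), s.end());
    }
};

// Hypothetical reader: each overload consumes its bytes and advances `offset`.
struct Reader {
    const std::vector<std::uint8_t>& buf;
    std::size_t offset = 0;

    void read(std::uint64_t& v) {
        assert(buf.size() - offset >= sizeof v);
        std::memcpy(&v, buf.data() + offset, sizeof v);
        offset += sizeof v;
    }
    void read(bool& v) {
        assert(offset < buf.size());
        v = buf[offset++] != 0;
    }
    void read(std::string& s) {
        std::uint64_t n = 0;
        read(n);
        assert(buf.size() - offset >= n);
        s.assign(reinterpret_cast<const char*>(buf.data() + offset), n);
        offset += n;
    }
};

// Example type; members() replaces reflection in this sketch.
struct Packet {
    std::uint64_t id;
    bool ok;
    std::string name;
    auto members() { return std::tie(id, ok, name); }
};

template <typename T>
void serialize(Writer& w, T& value) {
    std::apply([&](auto&... m) { (w.write(m), ...); }, value.members());
}

template <typename T>
void deserialize(Reader& r, T& value) {
    std::apply([&](auto&... m) { (r.read(m), ...); }, value.members());
}
```

Because no type or name information is written, the reader must visit the members in exactly the same order and with the same types as the writer. That is the schema dependency that headers normally remove.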

I wrote a small dummy writer using refl-cpp (I am not very fluent with C++20 concepts) and added it to https://github.com/kalradivyanshu/glaze_v2_issue (clone and run ./run_example.sh)

For this struct:

struct SD {
    uint8_t t;
    std::string sid;
    uint8_t sn;
    uint8_t ln;
    uint8_t sln;
    bool k;
    uint64_t sn_;
    uint8_t ffs;
    uint64_t fls;
    uint64_t fsn;
    uint64_t fgn;
    uint64_t ct;
    uint64_t t_;
    uint64_t fn_;
    uint64_t d_s;
    uint64_t pt;
    uint64_t fst;
    size_t dl;
};

Glaze untagged is 121 bytes and my writer is 38 bytes. That is a big difference, especially since untagged is meant to be optimized for space. Would love to hear your thoughts (on my code quality too, since I am mid at C++ lol).

Thanks for all your hard work!

Great thoughts. However, BEVE is highly concerned with performance. If you were to write all integers as compressed integers, you would have a 10X or greater performance loss for large arrays, because they could no longer be written with a simple memcpy. Also, if you simply run a compression algorithm on your BEVE data, you gain most of the compression benefits and it becomes entirely opt-in.

BEVE is designed to be easily compressed: a value of 20 in a uint64_t means that you have 7 consecutive zero bytes. If you often have numbers like this, then a compression algorithm will easily handle it.
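The zero-byte claim can be checked with a few lines. This is a small sketch; `zero_bytes` is a hypothetical helper written for this illustration, and it inspects the integer with shifts so the result does not depend on host endianness.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: count how many of the 8 bytes of `value` are zero.
constexpr std::size_t zero_bytes(std::uint64_t value) {
    std::size_t zeros = 0;
    for (int i = 0; i < 8; ++i) {
        if (((value >> (8 * i)) & 0xFF) == 0) {
            ++zeros;
        }
    }
    return zeros;
}

// 20 fits in the lowest byte, so the remaining 7 bytes are all zero.
static_assert(zero_bytes(20) == 7, "20 stored as uint64_t has 7 zero bytes");
```

Long runs of identical bytes like this are the kind of redundancy that general-purpose compressors are built to remove.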

When it comes to headers, they are necessary if the data is to be written to a file and loaded by another program without a schema. I much prefer schema-less formats, as they are much easier to debug and allow files to be archived long term without needing to save matching schema documents. BEVE is also designed to convert directly to/from JSON, so that's another requirement for headers.

I do like the idea of a header-less binary format that focuses on minimizing memory. I think it would be a good addition to Glaze. If you wanted to add this raw binary format to Glaze, I would be happy to merge it in. But, you may find that simply using a compression algorithm would solve your issues.

Couldn't a compromise be to use compressed integers for non-array values and fixed-size integers for arrays?

I do get the compression argument. I just worry about the performance impact of compressing: my use case is sending a lot of data over UDP, so compressing 1.5 KB packets over and over is not very efficient. I do agree that it can work for storing in files.

Regarding the header-less format, if I were to implement it in Glaze, where would I start? Can you give me some pointers?

Couldn't a compromise be to use compressed integers for non-array values and fixed-size integers for arrays?

It actually isn't a good compression mechanism for integers:

No means of storing the larger values (63 or 64 bit) of uint64_t and int64_t, so these types would require another byte to indicate their type.
With smaller arrays compression is more valuable, so using compressed integers for size indicators makes a lot of sense. It is more important to save some bytes on an array of 3 values than on an array of 3,000 values. But using this form of compression on integers in general means that we would often be making our file larger. For example, uint8_t values from 64 to 255 would require an extra byte, because the 2 size bits leave only 6 bits for the value. So we don't actually save anything for the majority of uint8_t values. The same is true for the other integer types: for 75% of the range (2 bits quarters our range) we don't get compression savings. The issue is made worse by the fact that we use power-of-2 byte counts to store integers. So if we were to store 16384 (2^14) in a uint16_t, we would have to bump the storage integer to a uint32_t. This adds 2 bytes to 75% of our uint16_t values. So you can see that this is generally a bad compression algorithm for integers and really only makes sense for compressing sizes of arrays and objects. A compression algorithm like LZ4 will usually (statistically) be much more efficient than compressing integers in the manner that you and BEVE have implemented.
Using a compression algorithm will also find patterns in your numbers that are next to each other. So a compression algorithm will handle a bunch of zeros much better than using the BEVE size indicator compression.
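The arithmetic in the points above can be checked with a short sketch. It assumes the layout described in this reply: 2 bits of the encoded integer select a total width of 1, 2, 4, or 8 bytes, leaving 6, 14, 30, or 62 bits for the value. `compressed_size` and `uint8_values_needing_extra_byte` are hypothetical names for this illustration, not Glaze functions.

```cpp
#include <cstdint>

// Hypothetical helper: bytes needed to store `n` when 2 bits are reserved
// for the width. Returns 0 when `n` does not fit in the remaining 62 bits.
constexpr int compressed_size(std::uint64_t n) {
    if (n < (std::uint64_t{1} << 6)) return 1;
    if (n < (std::uint64_t{1} << 14)) return 2;
    if (n < (std::uint64_t{1} << 30)) return 4;
    if (n < (std::uint64_t{1} << 62)) return 8;
    return 0;  // 63- and 64-bit values are not representable
}

// Count how many of the 256 uint8_t values need more than one byte.
constexpr int uint8_values_needing_extra_byte() {
    int count = 0;
    for (int v = 0; v < 256; ++v) {
        if (compressed_size(static_cast<std::uint64_t>(v)) > 1) {
            ++count;
        }
    }
    return count;
}

static_assert(compressed_size(63) == 1, "largest one-byte value");
static_assert(compressed_size(64) == 2, "first value that needs two bytes");
static_assert(compressed_size(16384) == 4, "2^14 no longer fits in two bytes");
static_assert(uint8_values_needing_extra_byte() == 192, "192 of 256 is 75%");
```

Under these assumptions, 192 of the 256 possible uint8_t values grow from one byte to two, which is where the 75% figure comes from.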

I do get the compression argument. I just worry about the performance impact of compressing...

I'll note that another argument for compression is that if you have strings and care about size (and network performance), then you probably should be compressing your data anyway, because compressing strings saves significant space and therefore transfer time.

High speed compression algorithms will run faster than 500 MB/s, and sending less data over UDP will also improve performance. So, you will likely gain back the compression time by needing to transfer less data. I think LZ4 is probably an excellent choice for your use case.
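As a rough back-of-the-envelope check, the sketch below turns the 500 MB/s figure into a per-packet cost for the 1.5 KB packets mentioned earlier in the thread. The 1 Gbit/s link speed is an assumption chosen only for comparison, and `compress_time_ns` and `wire_time_ns` are hypothetical helper names.

```cpp
#include <cstdint>

// Hypothetical helper: nanoseconds to compress `bytes` at a given throughput.
constexpr std::uint64_t compress_time_ns(std::uint64_t bytes,
                                         std::uint64_t bytes_per_second) {
    return bytes * 1'000'000'000ull / bytes_per_second;
}

// Hypothetical helper: nanoseconds to put `bytes` on a link of a given speed.
constexpr std::uint64_t wire_time_ns(std::uint64_t bytes,
                                     std::uint64_t bits_per_second) {
    return bytes * 8 * 1'000'000'000ull / bits_per_second;
}

// 1500 bytes at 500 MB/s: 3000 ns, i.e. about 3 microseconds per packet.
static_assert(compress_time_ns(1500, 500'000'000) == 3000, "compression cost");

// Assumed 1 Gbit/s link: 12000 ns, i.e. about 12 microseconds per packet.
static_assert(wire_time_ns(1500, 1'000'000'000) == 12000, "transmit cost");
```

These numbers ignore per-call overhead and the fact that small inputs often compress less well than large ones, so measuring on real packets is still needed.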

I would like to add some compression helpers to Glaze, to make it easier to work with BEVE and compression, and at my work I actually have the need for high speed compression as well. So, I'll be working on this in the near future. One thing to note is that if your system would allow two cores for serializing data, then we can actually run the compression algorithm in parallel with the BEVE serialization. This would mean that there would be almost zero overhead to compression, but it would use another thread. I'll write up an issue for this, because it is a feature I would like to have.

Regarding the header-less format, if I were to implement it in Glaze, where would I start? Can you give me some pointers?

I think the BEVE format works for everything you want, except for headers within structs and tuple-like arrays.

Thanks for getting me to consider this more, because I'm now thinking we don't need to implement a completely new format. Rather, I think we can add BEVE extensions for raw-byte objects and arrays. These wouldn't be schema-less, but would be great for where size is critical. And, I think adding them to a format that is generally schema-less and allows tags is a benefit, because the user can decide how much introspection they want versus message size.

I'll make a performance note as well: if a C++ struct is trivially copyable (std::is_trivially_copyable, e.g. it holds only trivial types like ints, bools, and floats), then we don't have to iterate over the elements of the struct and can simply memcpy the entire struct. This will provide a significant performance improvement for these kinds of structs and is extra motivation to support this header-less format.
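A minimal sketch of the whole-struct copy is shown below. It is not Glaze code: `write_raw`, `read_raw`, and `Sample` are hypothetical names. The sketch checks `std::is_trivially_copyable`, since that is the property `std::memcpy` of an object requires.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical helper: append the raw bytes of `value` in one memcpy.
template <typename T>
void write_raw(std::vector<std::uint8_t>& out, const T& value) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy requires a trivially copyable type");
    const std::size_t start = out.size();
    out.resize(start + sizeof(T));
    std::memcpy(out.data() + start, &value, sizeof(T));
}

// Hypothetical helper: read a T back from `in` at `offset` and advance it.
template <typename T>
void read_raw(const std::vector<std::uint8_t>& in, std::size_t& offset, T& value) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy requires a trivially copyable type");
    assert(in.size() - offset >= sizeof(T));
    std::memcpy(&value, in.data() + offset, sizeof(T));
    offset += sizeof(T);
}

// Example struct holding only trivial members.
struct Sample {
    std::uint64_t id;
    float x;
    bool flag;
};
```

Note that this copies padding bytes too, so the output is `sizeof(Sample)` bytes rather than the sum of the member sizes, and both ends must agree on layout and endianness.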

In conclusion, hold off on implementing a header-less format until I've figured out how best to add it to BEVE. In the meantime, I would recommend experimenting with LZ4 to see if it helps you.

Thank you so much for such a detailed response. I didn't think about the integer encoding, I will definitely look into LZ4 encoding! Thanks!

stephenberry / glaze

Thoughts on untagged binary format #687