Closed kalradivyanshu closed 10 months ago
Great thoughts. However, BEVE is highly concerned with performance. If you were to write all integers as compressed integers, you would have a 10X or greater performance loss for large arrays (because they could no longer be written with a simple memcpy). Also, if you simply run a compression algorithm over your BEVE data, you gain most of the compression benefits and it becomes entirely opt-in.
BEVE is designed to be easily compressed: a value of 20 in a uint64_t means that you have 7 consecutive zero bytes. If you often have numbers like this, a compression algorithm will easily handle it.
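To make the zero-byte point concrete, here is a minimal sketch that counts the zero bytes in the fixed-width representation of a 64-bit value. The helper names raw_bytes and zero_byte_count are hypothetical and are not part of Glaze or BEVE; the count does not depend on byte order.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy the raw bytes of a 64-bit value, the way a
// fixed-width binary format would store them.
std::array<unsigned char, sizeof(std::uint64_t)> raw_bytes(std::uint64_t value) {
    std::array<unsigned char, sizeof(std::uint64_t)> bytes{};
    std::memcpy(bytes.data(), &value, sizeof(value));
    return bytes;
}

// Hypothetical helper: count how many of those bytes are zero.
// Byte order changes where the zeros sit, not how many there are.
std::size_t zero_byte_count(std::uint64_t value) {
    std::size_t zeros = 0;
    for (unsigned char b : raw_bytes(value)) {
        if (b == 0) {
            ++zeros;
        }
    }
    return zeros;
}
```

Calling zero_byte_count(20) gives 7, which is the kind of redundancy a general-purpose compressor can exploit.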
When it comes to headers, they are necessary if the data is to be written to file and loaded by another program without having a schema. I much prefer schema-less formats, as they are much easier to debug and allow files to be archived long term without needing to save matching schema documents. BEVE is also designed to convert directly to/from JSON, so that's another requirement for headers.
I do like the idea of a header-less binary format that focuses on minimizing memory. I think it would be a good addition to Glaze. If you wanted to add this raw binary format to Glaze, I would be happy to merge it in. But, you may find that simply using a compression algorithm would solve your issues.
Couldn't a compromise be compressed integers for non-array values and sized integers for arrays?
I do get the compression argument, I just worry about the performance impact of compressing (my use case is sending a lot of data over UDP, so compressing 1.5 KB packets over and over is not very efficient). I do agree that it can work for storing in files.
Regarding the header-less format, if I were to implement it in glaze, where do I start, can you give me some pointers?
Couldn't a compromise be compressed integers for non-array values and sized integers for arrays?
It actually isn't a good compression mechanism for integers:
- The full range of uint64_t and int64_t does not fit once 2 bits are reserved for the size, so these types would require another byte to indicate their type.
- uint8_t values from 64 to 255 would require an extra byte. So, we don't actually save anything for the majority of uint8_t values. The same is true for the other integer types: 75% of the time (2 bits quarter our range) we don't get compression savings. The issue is made worse by the fact that we use power-of-2 byte counts to store integers. So if we were to store 16384 (2^14) in a uint16_t, we would have to bump the storage integer to a uint32_t. This adds 2 bytes to 75% of our uint16_t values.
So, you can see that this is generally a bad compression algorithm for integers and really only makes sense for compressing sizes of arrays and objects. A compression algorithm like LZ4 will usually (statistically) be much more efficient than compressing integers in the manner that you and BEVE have implemented.
I do get the compression argument, I just worry about the performance impact of compressing...
I'll note that another argument for compression is that if you have strings and care about size (and network performance), then you probably should be compressing your data, because compressing strings will significantly save space and therefore transfer time.
High speed compression algorithms will run faster than 500 MB/s, and sending less data over UDP will also improve performance. So, you will likely gain back the compression time by needing to transfer less data. I think LZ4 is probably an excellent choice for your use case.
I would like to add some compression helpers to Glaze, to make it easier to work with BEVE and compression, and at my work I actually have the need for high speed compression as well. So, I'll be working on this in the near future. One thing to note is that if your system would allow two cores for serializing data, then we can actually run the compression algorithm in parallel with the BEVE serialization. This would mean that there would be almost zero overhead to compression, but it would use another thread. I'll write up an issue for this, because it is a feature I would like to have.
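As a side note on the integer-size arithmetic earlier in this reply, the sketch below computes how many bytes a value would occupy if 2 bits of the encoding select a width of 1, 2, 4, or 8 bytes, leaving 6, 14, 30, or 62 bits for the value. This is an illustration of the scheme as described in this thread, not code taken from Glaze; the function name compressed_size and the convention of returning 0 for values that do not fit are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bytes needed when 2 bits of the encoding select a
// total width of 1, 2, 4, or 8 bytes, leaving 6, 14, 30, or 62 value bits.
// Returns 0 when the value needs more than 62 bits (sketch convention only).
constexpr std::size_t compressed_size(std::uint64_t value) {
    if (value < (1ull << 6)) return 1;
    if (value < (1ull << 14)) return 2;
    if (value < (1ull << 30)) return 4;
    if (value < (1ull << 62)) return 8;
    return 0;
}

// Only the lowest quarter of the uint8_t range still fits in one byte.
static_assert(compressed_size(63) == 1, "63 fits in 6 bits");
static_assert(compressed_size(64) == 2, "64 needs the 2-byte form");

// 16384 (2^14) no longer fits in 14 bits, so it jumps to the 4-byte form.
static_assert(compressed_size(16383) == 2, "16383 fits in 14 bits");
static_assert(compressed_size(16384) == 4, "16384 needs the 4-byte form");
```

Under these assumptions a uint16_t holding 16384 grows from 2 bytes to 4, which is the overhead described above.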
Regarding the header-less format, if I were to implement it in glaze, where do I start, can you give me some pointers?
I think the BEVE format works for everything you want, except for headers within structs and tuple-like arrays.
Thanks for getting me to consider this more, because I'm now thinking we don't need to implement a completely new format. Rather, I think we can add BEVE extensions for raw-byte objects and arrays. These wouldn't be schema-less, but would be great for where size is critical. And, I think adding them to a format that is generally schema-less and allows tags is a benefit, because the user can decide how much introspection they want versus message size.
I'll make a performance note as well: if a C++ struct is_standard_layout and trivially copyable (it holds only trivial types like ints, bools, and floats), then we don't have to iterate over the elements of the struct and can simply memcpy the entire struct. This will provide a significant performance improvement for these kinds of structs and is extra motivation to support this header-less format.
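A minimal sketch of that idea follows. The struct Sample and the helpers write_raw and read_raw are hypothetical names for illustration and are not Glaze APIs. Note that memcpy is only well-defined for trivially copyable types, so the sketch checks that trait in addition to standard layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical example struct holding only trivial members.
struct Sample {
    std::uint32_t id;
    float value;
    bool flag;
};

// Append the object's raw bytes to the buffer with a single memcpy,
// instead of visiting each member.
template <class T>
void write_raw(std::vector<unsigned char>& out, const T& obj) {
    static_assert(std::is_trivially_copyable<T>::value, "memcpy requires a trivially copyable type");
    static_assert(std::is_standard_layout<T>::value, "expected a standard-layout type");
    const std::size_t offset = out.size();
    out.resize(offset + sizeof(T));
    std::memcpy(out.data() + offset, &obj, sizeof(T));
}

// Read an object back from the buffer at the given byte offset.
template <class T>
T read_raw(const std::vector<unsigned char>& in, std::size_t offset) {
    static_assert(std::is_trivially_copyable<T>::value, "memcpy requires a trivially copyable type");
    assert(offset + sizeof(T) <= in.size());
    T obj{};
    std::memcpy(&obj, in.data() + offset, sizeof(T));
    return obj;
}
```

One trade-off of this approach is that padding bytes are copied too, so the written size is sizeof(T) rather than the sum of the member sizes, and both sides must agree on layout and byte order.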
In conclusion, hold off on implementing a header-less format until I've figured out how best to add it to BEVE. In the meantime, I would recommend experimenting with LZ4 to see if it helps you.
Thank you so much for such a detailed response. I didn't think about the integer encoding, I will definitely look into LZ4 encoding! Thanks!
I tried the untagged binary format, and it just works, which is amazing. I have a few questions though. First, it really depends on how you define the data:
In this example, a will take 8 bytes and a1 will take 1 byte, even though both represent the same data. I think this is because BEVE treats uint8_t as 1 byte and uint64_t as 8 bytes, no matter the input content. I honestly think a much better way would be for all unsigned integers to be compressed integers, and for signed integers to be compressed integers with one extra byte to indicate sign.
Also, I don't know if headers are needed? I overcome them in my own code, which uses refl-cpp, by just having function overrides, so I don't need to know what the data represents; every part of the structure just reads its part, increments the offset for the next part, and so on:
And then just iterate over reflections and call write or read:
I wrote a small dummy writer using refl-cpp (I am not very fluent with C++20 concepts) and added it to https://github.com/kalradivyanshu/glaze_v2_issue (clone and run ./run_example.sh)
for this struct:
Glaze untagged is 121 bytes and the writer is 38 bytes. That is a big difference, especially since untagged is meant to be optimized for space. Would love to hear your thoughts (on my code quality too, since I am mid at C++ lol).
Thanks for all your hard work!