mtiller / recon

Web and network friendly simulation data formats
MIT License

Investigate Msgpack #11

Closed xogeny closed 10 years ago

xogeny commented 10 years ago

Based on the results in #9, I recognize that BSON is actually very inefficient for arrays.

I researched this and looked at BJSON, UBJSON, Protocol Buffers, Thrift and Smile before finally deciding that the best supported and most compact format (across Java, C and Python) appears to be msgpack.

So I'm going to investigate this by refactoring the current code to have modular serialization/deserialization capabilities for some side by side comparisons.

xogeny commented 10 years ago

One consequence of msgpack's compactness (which I had not anticipated) is that it affects the header size (which needs to be a fixed size). It looks like it uses shorter integer encodings whenever the values are small enough. Hopefully I can work around this by using long types everywhere and forcing it to store them as such...
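To illustrate the effect, here is a minimal sketch of how the msgpack format picks the smallest integer encoding. It is written against the msgpack specification using only Python's `struct` module. The helper name `pack_uint` is hypothetical and is not taken from the recon code.

```python
import struct


def pack_uint(value: int) -> bytes:
    """Encode a non-negative integer using the smallest msgpack form.

    Hypothetical helper that follows the msgpack spec; not recon code.
    """
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    if value <= 0x7F:
        return struct.pack("B", value)            # positive fixint: 1 byte
    if value <= 0xFF:
        return struct.pack(">BB", 0xCC, value)    # uint8: 2 bytes
    if value <= 0xFFFF:
        return struct.pack(">BH", 0xCD, value)    # uint16: 3 bytes
    if value <= 0xFFFFFFFF:
        return struct.pack(">BI", 0xCE, value)    # uint32: 5 bytes
    if value <= 0xFFFFFFFFFFFFFFFF:
        return struct.pack(">BQ", 0xCF, value)    # uint64: 9 bytes
    raise OverflowError("value does not fit in 64 bits")


# The encoded size depends on the value, so a header holding such
# integers cannot have a fixed size.
sizes = {v: len(pack_uint(v)) for v in (5, 200, 70_000, 5_000_000_000)}
print(sizes)
```

The same logical field can occupy anywhere from 1 to 9 bytes, which is why a header built from plain msgpack integers changes length as its contents change.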

xogeny commented 10 years ago

So far, it looks like I cannot suppress this optimization. Hmmm...

xogeny commented 10 years ago

OK, optimization issue resolved by storing placeholders in the index that represent the worst-case size (I just stored 4 bytes as a pure byte string).
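A rough sketch of that workaround, assuming the placeholder holds an unsigned 32-bit value such as an offset and is wrapped in the msgpack `bin 8` form (marker `0xC4`). Both are assumptions: the comment above only says that 4 bytes were stored as a byte string, and an older msgpack library might have used the raw/str form instead. The function names are hypothetical.

```python
import struct

BIN8_MARKER = 0xC4  # msgpack "bin 8": marker byte, 1-byte length, payload


def pack_fixed_u32(value: int) -> bytes:
    """Wrap a big-endian uint32 in a 4-byte msgpack byte string.

    Hypothetical helper; the encoded size is 6 bytes for every value.
    """
    payload = struct.pack(">I", value)
    return bytes([BIN8_MARKER, len(payload)]) + payload


def unpack_fixed_u32(data: bytes) -> int:
    """Reverse pack_fixed_u32, checking the marker and length bytes."""
    if len(data) < 6 or data[0] != BIN8_MARKER or data[1] != 4:
        raise ValueError("not a 4-byte bin 8 placeholder")
    return struct.unpack(">I", data[2:6])[0]


small = pack_fixed_u32(0)
large = pack_fixed_u32(2**32 - 1)
print(len(small), len(large))
```

Because the byte string always carries exactly 4 payload bytes, the encoder has nothing to shrink, so the entry can be written first and overwritten in place once the real value is known.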

xogeny commented 10 years ago

Looking at msgpack storage of arrays of doubles, the storage requirement is 3 + 9n bytes (where n is the number of doubles). Compared to a raw format storage (8n bytes), we ultimately lose about 12.5% for large n. We'll just have to live with that.
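The arithmetic can be checked with a small sketch that encodes an array of doubles by hand, following the msgpack specification. The 3-byte header corresponds to the `array 16` form, which the spec uses for 16 to 65535 elements. Shorter arrays get a 1-byte header and longer ones a 5-byte header. The helper name `pack_double_array` is hypothetical.

```python
import struct


def pack_double_array(values) -> bytes:
    """Encode a sequence of floats as a msgpack array of float 64 values.

    Hypothetical helper written against the msgpack spec.
    """
    n = len(values)
    if n <= 15:
        header = bytes([0x90 | n])               # fixarray: 1 byte
    elif n <= 0xFFFF:
        header = struct.pack(">BH", 0xDC, n)     # array 16: 3 bytes
    else:
        header = struct.pack(">BI", 0xDD, n)     # array 32: 5 bytes
    # Each double costs 1 marker byte (0xCB) plus 8 bytes of IEEE 754 data.
    body = b"".join(struct.pack(">Bd", 0xCB, v) for v in values)
    return header + body


n = 1000
values = [float(i) for i in range(n)]
packed_size = len(pack_double_array(values))
raw_size = 8 * n
overhead = packed_size / raw_size - 1
print(packed_size, raw_size, round(overhead, 4))
```

The per-element marker byte dominates the cost, so the overhead approaches 1/8 as n grows regardless of which array header is used.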

In exchange, we gain a format that is pretty much universally readable and one that requires at most 4 reads (without caching) to extract the data for a given signal in a given table.

So I think msgpack is the way to go here. I'll merge this shortly.