Open kyllingstad opened 8 months ago
Suggestions for suitable storage formats are welcome. Note that according to the FMI spec, each subsimulator does its own serialisation and deserialisation, and all the co-simulator sees are binary blobs. So the format needs to support storage of arbitrary binary data.
@davidhjp01, you asked in another issue discussion whether you should start working on this issue. As noted in the issue description, though, this depends on #768, which is still a work in progress, so there is a limit to how much can be done yet.
It might be good to start looking into suitable file formats for the saved state, though. We need a format which can store the contents of a `cosim::serialization::node`, i.e., a hierarchical data structure with numerical, textual, and binary data types (see `node_data` for a list of the types).
Personally, I would prefer something which is lightweight in terms of features, complexity, and additional dependencies, but efficiency is also a factor. I guess we can discuss where the best trade-off lies when we have some alternatives on the table.
Once we've decided on a storage format, it is also possible to write the functions to save/load a generic `cosim::serialization::node` to/from a file even if #768 is not completely done yet.
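To make the save/load idea concrete, here is a minimal sketch of what such functions could look like. Everything in it is an assumption for illustration: `node` is a simplified stand-in, not the real `cosim::serialization::node`, it covers only three of the data types, and the on-disk layout is a throwaway tag/length/value scheme rather than any of the candidate formats.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for cosim::serialization::node: a named value with
// child nodes. The real type supports more data types than these three.
using blob = std::vector<std::uint8_t>;
using value = std::variant<std::int64_t, std::string, blob>;

struct node
{
    std::string name;
    value data;
    std::vector<node> children;
};

// 64-bit integers are written little-endian, independent of host byte order.
void write_u64(std::ostream& out, std::uint64_t v)
{
    for (int i = 0; i < 8; ++i) out.put(static_cast<char>((v >> (8 * i)) & 0xff));
}

std::uint64_t read_u64(std::istream& in)
{
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i) {
        const int c = in.get();
        if (c == std::istream::traits_type::eof()) throw std::runtime_error("truncated input");
        v |= static_cast<std::uint64_t>(static_cast<unsigned char>(c)) << (8 * i);
    }
    return v;
}

// Length-prefixed byte run, used for names, strings, and binary blobs.
void write_bytes(std::ostream& out, const char* p, std::size_t n)
{
    write_u64(out, n);
    out.write(p, static_cast<std::streamsize>(n));
}

std::string read_bytes(std::istream& in)
{
    // Note: a real implementation should sanity-check the length first.
    std::string s(static_cast<std::size_t>(read_u64(in)), '\0');
    in.read(s.data(), static_cast<std::streamsize>(s.size()));
    if (!in) throw std::runtime_error("truncated input");
    return s;
}

// Writes name, type tag, payload, child count, then each child recursively.
void save(std::ostream& out, const node& n)
{
    assert(!n.data.valueless_by_exception());
    write_bytes(out, n.name.data(), n.name.size());
    out.put(static_cast<char>(n.data.index()));
    if (const auto* i = std::get_if<std::int64_t>(&n.data)) {
        write_u64(out, static_cast<std::uint64_t>(*i));
    } else if (const auto* s = std::get_if<std::string>(&n.data)) {
        write_bytes(out, s->data(), s->size());
    } else {
        const auto& b = std::get<blob>(n.data);
        write_bytes(out, reinterpret_cast<const char*>(b.data()), b.size());
    }
    write_u64(out, n.children.size());
    for (const auto& c : n.children) save(out, c);
}

node load(std::istream& in)
{
    node n;
    n.name = read_bytes(in);
    switch (in.get()) {
        case 0: n.data = static_cast<std::int64_t>(read_u64(in)); break;
        case 1: n.data = read_bytes(in); break;
        case 2: {
            const std::string s = read_bytes(in);
            n.data = blob(s.begin(), s.end());
            break;
        }
        default: throw std::runtime_error("unknown type tag or truncated input");
    }
    const std::uint64_t count = read_u64(in);
    for (std::uint64_t i = 0; i < count; ++i) n.children.push_back(load(in));
    return n;
}

// File wrappers; binary mode matters because the payload may hold any bytes.
void save_file(const std::string& path, const node& n)
{
    std::ofstream out(path, std::ios::binary);
    if (!out) throw std::runtime_error("cannot open " + path);
    save(out, n);
}

node load_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    return load(in);
}
```

Because `save` and `load` work on plain streams, the same code can be exercised against a `std::stringstream` in tests. Swapping in a real format later would only change the bodies of these two functions.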
AI-generated list 😅:
Format | Efficiency | Memory Usage | Ease of Use | Library Size |
---|---|---|---|---|
Protocol Buffers (Protobuf) | Highly efficient in terms of serialization and deserialization speed | Designed to be memory-efficient, especially with arena allocation | Requires defining a schema using proto files, which can be a learning curve | Relatively lightweight, no built-in compression |
MessagePack | Known for high efficiency, providing fast serialization and deserialization | Memory-efficient, reduces size of serialized data significantly compared to JSON | Easy to use and integrates well with various programming languages | Small and compact, suitable for limited resource environments |
HDF5 (Hierarchical Data Format) | Highly efficient for handling large datasets, supports parallel I/O operations | Designed to manage large amounts of data efficiently, though files can be large | Steeper learning curve due to complexity, offers extensive features | Relatively large due to comprehensive feature set |
CBOR (Concise Binary Object Representation) | Efficient in terms of serialization speed and compactness of data | Designed to be memory-efficient, suitable for devices with limited resources | Easy to use, does not require a schema, similar to JSON | Small and lightweight |
Avro | Efficient for serialization and deserialization, especially in big data environments | Space-efficient, does not store field type information with each field | Requires defining a schema in JSON, which adds an extra step | Moderately sized, balancing features and performance |
BSON (Binary JSON) | Efficient for storage and scan speed, though encoded data can be larger than the equivalent JSON in some cases | Uses more memory than JSON due to length prefixes and explicit array indices | Easy to use, especially for those familiar with JSON | Relatively small, integrates well with MongoDB |
FlatBuffers | Designed for maximum memory efficiency, allows direct access to serialized data | Highly memory-efficient, requires minimal allocations | More complex to use due to schema definition and direct memory access | Small and optimized for performance |
Nice summary. Without having spent a lot of time thinking about this, I immediately lean towards the simple and efficient schema-less formats, i.e., MessagePack, CBOR, or BSON. I don't have hands-on experience with any of them, but having read a bit about them, I think CBOR looks most promising. It seems to have been designed as an improvement of MessagePack, is an IETF standard (which is good for stability and third-party support), and has multiple C++ implementations.
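To give a feel for how little machinery CBOR needs for our main use case (opaque binary blobs plus a few scalars), here is a hand-rolled sketch of the encoding rules from RFC 8949 for unsigned integers, byte strings, and text strings. The function names are made up for illustration. In practice we would use an existing library rather than this, and the sketch does not check that text is valid UTF-8.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using bytes = std::vector<std::uint8_t>;

// Appends a CBOR head: a 3-bit major type in the top bits of the first byte,
// followed by an unsigned argument in the shortest big-endian form.
void append_head(bytes& out, std::uint8_t major_type, std::uint64_t arg)
{
    assert(major_type < 8);
    const auto top = static_cast<std::uint8_t>(major_type << 5);
    if (arg < 24) {
        // Small arguments are stored directly in the low 5 bits.
        out.push_back(static_cast<std::uint8_t>(top | arg));
        return;
    }
    int extra = 1;            // number of argument bytes that follow
    std::uint8_t info = 24;   // "additional information" value for that width
    if (arg > 0xffffffffULL) { extra = 8; info = 27; }
    else if (arg > 0xffffULL) { extra = 4; info = 26; }
    else if (arg > 0xffULL) { extra = 2; info = 25; }
    out.push_back(static_cast<std::uint8_t>(top | info));
    for (int shift = 8 * (extra - 1); shift >= 0; shift -= 8) {
        out.push_back(static_cast<std::uint8_t>((arg >> shift) & 0xff));
    }
}

// Major type 0: unsigned integer; the argument is the value itself.
bytes encode_uint(std::uint64_t v)
{
    bytes out;
    append_head(out, 0, v);
    return out;
}

// Major type 2: byte string; the argument is the length in bytes.
// This is the type an opaque subsimulator state blob would map to.
bytes encode_binary(const bytes& data)
{
    bytes out;
    append_head(out, 2, data.size());
    out.insert(out.end(), data.begin(), data.end());
    return out;
}

// Major type 3: text string; the argument is the length in bytes.
// The input is assumed to already be valid UTF-8.
bytes encode_text(const std::string& s)
{
    bytes out;
    append_head(out, 3, s.size());
    out.insert(out.end(), s.begin(), s.end());
    return out;
}
```

For example, `encode_binary` on the four bytes `01 02 03 04` yields `44 01 02 03 04`, which matches the example encodings in the appendix of RFC 8949. A blob therefore costs only one to nine bytes of overhead.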
I do have experience with Protocol Buffers, though, and while it is good in terms of performance and built-in versioning, I think I'd prefer to avoid the extra compilation step and use of machine-generated source code.
I can try out some of the options to identify potential candidates :)
This feature is desired in the OptiStress project, where we will need to simulate the same system many times in a loop with parameter variations. It will save a lot of time since we can start each simulation from a “warmed up” state.
Depends on #756 and #768.