vivarium-collective / vivarium-core

Core Interface and Engine for Vivarium
https://vivarium-core.readthedocs.io/
Apache License 2.0

Optimize Serialization Using BSON C Extensions #204

Closed thalassemia closed 2 years ago

thalassemia commented 2 years ago

Rationale

Serialization of emitter data can make up a large portion of simulation runtimes, according to profiling data.

Proposal

PyMongo implements its serialization logic in C and has the ability to accept custom encoders via the TypeCodec interface. Here, I modify the Serializer class to instantiate and return TypeCodec instances for serialization.

Results

In a test simulation with a relatively large and complex emit, this PR nearly halves simulation time when no document splitting is required. When document splitting is required at every timestep, the performance improvement drops to about 20%.

Drawbacks

By creating this pull request, I agree to the Contributor License Agreement, which is available in CLA.md at the top level of this repository.

thalassemia commented 2 years ago

Ultimately, I decided on Solution 2 and updated the documentation to reflect the new API. TypeCodec supports neither custom serialization of built-in types nor different serializers for objects of the same type. Both limitations can be sidestepped by manually defining serializers for individual stores using the _serializer key. These stores will, however, lose the performance advantage of using the BSON C extensions. The issue of slow document splitting is unavoidable, and, in any case, this solution is still almost always significantly faster than the current code. To summarize, the remaining drawbacks are:

  1. All dictionary keys must be strings or subclasses thereof.
  2. Type matching is exact, meaning unique subclasses require their own codecs.

U8NWXD commented 2 years ago

> To summarize, the remaining drawbacks are:
>
> 1. All dictionary keys must be strings or subclasses thereof.
> 2. Type matching is exact, meaning unique subclasses require their own codecs.

I think these downsides are pretty minor compared to the performance benefits this change will bring. We were already converting dictionary keys to strings prior to emitting them, so imposing downside (1) makes the data more consistent between emits and the Stores. The inexact type matching that downside (2) rules out currently throws warnings because it's less efficient, so we're already discouraging that behavior. I also can't think of any situations where you'd need to have different serializers for instances of the same type. If you did somehow need this functionality, you could just create a serializer for the type and have that serializer handle the two kinds of objects of that type differently. So overall, I'm in favor of this approach.
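
A rough sketch of that last workaround, in plain Python rather than the vivarium-core or PyMongo API. Measurement and serialize_measurement are hypothetical names used only for illustration.

```python
class Measurement:
    """Hypothetical emitted type with two flavors sharing one class."""

    def __init__(self, value, unit=None):
        self.value = value
        self.unit = unit


def serialize_measurement(obj):
    """Single serializer for Measurement that branches on the instance."""
    if obj.unit is None:
        # Dimensionless flavor: emit the bare number.
        return obj.value
    # Flavor with a unit: emit a small dictionary with string keys.
    return {"value": obj.value, "unit": obj.unit}


plain = serialize_measurement(Measurement(3.0))
with_unit = serialize_measurement(Measurement(3.0, "fL"))
```

One serializer is registered for the type, and the branching happens inside it, so exact type matching is not a limitation here.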

U8NWXD commented 2 years ago

Also the type checks are failing

thalassemia commented 2 years ago

I looked over the save state code, and we do not call any of the serialization routines in vivarium-core when creating the save state JSON file. Running from a saved state appeared to cause issues before, forcing Matt to add those lines in the self.serializer if block. It seems to work just fine now though.