Closed thalassemia closed 2 years ago
Ultimately, I decided on Solution 2 and updated the documentation to reflect the new API. The issues of `TypeCodec` not supporting custom serialization of built-ins or different serializers for homotypic objects can be sidestepped by manually defining serializers for individual stores using the `_serializer` key. These stores will, however, lose the performance advantage of using the BSON C extensions. The issue of slow document splitting is unavoidable, and, in any case, this solution is still almost always significantly faster than the current code.
To summarize, the remaining drawbacks are:
1. All dictionary keys must be strings or subclasses thereof.
2. Type matching is exact, meaning unique subclasses require their own codecs.
I think these downsides are pretty minor compared to the performance benefits this change will bring. We were already converting dictionary keys to strings prior to emitting them, so imposing constraint (1) makes the data more consistent between emits and the Stores. Violating constraint (2) currently throws warnings because it's less efficient, so we're already discouraging that behavior. I also can't think of any situations where you'd need to have different serializers for instances of the same type. If you did somehow need this functionality, you could just create a serializer for the type and have that serializer handle the two kinds of objects of that type differently. So overall, I'm in favor of this approach.
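To make drawback (2) and the suggested workaround concrete, here is a minimal standard-library sketch. It does not use PyMongo or vivarium-core; `Quantity`, `Concentration`, `REGISTRY`, and `encode` are hypothetical names invented for illustration, and the dictionary lookup only imitates exact type matching as described above.

```python
from typing import Any, Callable, Dict


class Quantity:
    """Hypothetical value type standing in for something a simulation emits."""

    def __init__(self, magnitude: float, unit: str) -> None:
        self.magnitude = magnitude
        self.unit = unit


class Concentration(Quantity):
    """Subclass of Quantity; exact matching treats it as a distinct type."""


def serialize_quantity(value: Quantity) -> Any:
    # One serializer per type, but it may branch on the object's own state
    # to encode two "kinds" of the same type differently.
    if value.unit == "":
        return value.magnitude
    return {"magnitude": value.magnitude, "unit": value.unit}


# Exact-type registry (assumption: mimics exact matching by keying on
# type(obj) rather than using isinstance()).
REGISTRY: Dict[type, Callable[[Any], Any]] = {Quantity: serialize_quantity}


def encode(value: Any) -> Any:
    """Serialize a registered custom type; unregistered types are rejected."""
    serializer = REGISTRY.get(type(value))
    if serializer is None:
        raise TypeError(f"no serializer registered for {type(value).__name__}")
    return serializer(value)
```

With this sketch, `encode(Quantity(1.5, "mM"))` returns a dictionary and `encode(Quantity(2.0, ""))` returns a bare float, both from the same serializer. `encode(Concentration(1.0, "mM"))` raises `TypeError` until the subclass is registered explicitly, for example with `REGISTRY[Concentration] = serialize_quantity`.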
Also, the type checks are failing.
I looked over the save state code, and we do not call any of the serialization routines in vivarium-core when creating the save state JSON file. Running from a saved state appeared to cause issues before, forcing Matt to add those lines in the `self.serializer` if block. It seems to work just fine now though.
Rationale
Serialization of emitter data can make up a large portion of simulation runtimes, according to profiling data.
Proposal
PyMongo implements its serialization logic in C and has the ability to accept custom encoders via the `TypeCodec` interface. Here, I modify the `Serializer` class to instantiate and return `TypeCodec` instances for serialization.
Results
When no document splitting is required, this PR nearly halves simulation time (running a simulation with a relatively large and complex emit). When document splitting is required at every timestep, the performance improvement drops to about 20%.
Drawbacks
`dict` --> `BSON` --> `dict`. The final dictionary will contain only BSON-supported types (e.g. built-ins), making the cost of our document splitting algorithm negligible. Interestingly, this round-trip scheme still yields a 40% improvement.
`RawBSONDocument` instance (which avoids PyMongo's internal serialization logic) and insert. This was implemented in b6b46f6 and ended up being significantly slower than Solution 2 even in the case where documents are split at every timestep.
`np.str_`). This is in line with the official JSON specifications, but it does break this test.
`TypeCodec`. This means that each `Process`, `Composer`, etc. requires its own `TypeCodec` as demonstrated here.
`TypeCodec` does not support custom de/serializers for Python's built-in types.
By creating this pull request, I agree to the Contributor License Agreement, which is available in `CLA.md` at the top level of this repository.