thalassemia closed 2 years ago
Previously, pickling and unpickling of processes could get quite convoluted and messy. During pickling, we saved the `parameters` instance variable (i.e. `self.parameters`). To unpickle, we called the `__init__` method on this saved `parameters` dictionary. This required users to be very careful about the values contained within `self.parameters`, could be slow (depending on process size and the complexity of its `__init__`), and caused unexpected results (like the bug noted above where process schemas are sometimes lost).
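To make the old behavior concrete, here is a minimal sketch of parameter-based pickling. The `Process` class below is a hypothetical stand-in, not the project's actual code, and the `schema_override` attribute is an assumption used only to show one plausible way that state living outside `self.parameters` can be lost on a round trip.

```python
import pickle


class Process:
    """Hypothetical stand-in for a process class (not the real implementation)."""

    def __init__(self, parameters=None):
        self.parameters = dict(parameters or {})
        # Derived in __init__, so it is recomputed on every unpickle.
        self.rate = self.parameters.get("rate", 1.0)
        # Set after construction; it is never stored in self.parameters.
        self.schema_override = None

    def __reduce__(self):
        # Save only the parameters dictionary. Unpickling then calls
        # Process(parameters), which re-runs __init__ from scratch.
        return (self.__class__, (self.parameters,))


original = Process({"rate": 2.5})
original.schema_override = {"mass": {"_default": 0.0}}

restored = pickle.loads(pickle.dumps(original))

print(restored.rate)             # rebuilt from the saved parameters
print(restored.schema_override)  # state outside parameters did not survive
```

Anything an expensive `__init__` computes is also recomputed on every unpickle in this scheme, which is where the slowdown mentioned above would come from.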
In 6c9a660, I reverted our custom process serialization code and was able to achieve a substantial performance improvement while rectifying a memory leak. To date, I have not been able to find the exact source of this leak when running with the custom serialization code, but I can confirm that removing that code fixes the issue entirely.
I've addressed the review comments and added a more informative error message to help users diagnose serialization issues.
BSON has a hidden maximum serialized size of 2GB (refer to this). In this PR, we switch to using the `orjson` package, which has no size limits and is generally faster than BSON anyway. Orjson also has the benefit of natively supporting Numpy arrays and types.

An important caveat is `np.str_`. Orjson will complain if dictionary keys are of the type `np.str_` or if the data to serialize contains arrays with `np.str_` values. The latter case is handled by a fallback Numpy array serializer, but users must manually ensure that all dictionary keys are Python strings and not Numpy strings.

Additionally, this PR makes some tweaks to multiprocessing:
- `ParallelProcesses` no longer keep a reference to the original `Process` instance after initializing a separate OS process. This reduces RAM usage.

By creating this pull request, I agree to the Contributor License Agreement, which is available in `CLA.md` at the top level of this repository.
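For the `np.str_` caveat described earlier, one way users might normalize dictionary keys before serialization is sketched below. The helper name `plain_str_keys` is hypothetical and not part of this PR. To keep the sketch free of third-party imports, `FakeNumpyStr` stands in for `np.str_`, which is likewise a subclass of `str`.

```python
class FakeNumpyStr(str):
    """Stand-in for np.str_ (also a str subclass), so no NumPy import is needed."""


def plain_str_keys(obj):
    """Recursively return a copy of obj whose str-like dict keys are plain str.

    Hypothetical helper. Tuples come back as lists, which matches how they
    would be written to JSON anyway.
    """
    if isinstance(obj, dict):
        return {
            (str(key) if isinstance(key, str) else key): plain_str_keys(value)
            for key, value in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        return [plain_str_keys(item) for item in obj]
    return obj


state = {
    FakeNumpyStr("cell"): {
        FakeNumpyStr("mass"): 1.0,
        "volumes": [{FakeNumpyStr("a"): 2}],
    }
}
clean = plain_str_keys(state)

print([type(key).__name__ for key in clean])
```

The cleaned dictionary could then be handed to `orjson.dumps`, presumably with `option=orjson.OPT_SERIALIZE_NUMPY` if it also contains Numpy arrays. The exact call used in this PR is not shown here.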