Open kostaleonard opened 2 years ago
We needed to use the dill package instead of the builtin pickle because pickle does not allow for recursive serialization of classes: pickle serializes a class by reference (it stores only the class's qualified name), so any methods that the user redefines will reference the new version, not the version belonging to the object at serialization time.
See the pickle docs, Stack Overflow, and dill for more information.
To understand this problem, see `test_serialized_data_processor_uses_original_methods()` in `test_serialization.py`. If dill is switched to pickle, the test fails because the object loaded from the serialized representation uses the redefined methods.

The reason we need serialization to capture all method implementations is that a user may redefine a class in a new version of a project, which makes it impossible to know which version of the data processor class was used to produce the versioned dataset. Saving the commit is not sufficient to know this because the changes could be uncommitted.
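The failure mode is easy to reproduce in isolation. Below is a minimal sketch mirroring what the test checks; the class name is hypothetical rather than the project's actual API, and it assumes the class is defined in `__main__` with dill's default settings (`byref=False`, which serializes `__main__` classes by value):

```python
import pickle

import dill


class DataProcessor:
    def transform(self, x):
        return x * 2  # original representation


processor = DataProcessor()
pickled = pickle.dumps(processor)  # stores only a reference to the class
dilled = dill.dumps(processor)     # stores the class definition by value


class DataProcessor:  # the user redefines the class, e.g., in a new project version
    def transform(self, x):
        return x + 1  # new, incompatible representation


print(pickle.loads(pickled).transform(10))  # prints 11: uses the redefined method
print(dill.loads(dilled).transform(10))     # prints 20: keeps the original method
```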
Consider the following scenario. The user defines a data processor subclass and produces a versioned dataset. Later, the user decides that the versioned dataset should use a different representation and changes the data processor. If the original data processor is loaded and its methods were not also serialized (recursive serialization), then it will use the redefined methods, and the loaded data processor will not be able to transform new data to match its versioned dataset's representation. Unless the user knows the exact version of the data processor that corresponded to the versioned dataset (and this version is not necessarily tied to any commit), it is impossible to perform prediction on new data.
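To make the scenario concrete, here is a hedged sketch of the intended workflow; the file name and toy processor are illustrative, not the project's real classes or paths. The processor is serialized with dill alongside the versioned dataset, so prediction can always recover the exact preprocessing that produced it, regardless of later code changes:

```python
import dill


class DataProcessor:
    def transform(self, x):
        return [value * 2 for value in x]  # the v1 representation


# At dataset-publication time: serialize the processor next to the dataset.
with open("processor_v1.pkl", "wb") as f:
    dill.dump(DataProcessor(), f)

# Later, after the class may have been redefined or the commit lost:
with open("processor_v1.pkl", "rb") as f:
    processor_v1 = dill.load(f)

# New data can still be transformed to match the v1 dataset representation.
print(processor_v1.transform([1, 2, 3]))  # [2, 4, 6]
```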