go-sif / sif

Sif is a framework for fast, predictable, general-purpose distributed computing in the map/reduce paradigm.
Apache License 2.0
32 stars 3 forks source link

Reduce shuffle deserialization #22

Closed Ghnuberath closed 3 years ago

Ghnuberath commented 3 years ago

Since Sif's initial implementation did not include disk swapping for partitions, there are two separate code paths for partition serialization within the codebase (one for network transfer and one for disk swapping). This PR unifies those into a single code path, with the added benefit of eliminating a pointless double deserialization which was introduced by the implementation of disk swapping. Disk swapped partitions would previously be deserialized from disk, then serialized for network transfer, then deserialized on the destination worker. Now, a disk swapped partition will be transferred as-is without being deserialized on the originating worker.

This should be a major performance improvement for any real-world workload, though might not have much impact on small unit tests which do not really stress the disk serialization substantially.

A small API-breaking change is present in this PR. Collect() now takes an int32 instead of an int64.