msr-fiddle / pipedream

MIT License

Handling uneven number of batches per replicated instance of a layer #59

Open siddharth9820 opened 4 years ago

siddharth9820 commented 4 years ago

This is in reference to the num_iterations function in runtime.py:

def num_iterations(self, loader_size):
    """ Determines number of iterations for this stage
    TODO: don't currently support uneven configurations.
    """
    if self.stage == 0 or self.stage is None:
        return loader_size

    num_iterations = loader_size * self.num_ranks_in_first_stage
    assert num_iterations % self.num_ranks_in_stage == 0
    num_iterations = num_iterations // self.num_ranks_in_stage

    return num_iterations

From my understanding, for this function not to raise an AssertionError, the total number of batches (loader_size * self.num_ranks_in_first_stage) must be a multiple of the replication factor (self.num_ranks_in_stage) of every stage except the first. However, there is no guarantee that PipeDream's optimizer module will assign replication factors that satisfy this constraint. As a result, the framework is sometimes unable to run training at all with the configuration the optimizer produces.
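
To make the constraint concrete, here is a minimal standalone sketch of the same arithmetic. The helper names stage_iterations and largest_even_loader_size are hypothetical and not part of PipeDream. The second helper shows one possible workaround, which is an assumption on my part rather than anything the runtime does: drop trailing batches so that the loader size divides evenly for every stage.

```python
from math import gcd


def stage_iterations(loader_size, ranks_in_first_stage, ranks_in_stage):
    """Standalone mirror of the arithmetic in num_iterations (hypothetical helper).

    Applies to stages after the first. Returns the per-rank iteration count,
    or None in the cases where the original function would hit its assert.
    """
    total_batches = loader_size * ranks_in_first_stage
    if total_batches % ranks_in_stage != 0:
        return None
    return total_batches // ranks_in_stage


def largest_even_loader_size(loader_size, ranks_in_first_stage, ranks_per_stage):
    """Largest size <= loader_size that divides evenly for every stage.

    Assumed workaround: the caller would truncate its data loader to this size.
    """
    step = 1
    for ranks in ranks_per_stage:
        # loader_size must be a multiple of ranks / gcd(ranks, first-stage ranks)
        need = ranks // gcd(ranks, ranks_in_first_stage)
        step = step * need // gcd(step, need)  # least common multiple so far
    return (loader_size // step) * step


print(stage_iterations(100, 3, 4))               # 75: 300 batches over 4 ranks
print(stage_iterations(100, 2, 3))               # None: 200 is not divisible by 3
print(largest_even_loader_size(100, 2, [2, 3]))  # 99: 198 batches divide by 2 and 3
```

With a first stage replicated 2 times and a later stage replicated 3 times, a loader of size 100 hits the assertion, while truncating it to 99 batches per first-stage rank would not.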