pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License

Fix an edge-case bug where state is incomplete if a checkpoint is requested soon after resume #1251

Closed andrewkho closed 4 months ago

andrewkho commented 4 months ago

StatefulDataLoader may return incomplete checkpoints if a state_dict is requested soon after a resume.

Consider a multiprocess dataloader with num_workers = 4. When we resume this dataloader and make it through at least 4 batches before the next state_dict is requested, we have fresh snapshots for all 4 workers again. However, if we only make it through, say, 2 batches, we are missing state for the other 2 workers.
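To make the scenario concrete, here is a minimal pure-Python simulation. It is a sketch only: `fresh_snapshots` and `merge_with_resumed` are hypothetical helpers, not torchdata internals. It assumes batches are handed out to workers round-robin, and the merge step shows one plausible way to fill the gap, which is not necessarily what this PR implements.

```python
# Illustrative simulation only; these helpers are hypothetical, not torchdata code.

def fresh_snapshots(num_workers, batches_consumed, start_worker=0):
    """Return {worker_id: snapshot} for workers that yielded a batch since resume.

    Assumes batches are served round-robin, starting at `start_worker`.
    """
    snapshots = {}
    for i in range(batches_consumed):
        worker_id = (start_worker + i) % num_workers
        snapshots[worker_id] = f"fresh-after-batch-{i}"
    return snapshots


def merge_with_resumed(resumed, fresh):
    """Fall back to the state loaded at resume for workers with no fresh snapshot."""
    merged = dict(resumed)
    merged.update(fresh)
    return merged


num_workers = 4
# State for every worker, as loaded from the checkpoint we resumed from.
resumed = {w: f"resumed-{w}" for w in range(num_workers)}

# Only 2 batches consumed after resume: just workers 3 and 0 have fresh snapshots.
partial = fresh_snapshots(num_workers, batches_consumed=2, start_worker=3)

# Filling the gaps from the resumed state yields an entry for all 4 workers.
complete = merge_with_resumed(resumed, partial)
```

Choosing `start_worker=3` is arbitrary; it just reproduces the same pair of worker IDs (3 and 0) that appears in the failing assertion.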

Test plan: Wrote a new unit test to capture this, which fails with the error below before the fix:

        worker_states = [None] * self._num_workers
        if next_iter_state is not None:
            assert (
                self._SNAPSHOT in next_iter_state
            ), f"State doesn't contain key '{self._SNAPSHOT}' expected for multiprocess dataloader"
            wstates = next_iter_state[self._SNAPSHOT].get(self._WORKER_SNAPSHOTS, {})
>           assert set(range(len(wstates))) == set(wstates.keys()), (len(wstates), wstates.keys())
E           AssertionError: (2, dict_keys([3, 0]))
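The failing check can be reproduced in isolation. The sketch below uses placeholder snapshot values (an assumption; real worker snapshots are richer objects) and only mirrors the set comparison from the assertion above. It also assumes num_workers = 4, as in the description.

```python
# Placeholder worker snapshots keyed by worker ID, matching dict_keys([3, 0]).
wstates = {3: "state-w3", 0: "state-w0"}

# The assertion expects the keys to be exactly 0..len(wstates)-1.
expected_ids = set(range(len(wstates)))   # {0, 1}
actual_ids = set(wstates.keys())          # {0, 3}
check_passes = expected_ids == actual_ids # False, hence the AssertionError

# With num_workers = 4 (assumed from the description), these workers have no snapshot.
missing = set(range(4)) - actual_ids      # {1, 2}
```

Note that the check compares against `len(wstates)` rather than the worker count, so a partial dict fails because its keys are not contiguous from 0, not because its size is wrong.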

Changes

andrewkho commented 4 months ago

The CI did run; it's visible on the checks page. Not sure if we can fix this: the facebook-github-bot "CLA Signed" message seems to kick off a new set of actions, which shows the tests as skipped.

As for the tests themselves, macOS is still failing during post-test cleanup, so I will ignore that for now, as we have an open issue: https://github.com/actions/setup-python/issues/857

facebook-github-bot commented 4 months ago

@andrewkho has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot commented 4 months ago

@andrewkho merged this pull request in pytorch/data@1417368c9ff946849e33ab45600b0fe692536464.