andrewkho closed this 4 months ago
The CI did run; it's visible on the checks page. Not sure if we can fix this: the facebook-github-bot "CLA Signed" message seems to kick off a new set of actions, which shows the tests as skipped.
For the tests themselves, macOS is still failing during post-test cleanup, so I will ignore that for now since we have an open issue: https://github.com/actions/setup-python/issues/857
@andrewkho has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@andrewkho merged this pull request in pytorch/data@1417368c9ff946849e33ab45600b0fe692536464.
StatefulDataLoader may return incomplete checkpoints if a state_dict is requested soon after a resume.
Consider a multiprocess dataloader with num_workers = 4. When we resume this dataloader and make it through 4 batches before the next state_dict is requested, we have fresh snapshots for all 4 workers again. However, if we only make it through, say, 2 batches, we are missing state for the other 2 workers.
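To make the failure mode concrete, here is a minimal toy model of the bookkeeping described above. It is not torchdata code: `ToyLoader`, `resume_state`, and `seed_from_resume` are hypothetical names, and the round-robin worker order and per-batch snapshots are simplifying assumptions. The `seed_from_resume=True` path shows one possible remedy (falling back to the snapshots we resumed from), which may differ from what this PR actually does.

```python
class ToyLoader:
    """Toy model of per-worker snapshot tracking (hypothetical, not torchdata)."""

    def __init__(self, num_workers, resume_state=None, seed_from_resume=False):
        self.num_workers = num_workers
        self.step_count = 0
        # Number of batches each worker has produced so far.
        self.counters = {w: 0 for w in range(num_workers)}
        # Latest snapshot received from each worker since (re)start.
        self.snapshots = {}
        if resume_state is not None:
            self.step_count = resume_state["step_count"]
            self.counters.update(resume_state["workers"])
            if seed_from_resume:
                # Assumed remedy: start from the snapshots we resumed with.
                self.snapshots = dict(resume_state["workers"])

    def step(self):
        # Assumption: workers are served round-robin, and each batch
        # carries a fresh snapshot from the worker that produced it.
        worker = self.step_count % self.num_workers
        self.counters[worker] += 1
        self.snapshots[worker] = self.counters[worker]
        self.step_count += 1

    def state_dict(self):
        return {"step_count": self.step_count, "workers": dict(self.snapshots)}


# Run 6 steps with 4 workers, then checkpoint: all 4 workers have snapshots.
first = ToyLoader(num_workers=4)
for _ in range(6):
    first.step()
checkpoint = first.state_dict()

# Resume and take only 2 steps (< num_workers) before checkpointing again.
buggy = ToyLoader(num_workers=4, resume_state=checkpoint)
fixed = ToyLoader(num_workers=4, resume_state=checkpoint, seed_from_resume=True)
for _ in range(2):
    buggy.step()
    fixed.step()

buggy_state = buggy.state_dict()
fixed_state = fixed.state_dict()
print(sorted(buggy_state["workers"]))  # only the workers that ran after resume
print(sorted(fixed_state["workers"]))  # every worker is represented
```

In this toy model the unseeded loader only reports state for the two workers that produced a batch after the resume, while the seeded loader still reports all four.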
Test plan: Wrote a new unit test to capture this, which fails with the error below before the fix:
Changes
Fixes an edge-case bug where StatefulDataLoader.state_dict() may return an incomplete state after a resume if it is called before num_workers steps have completed.