pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

libtorch: `DistributedRandomSampler` uses the same random order in every epoch #73141

Open alexrenz opened 2 years ago

alexrenz commented 2 years ago

🐛 Describe the bug

Hi everyone,

I am trying to use DistributedRandomSampler in the C++ frontend. However, it is not behaving as I would expect: it uses the same random order of examples in every epoch (despite set_epoch()). The Python equivalent of the C++ code does what I expect. I initially reported this in the PyTorch forums. @ptrblck recommended I open an issue here.

Example:

#include <torch/torch.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  int N = 8;
  auto inputs = torch::arange(N).view({N, 1});
  auto dataset = torch::data::datasets::TensorDataset({inputs});

  // this works as expected: new random order in every epoch
  // torch::data::samplers::RandomSampler sampler (dataset.size().value());

  // this does not: same random order in every epoch
  torch::data::samplers::DistributedRandomSampler sampler (dataset.size().value(), /*num_replicas=*/2, /*rank=*/0);

  auto loader = torch::data::make_data_loader(
                  dataset,
                  sampler,
                  torch::data::DataLoaderOptions().batch_size(2));

  for (unsigned int epoch=0; epoch!=3; ++epoch) {
    std::cout << "====== epoch " << epoch << "\n";

    sampler.set_epoch(epoch);
    // sampler.reset(); // also tried to manually reset the sampler here, but this did not help

    unsigned long batch_idx = 0;
    for (auto& batch : *loader) {
      std::cout << "batch " << batch_idx << ": ";
      for (auto& example : batch) {
        std::cout << example.data[0].item<float>() << " ";
      }
      std::cout << "\n";
      ++batch_idx;
    }
  }

  return 0;
}

This produces batches with the same random order in every epoch:

====== epoch 0
batch 0: 0 2 
batch 1: 1 5 
====== epoch 1
batch 0: 0 2 
batch 1: 1 5 
====== epoch 2
batch 0: 0 2 
batch 1: 1 5 

I am trying to get a different order in every epoch, i.e., I would expect the output to look something like this:

====== epoch 0
batch 0: 4 7
batch 1: 2 1
====== epoch 1
batch 0: 5 2
batch 1: 7 1
====== epoch 2
batch 0: 0 7
batch 1: 6 1

The following Python code produces the output that I expect:

import torch

N = 8
inputs = torch.arange(N, dtype=torch.float32).view(N, 1)
dataset = torch.utils.data.TensorDataset(inputs)

sampler = torch.utils.data.DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = torch.utils.data.DataLoader(dataset, batch_size=2, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)
    print("====== epoch " + str(epoch))
    for batch_idx, (input,) in enumerate(loader):
        print("batch " + str(batch_idx) + ": " + " ".join([str(int(x)) for x in input.squeeze().tolist()]))

Versions

See the Python environment output below. The C++ output above stems from CPU-only libtorch 1.10.0, downloaded from https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.10.0%2Bcpu.zip.

Collecting environment information...
PyTorch version: 1.10.2+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 26 2021, 20:14:08)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-100-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.6.55
GPU models and configuration: GPU 0: NVIDIA GeForce 940MX
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudnn.so.6.0.21
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] botorch==0.3.1
[pip3] deepctr-torch==0.2.7
[pip3] gpytorch==1.2.0
[pip3] numpy==1.19.5
[pip3] torch==1.10.2
[pip3] torchaudio==0.10.2
[pip3] torchvision==0.11.3
[pip3] torchviz==0.0.1
[conda] Could not collect

Cheers Alexander

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @VitalyFedyunin @ejguan @NivekT

ejguan commented 2 years ago

Have you tried running set_epoch followed by reset? https://github.com/pytorch/pytorch/blob/c371542efc31b1abfe6f388042aa3ab0cef935f2/torch/csrc/api/src/data/samplers/distributed.cpp#L48-L50
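
That is, something like the following at the start of each epoch (only the suggested call order; nothing else changed):

sampler.set_epoch(epoch);
sampler.reset();  // reset after updating the epoch, before iterating the loader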

alexrenz commented 2 years ago

Yes, I have tried that. (I have tried resetting both before and after set_epoch)

ejguan commented 2 years ago

Will take a look. It seems weird.

alexrenz commented 2 years ago

This kept bugging me, so I also had a look, and I think I found the problem: make_data_loader (https://github.com/pytorch/pytorch/blob/4cb534f92ef6f5b2ec99109b0329f93a859ae831/torch/csrc/api/include/torch/data/dataloader.h#L28) and the StatelessDataLoader constructor (https://github.com/pytorch/pytorch/blob/b2e79ed5ecabcf4be299dc2ed085223ab5c22fd7/torch/csrc/api/include/torch/data/dataloader/stateless.h#L43) move the passed sampler into the data loader. After that, the sampler object in the application code has no connection to the sampler that the data loader holds (and is left in an unspecified, moved-from state). Calls to sampler.set_epoch(), sampler.reset(), and sampler.epoch() go to the moved-from object. The data loader, in contrast, calls reset on the moved sampler when it starts a new epoch, so it only ever sees the initial value of epoch_.
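
To illustrate the effect with a standalone sketch (ToySampler and ToyLoader are made-up stand-ins, not the actual libtorch classes):

#include <cstddef>
#include <iostream>
#include <utility>

// Made-up stand-ins for the real sampler and data loader types, only to show
// the move-semantics effect described above.
struct ToySampler {
  size_t epoch_ = 0;
  void set_epoch(size_t e) { epoch_ = e; }
};

struct ToyLoader {
  ToySampler sampler_;
  // Mirrors the libtorch pattern: the sampler is moved into the loader.
  explicit ToyLoader(ToySampler sampler) : sampler_(std::move(sampler)) {}
};

int main() {
  ToySampler sampler;
  ToyLoader loader(std::move(sampler));         // loader now holds its own sampler
  sampler.set_epoch(5);                         // only updates the moved-from object
  std::cout << loader.sampler_.epoch_ << "\n";  // prints 0, not 5
  return 0;
}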

To test this, I tried a quick-and-dirty change in my libtorch header files: passing the sampler into the data loader by reference rather than moving it (see https://github.com/alexrenz/pytorch/commit/c5d204ce3e1465eea838c218611df5153e0b8cbd). This seems to solve the problem for me:

====== epoch 0
batch 0: 0 2 
batch 1: 1 5 
====== epoch 1
batch 0: 1 0 
batch 1: 2 5 
====== epoch 2
batch 0: 5 0 
batch 1: 3 1 

ejguan commented 2 years ago

Thanks for digging it out. I think the main problem is here: https://github.com/pytorch/pytorch/blob/4cb534f92ef6f5b2ec99109b0329f93a859ae831/torch/csrc/api/include/torch/data/dataloader/stateless.h#L40 We don't provide a move constructor.

ejguan commented 2 years ago

We are willing to accept a PR to fix that.

alexrenz commented 2 years ago

Thanks for digging it out. I think the main problem is here:

https://github.com/pytorch/pytorch/blob/4cb534f92ef6f5b2ec99109b0329f93a859ae831/torch/csrc/api/include/torch/data/dataloader/stateless.h#L40

We don't provide a move constructor.

I am not sure I follow. Do you mean a move constructor for the sampler or for the data loader? And how would either solve the problem that method calls on the sampler object held by the application do not affect the state of the sampler held by the data loader?

ejguan commented 2 years ago

My bad for saying "move constructor" (that was incorrect). I mean the issue is that the DataLoader makes a copy of the Sampler during construction: an rvalue is passed to the DataLoader, but the sampler parameter is taken by value, so the loader ends up with its own copy (an lvalue). We have two solutions for it:

I like the second way because it's safer.
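
For illustration, the second option could look roughly like this (a simplified sketch with a hypothetical set_epoch forwarding method on the loader, not an existing libtorch API):

#include <cstddef>
#include <utility>

// Simplified sketch, not the real StatelessDataLoader: the loader still owns
// its own sampler copy, but exposes set_epoch and forwards it to that copy,
// so users never need to touch their (moved-from) sampler object.
template <typename Dataset, typename Sampler>
class LoaderSketch {
 public:
  LoaderSketch(Dataset dataset, Sampler sampler)
      : dataset_(std::move(dataset)), sampler_(std::move(sampler)) {}

  void set_epoch(size_t epoch) { sampler_.set_epoch(epoch); }  // forward to owned sampler

 private:
  Dataset dataset_;
  Sampler sampler_;
};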

alexrenz commented 2 years ago

I agree that the second option seems safer. However, if we change this in libtorch only, I understand there would be API differences between PyTorch and C++: in PyTorch, users would call set_epoch on the sampler object; in libtorch, they would call set_epoch on the data loader object. This seems undesirable. Do you suggest changing this in the Python API, too?

ejguan commented 2 years ago

Thanks for pointing that out. I don't think we want to change the Python API. If that's the case, we have to go with the first solution, IIUC. But we need to be careful about whether it breaks backward compatibility.

alexrenz commented 2 years ago

Then we are on the same page here.

I have one question about the implementation: you can create a data loader with make_data_loader(dataset, sampler, options) or with make_data_loader(dataset, options). In the second case, the sampler is created on the fly (see https://github.com/alexrenz/pytorch/blob/c5d204ce3e1465eea838c218611df5153e0b8cbd/torch/csrc/api/include/torch/data/dataloader.h#L48), and it makes sense that the data loader holds the newly created sampler object. In other words, there is one case where the data loader should hold a reference to a sampler and one where it should store the actual sampler object. Do you know an elegant way to implement this?

ejguan commented 2 years ago

I think providing constructors that accept either an rvalue reference or an lvalue reference would help.

alexrenz commented 2 years ago

Yep, the constructor side is pretty clear. I am wondering about an elegant way to handle the storage side: having both a Sampler& sampler and a Sampler sampler_owned member does not seem elegant to me, but it might be necessary. That's why I am asking whether you know of a more elegant way.

ejguan commented 2 years ago

Hmm, I see. It seems we are encountering the same problem as with tensors: we want the Sampler object in C++ to behave like a pointer (reference), as it does in Python. I don't know if there is a better way to do so. We need to own for an rvalue but borrow for an lvalue. cc: @VitalyFedyunin
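
One possible pattern for that (just a sketch under those assumptions, not a proposed libtorch change): own the sampler when it arrives as an rvalue, borrow it when it arrives as an lvalue, and access it through a single pointer either way:

#include <optional>
#include <utility>

// Sketch of an own-or-borrow holder: the rvalue constructor takes ownership of
// a moved-in sampler, the lvalue constructor only keeps a pointer to the
// caller's sampler. The data loader would access the sampler via get().
template <typename Sampler>
class SamplerHolder {
 public:
  explicit SamplerHolder(Sampler&& sampler)  // rvalue: own a moved-in copy
      : owned_(std::move(sampler)), ptr_(&*owned_) {}
  explicit SamplerHolder(Sampler& sampler)   // lvalue: borrow the caller's object
      : ptr_(&sampler) {}

  Sampler& get() { return *ptr_; }           // uniform access in both cases

 private:
  std::optional<Sampler> owned_;  // engaged only in the owning case
  Sampler* ptr_;                  // points at owned_ or at the borrowed sampler
};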

ralfriegel5 commented 2 years ago

I followed this discussion because it seems to be closely related to saving and loading the RandomSampler state during training.

RandomSampler::load / RandomSampler::save cannot be accessed correctly, and I suppose this is because of this issue. std::move did not help.

torch::save() and torch::load() are not usable due to missing << and >> operators.

Thus, resuming neural network training that uses a RandomSampler is not reproducible in C++ at the moment.