pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

libtorch: `DistributedRandomSampler` uses the same random order in every epoch #73141

Open alexrenz opened 2 years ago

alexrenz commented 2 years ago

🐛 Describe the bug

Hi everyone,

I am trying to use DistributedRandomSampler in the C++ frontend. However, it is not behaving as I would expect: it uses the same random order of examples in every epoch (despite set_epoch()). The Python equivalent of the C++ code does what I expect. I initially reported this in the PyTorch forums. @ptrblck recommended I open an issue here.

Example:

#include <torch/torch.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  int N = 8;
  auto inputs = torch::arange(N).view({N, 1});
  auto dataset = torch::data::datasets::TensorDataset({inputs});

  // this works as expected: new random order in every epoch
  // torch::data::samplers::RandomSampler sampler (dataset.size().value());

  // this does not: same random order in every epoch
  torch::data::samplers::DistributedRandomSampler sampler (dataset.size().value(), /*num_replicas=*/2, /*rank=*/0);

  auto loader = torch::data::make_data_loader(
                  dataset,
                  sampler,
                  torch::data::DataLoaderOptions().batch_size(2));

  for (unsigned int epoch=0; epoch!=3; ++epoch) {
    std::cout << "====== epoch " << epoch << "\n";

    sampler.set_epoch(epoch);
    // sampler.reset(); // also tried to manually reset the sampler here, but this did not help

    unsigned long batch_idx = 0;
    for (auto& batch : *loader) {
      std::cout << "batch " << batch_idx << ": ";
      for (auto& example : batch) {
        std::cout << example.data[0].item<float>() << " ";
      }
      std::cout << "\n";
      ++batch_idx;
    }
  }

  return 0;
}

This produces batches with the same random order in every epoch:

====== epoch 0
batch 0: 0 2 
batch 1: 1 5 
====== epoch 1
batch 0: 0 2 
batch 1: 1 5 
====== epoch 2
batch 0: 0 2 
batch 1: 1 5 

I am trying to get a different order in every epoch, i.e., I would expect the output to look something like this:

====== epoch 0
batch 0: 4 7
batch 1: 2 1
====== epoch 1
batch 0: 5 2
batch 1: 7 1
====== epoch 2
batch 0: 0 7
batch 1: 6 1

The following Python code produces the output that I expect:

import torch

N = 8
inputs = torch.arange(N, dtype=torch.float32).view(N, 1)
dataset = torch.utils.data.TensorDataset(inputs)

sampler = torch.utils.data.DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = torch.utils.data.DataLoader(dataset, batch_size=2, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)
    print("====== epoch " + str(epoch))
    for batch_idx, (input,) in enumerate(loader):
        print("batch " + str(batch_idx) + ": " + " ".join([str(int(x)) for x in input.squeeze().tolist()]))

Versions

See the Python environment output below. The C++ output above stems from CPU-only libtorch 1.10.0, downloaded from https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.10.0%2Bcpu.zip.

Collecting environment information...
PyTorch version: 1.10.2+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 26 2021, 20:14:08)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-100-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.6.55
GPU models and configuration: GPU 0: NVIDIA GeForce 940MX
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudnn.so.6.0.21
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] botorch==0.3.1
[pip3] deepctr-torch==0.2.7
[pip3] gpytorch==1.2.0
[pip3] numpy==1.19.5
[pip3] torch==1.10.2
[pip3] torchaudio==0.10.2
[pip3] torchvision==0.11.3
[pip3] torchviz==0.0.1
[conda] Could not collect

Cheers Alexander

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @VitalyFedyunin @ejguan @NivekT

ejguan commented 2 years ago

Have you tried running set_epoch followed by reset? https://github.com/pytorch/pytorch/blob/c371542efc31b1abfe6f388042aa3ab0cef935f2/torch/csrc/api/src/data/samplers/distributed.cpp#L48-L50
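
That is, something like the following at the start of each epoch (only the suggested call order; nothing else changed):

sampler.set_epoch(epoch);
sampler.reset();  // reset after updating the epoch, before iterating the loader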

alexrenz commented 2 years ago

Yes, I have tried that. (I have tried resetting both before and after set_epoch)

ejguan commented 2 years ago

Will take a look. It seems weird.

alexrenz commented 2 years ago

This kept bugging me, so I also had a look, and I think I found the problem: make_data_loader (https://github.com/pytorch/pytorch/blob/4cb534f92ef6f5b2ec99109b0329f93a859ae831/torch/csrc/api/include/torch/data/dataloader.h#L28) and the StatelessDataLoader constructor (https://github.com/pytorch/pytorch/blob/b2e79ed5ecabcf4be299dc2ed085223ab5c22fd7/torch/csrc/api/include/torch/data/dataloader/stateless.h#L43) move the passed sampler into the data loader. After that, the sampler object in the application code has no connection to the sampler that the data loader holds (and is left in an unspecified, moved-from state). Calls to sampler.set_epoch(), sampler.reset(), and sampler.epoch() go to the moved-from object. The data loader, in contrast, calls reset on the moved sampler when it starts a new epoch, so it only ever sees the initial value of epoch_.
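
To illustrate the effect with a standalone sketch (ToySampler and ToyLoader are made-up stand-ins, not the actual libtorch classes):

#include <cstddef>
#include <iostream>
#include <utility>

// Made-up stand-ins for the real sampler and data loader types, only to show
// the move-semantics effect described above.
struct ToySampler {
  size_t epoch_ = 0;
  void set_epoch(size_t e) { epoch_ = e; }
};

struct ToyLoader {
  ToySampler sampler_;
  // Mirrors the libtorch pattern: the sampler is moved into the loader.
  explicit ToyLoader(ToySampler sampler) : sampler_(std::move(sampler)) {}
};

int main() {
  ToySampler sampler;
  ToyLoader loader(std::move(sampler));         // loader now holds its own sampler
  sampler.set_epoch(5);                         // only updates the moved-from object
  std::cout << loader.sampler_.epoch_ << "\n";  // prints 0, not 5
  return 0;
}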

To test this, I tried a quick-and-dirty change in my libtorch header files: passing the sampler into the data loader by reference rather than moving it (see https://github.com/alexrenz/pytorch/commit/c5d204ce3e1465eea838c218611df5153e0b8cbd). This seems to solve the problem for me:

====== epoch 0
batch 0: 0 2 
batch 1: 1 5 
====== epoch 1
batch 0: 1 0 
batch 1: 2 5 
====== epoch 2
batch 0: 5 0 
batch 1: 3 1 

ejguan commented 2 years ago

Thanks for digging it out. I think the main problem is here: https://github.com/pytorch/pytorch/blob/4cb534f92ef6f5b2ec99109b0329f93a859ae831/torch/csrc/api/include/torch/data/dataloader/stateless.h#L40 We don't provide a move constructor.

ejguan commented 2 years ago

We are willing to accept a PR to fix that.

alexrenz commented 2 years ago

Thanks for digging it out. I think the main problem is here:

https://github.com/pytorch/pytorch/blob/4cb534f92ef6f5b2ec99109b0329f93a859ae831/torch/csrc/api/include/torch/data/dataloader/stateless.h#L40

We don't provide a move constructor.

I am not sure I follow. Do you mean a move constructor for the sampler or for the data loader? And how would either solve the problem that method calls on the sampler object held by the application do not affect the state of the sampler held by the data loader?

ejguan commented 2 years ago

My bad for saying "move constructor" (that was incorrect). I mean the issue is that the DataLoader makes a copy of the Sampler during construction: an rvalue is passed to the DataLoader, but the sampler parameter is taken by value, so the loader ends up with its own copy (an lvalue). We have two solutions for it:

I like the second way because it's safer.
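
For illustration, the second option could look roughly like this (a simplified sketch with a hypothetical set_epoch forwarding method on the loader, not an existing libtorch API):

#include <cstddef>
#include <utility>

// Simplified sketch, not the real StatelessDataLoader: the loader still owns
// its own sampler copy, but exposes set_epoch and forwards it to that copy,
// so users never need to touch their (moved-from) sampler object.
template <typename Dataset, typename Sampler>
class LoaderSketch {
 public:
  LoaderSketch(Dataset dataset, Sampler sampler)
      : dataset_(std::move(dataset)), sampler_(std::move(sampler)) {}

  void set_epoch(size_t epoch) { sampler_.set_epoch(epoch); }  // forward to owned sampler

 private:
  Dataset dataset_;
  Sampler sampler_;
};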

alexrenz commented 2 years ago

I agree that the second option seems safer. However, if we change this in libtorch only, I understand there would be API differences between PyTorch and C++: in PyTorch, users would call set_epoch on the sampler object; in libtorch, they would call set_epoch on the data loader object. This seems undesirable. Do you suggest changing this in the Python API, too?

ejguan commented 2 years ago

Thanks for pointing that out. I don't think we want to change the Python API. If that's the case, we have to go with the first solution, IIUC. But we need to be careful about whether it breaks backward compatibility.

alexrenz commented 2 years ago

Then we are on the same page here.

I have one question about the implementation: you can create a data loader with make_data_loader(dataset, sampler, options) or with make_data_loader(dataset, options). In the second case, the sampler is created on the fly (see https://github.com/alexrenz/pytorch/blob/c5d204ce3e1465eea838c218611df5153e0b8cbd/torch/csrc/api/include/torch/data/dataloader.h#L48), and it makes sense that the data loader holds the newly created sampler object. In other words, there is one case where the data loader should hold a reference to a sampler and one where it should store the actual sampler object. Do you know an elegant way to implement this?

ejguan commented 2 years ago

I think providing constructors that accept either an rvalue reference or an lvalue reference would help.

alexrenz commented 2 years ago

Yep, the constructor side is pretty clear. I am wondering about an elegant way to handle the storage side: having both a Sampler& sampler and a Sampler sampler_owned member does not seem elegant to me, but it might be necessary. That's why I am asking whether you know of a more elegant way.

ejguan commented 2 years ago

Hmm, I see. It seems we are encountering the same problem as with tensors: we want the Sampler object in C++ to behave like a pointer (reference), as it does in Python. I don't know if there is a better way to do so. We need to own for an rvalue but borrow for an lvalue. cc: @VitalyFedyunin
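
One possible pattern for that (just a sketch under those assumptions, not a proposed libtorch change): own the sampler when it arrives as an rvalue, borrow it when it arrives as an lvalue, and access it through a single pointer either way:

#include <optional>
#include <utility>

// Sketch of an own-or-borrow holder: the rvalue constructor takes ownership of
// a moved-in sampler, the lvalue constructor only keeps a pointer to the
// caller's sampler. The data loader would access the sampler via get().
template <typename Sampler>
class SamplerHolder {
 public:
  explicit SamplerHolder(Sampler&& sampler)  // rvalue: own a moved-in copy
      : owned_(std::move(sampler)), ptr_(&*owned_) {}
  explicit SamplerHolder(Sampler& sampler)   // lvalue: borrow the caller's object
      : ptr_(&sampler) {}

  Sampler& get() { return *ptr_; }           // uniform access in both cases

 private:
  std::optional<Sampler> owned_;  // engaged only in the owning case
  Sampler* ptr_;                  // points at owned_ or at the borrowed sampler
};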

ralfriegel5 commented 2 years ago

I followed this discussion because it seems to be closely related to saving and loading the RandomSampler state during training.

RandomSampler::load / RandomSampler::save cannot be accessed correctly, and I suppose this is because of this issue. std::move did not help.

torch::save() and torch::load() are not usable due to missing << and >> operators.

Thus, resuming neural network training that uses a RandomSampler is not reproducible in C++ at the moment.