alexrenz opened this issue 2 years ago
Have you tried to run both `set_epoch` and then `reset`?
https://github.com/pytorch/pytorch/blob/c371542efc31b1abfe6f388042aa3ab0cef935f2/torch/csrc/api/src/data/samplers/distributed.cpp#L48-L50
Yes, I have tried that. (I have tried resetting both before and after `set_epoch`.)
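Roughly, my per-epoch loop looks like this (a sketch with assumed names, not my exact code):

```cpp
#include <torch/torch.h>

// Sketch: bump the sampler's epoch and reset it before iterating, so that each
// epoch should use a different shuffle.
template <typename DataLoader>
void run_epochs(torch::data::samplers::DistributedRandomSampler& sampler,
                DataLoader& data_loader,
                size_t num_epochs) {
  for (size_t epoch = 0; epoch < num_epochs; ++epoch) {
    sampler.set_epoch(epoch);  // new shuffling seed for this epoch
    sampler.reset();           // re-shuffle the index permutation
    for (auto& batch : data_loader) {
      (void)batch;  // ... training step ...
    }
  }
}
```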
Will take a look. It seems weird.
This kept bugging me, so I also had a look, and I think I found the problem: `make_data_loader` (https://github.com/pytorch/pytorch/blob/4cb534f92ef6f5b2ec99109b0329f93a859ae831/torch/csrc/api/include/torch/data/dataloader.h#L28) and the `StatelessDataLoader` constructor (https://github.com/pytorch/pytorch/blob/b2e79ed5ecabcf4be299dc2ed085223ab5c22fd7/torch/csrc/api/include/torch/data/dataloader/stateless.h#L43) move the passed sampler into the data loader. After that, the sampler object in the application code has no connection to the sampler that the data loader holds (and is in an unspecified state). Calls to `sampler.set_epoch()`, `sampler.reset()`, and `sampler.epoch()` go to the moved-from object. The data loader, in contrast, calls `reset` on the moved object when it starts a new epoch, so it only ever sees the initial value of `epoch_`.
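To illustrate the disconnect, here is a focused sketch (assumed names and shapes, not code from this issue):

```cpp
#include <torch/torch.h>

// Sketch only: make_data_loader takes the sampler by value and moves it into the
// loader, so the caller's `sampler` and the loader's internal sampler end up
// being different objects.
template <typename Dataset>
void sketch(Dataset dataset) {
  auto sampler =
      torch::data::samplers::DistributedRandomSampler(dataset.size().value());
  auto loader = torch::data::make_data_loader(
      std::move(dataset), std::move(sampler),
      torch::data::DataLoaderOptions().batch_size(2));

  // These calls go to the (moved-from) local object and never reach the sampler
  // that `loader` iterates with; the loader calls reset() on its own sampler at
  // the start of each epoch and thus only ever sees the initial epoch_ value.
  sampler.set_epoch(1);
  sampler.reset();
}
```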
To test this, I tried a quick-and-dirty change in my libtorch header files: passing the sampler into the data loader by reference rather than moving it (see https://github.com/alexrenz/pytorch/commit/c5d204ce3e1465eea838c218611df5153e0b8cbd). This seems to solve the problem for me:
====== epoch 0
batch 0: 0 2
batch 1: 1 5
====== epoch 1
batch 0: 1 0
batch 1: 2 5
====== epoch 2
batch 0: 5 0
batch 1: 3 1
Thanks for digging it out. I think the main problem is here: https://github.com/pytorch/pytorch/blob/4cb534f92ef6f5b2ec99109b0329f93a859ae831/torch/csrc/api/include/torch/data/dataloader/stateless.h#L40 We don't have a move constructor provided.
We are willing to accept a PR to fix that.
> Thanks for digging it out. I think the main problem is here. We don't have a move constructor provided.
I am not sure I follow. Do you mean a move constructor for the sampler or the data loader? And how would either solve the problem that method calls on the sampler object held by the application do not affect the state of the sampler held by the data loader?
My bad for saying "move constructor" (that was incorrect). I mean the issue is that `DataLoader` makes a copy of the `Sampler` during construction: an rvalue may be provided to the `DataLoader`, but the by-value `Sampler` parameter converts it into a copy, i.e. an lvalue inside the loader.
We have two solutions for it:
1. Keep the `Sampler` inside `DataLoader` as a reference type. Do not do `std::move`.
2. Add an API on `DataLoader` to call `set_epoch` on the sampler it holds.

I like the second way because it's safer.
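For illustration, the second option could look roughly like this (hypothetical names, not an existing libtorch API):

```cpp
#include <cstddef>
#include <utility>

// Hypothetical sketch: the loader owns the sampler and forwards set_epoch to it,
// so user code never needs to keep its own handle to the sampler it passed in.
template <typename Sampler>
class LoaderSketch {
 public:
  explicit LoaderSketch(Sampler sampler) : sampler_(std::move(sampler)) {}

  // Forwarded to the sampler the loader actually iterates with.
  void set_epoch(size_t epoch) { sampler_.set_epoch(epoch); }

 private:
  Sampler sampler_;
};
```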
I agree that the second option seems safer. However, if we change this in libtorch only, I understand there would be an API difference between PyTorch and the C++ frontend: in PyTorch, users would call `set_epoch` on the sampler object; in libtorch, they would call `set_epoch` on the data loader object. This seems undesirable. Do you suggest changing this in the Python API, too?
Thanks for pointing it out. I don't think we want to change the Python API. If that's the case, we have to go with the first solution, IIUC. But we need to take care in case it breaks backward compatibility.
Then we are on the same page here.
I have one question about the implementation: you can create a data loader with `make_data_loader(dataset, sampler, options)` and with `make_data_loader(dataset, options)`. In the second case, the sampler is created on the fly (see https://github.com/alexrenz/pytorch/blob/c5d204ce3e1465eea838c218611df5153e0b8cbd/torch/csrc/api/include/torch/data/dataloader.h#L48), and it makes sense that the data loader holds the newly created sampler object. I.e., there is one case where the data loader should hold a reference to a sampler and one where it should store the actual sampler object. Do you know an elegant way to implement this?
I think providing constructors that accept an rvalue reference or an lvalue reference would help.
Yep, the constructor side is pretty clear. I am wondering about an elegant way to handle the storage side. Having both a `Sampler& sampler` and a `Sampler sampler_owned` does not seem elegant to me, but it might be necessary. That's why I am asking whether you know of an elegant way.
Emmm, I see. It seems we are encountering the same problem as with Tensor: we want to make the `Sampler` object in C++ behave like a pointer (reference), as in Python. I don't know if there is a better way to do it. We need to own for an rvalue but borrow for an lvalue.
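One possible shape for that (a sketch, not code from the PyTorch tree): keep a non-owning pointer that always refers to the sampler in use, plus owned storage that is only filled when we receive an rvalue.

```cpp
#include <memory>
#include <utility>

// Sketch of "own for rvalue, borrow for lvalue": `ptr_` always points at the
// live sampler; `owned_` is only populated when the holder owns it.
template <typename Sampler>
class SamplerHolder {
 public:
  explicit SamplerHolder(Sampler&& sampler)  // rvalue: take ownership
      : owned_(std::make_unique<Sampler>(std::move(sampler))),
        ptr_(owned_.get()) {}
  explicit SamplerHolder(Sampler& sampler)   // lvalue: borrow, caller keeps control
      : ptr_(&sampler) {}

  Sampler& get() { return *ptr_; }

 private:
  std::unique_ptr<Sampler> owned_;  // empty when borrowing
  Sampler* ptr_;                    // never null
};
```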
cc: @VitalyFedyunin
I followed this discussion as it seems to be closely related to saving / loading `RandomSampler` state while training.
`RandomSampler::load` / `RandomSampler::save` cannot be accessed correctly, and I suppose this is because of this issue; `std::move` did not help.
`torch::save()` and `torch::load()` are not usable, due to missing `<<` and `>>` operators.
Thus, continuing neural network training that uses `RandomSampler` is not reproducible in C++ at the moment.
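For reference, what I am trying to do looks roughly like this (a sketch with an assumed checkpoint path, not my actual code):

```cpp
#include <torch/torch.h>

// Sketch: persist and restore the sampler state via serialize archives, since
// torch::save()/torch::load() do not work for me here.
void checkpoint_sampler(const torch::data::samplers::RandomSampler& sampler) {
  torch::serialize::OutputArchive out;
  sampler.save(out);          // RandomSampler::save writes its shuffling state
  out.save_to("sampler.pt");  // assumed checkpoint path
}

void restore_sampler(torch::data::samplers::RandomSampler& sampler) {
  torch::serialize::InputArchive in;
  in.load_from("sampler.pt");
  sampler.load(in);           // restores the shuffling state into `sampler`
}
```

Note that, because of the move discussed above, this only helps if `sampler` is the object the data loader actually uses.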
🐛 Describe the bug
Hi everyone,
I am trying to use `DistributedRandomSampler` in the C++ frontend. However, it is not behaving as I would expect: it uses the same random order of examples in every epoch (despite `set_epoch()`). The Python equivalent of the C++ code does what I expect. I initially reported this in the PyTorch forums. @ptrblck recommended I open an issue here.

Example:
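A minimal sketch of the reproduction (assumed dataset, sizes, and names, not my exact code):

```cpp
#include <torch/torch.h>

#include <iostream>

// Sketch: a tiny integer dataset, a DistributedRandomSampler, and
// set_epoch()/reset() before every epoch.
struct Ints : torch::data::datasets::Dataset<Ints> {
  torch::data::Example<> get(size_t i) override {
    return {torch::tensor(static_cast<int64_t>(i)), torch::tensor(0)};
  }
  torch::optional<size_t> size() const override { return 6; }
};

int main() {
  auto sampler = torch::data::samplers::DistributedRandomSampler(6);
  auto loader = torch::data::make_data_loader(
      Ints{}, sampler, torch::data::DataLoaderOptions().batch_size(2));

  for (size_t epoch = 0; epoch != 3; ++epoch) {
    std::cout << "====== epoch " << epoch << std::endl;
    sampler.set_epoch(epoch);
    sampler.reset();
    size_t batch_index = 0;
    for (auto& batch : *loader) {
      std::cout << "batch " << batch_index++ << ":";
      for (auto& example : batch) {
        std::cout << " " << example.data.item<int64_t>();
      }
      std::cout << std::endl;
    }
  }
}
```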
This produces batches with the same random order in every epoch:
I am trying to have different orders in every epoch. I.e., I would expect the output to look like this:
The following Python code produces the output that I expect:
Versions
See the Python environment output below. The C++ output above stems from CPU-only libtorch 1.10.0, downloaded from https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.10.0%2Bcpu.zip.
Cheers, Alexander
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @VitalyFedyunin @ejguan @NivekT