pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License

Make MPRS reusable #1102

Open sehoffmann opened 1 year ago

sehoffmann commented 1 year ago

🚀 The feature

Sorry if this is already possible, but I didn't find anything about it in the documentation. Can the same MPRS, i.e. the same set of worker processes, be reused across multiple pipelines?

Motivation, pitch

Have the same set of worker processes for both train and eval datasets.

Alternatives

No response

Additional context

No response

ejguan commented 1 year ago

Yeah, the same RS can be reused multiple times, as DataLoader2 gets a copy of the RS during initialization.
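The copy semantics described above can be sketched in pure Python. Note that `ReadingService` and `DataLoader2Sketch` below are hypothetical stand-ins, not the actual torchdata classes; the point is only that each loader deep-copies the RS it is given, so each copy would end up with its own worker pool:

```python
import copy

class ReadingService:
    """Hypothetical stand-in for a torchdata reading service."""
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.workers = None  # populated when a loader initializes its copy

class DataLoader2Sketch:
    """Mimics the behavior described above: the loader takes a *copy*
    of the reading service, so state is never shared between loaders."""
    def __init__(self, reading_service):
        self.reading_service = copy.deepcopy(reading_service)
        # each copy would spawn its own worker pool on initialization
        self.reading_service.workers = [
            f"worker-{i}" for i in range(self.reading_service.num_workers)
        ]

rs = ReadingService(num_workers=2)
train = DataLoader2Sketch(rs)
evald = DataLoader2Sketch(rs)

assert train.reading_service is not evald.reading_service  # independent copies
assert rs.workers is None  # the original RS object is left untouched
```

This is why reusing one RS object is safe, but also why it does not share worker processes: every loader operates on its own copy.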

sehoffmann commented 1 year ago

Hey @ejguan , thanks a lot for your reply! But would that also reuse the same set of worker processes if the RS gets copied?

ejguan commented 1 year ago

But would that also reuse the same set of worker processes if the RS gets copied?

Sorry, this is not possible, and I don't know if it would be a good design. If you want to reuse worker processes, it means MPRS has to support changing the inner DataPipe graph during iteration, which makes it overly complicated.

Wondering why you want to share the same set of worker processes?

sehoffmann commented 1 year ago

In a usual DL training loop you alternate between iterating over the training and evaluation datasets, so you would not normally iterate them concurrently. It always bugged me to have two sets of processes for this when one would be totally sufficient.

The API design choice to have separate RS instances that can be passed to potentially multiple DataLoader2 instances seemed like it explicitly addressed this ownership problem (the RS owns the worker processes; the dataloader/pipe consumes and reuses them).

It's not something super important and I can completely understand your reasoning.

I believe implementation-wise this isn't too complicated though (but not trivial either ofc):

Alternatively, have one main queue per worker process and then, on the worker-process side, delegate the messages to the actual pipes. Essentially the good ol' reactor pattern.
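The reactor idea above can be sketched roughly like this. This is a minimal illustration with a thread standing in for a worker process (the real MPRS uses processes), and all names here (`ReactorWorker`, `register`, `request`) are hypothetical, not torchdata API:

```python
import queue
import threading

class ReactorWorker:
    """One worker with a single inbox queue; each message is tagged with a
    pipe id, and the worker dispatches it to the matching registered pipe
    (the reactor pattern)."""

    def __init__(self):
        self.inbox = queue.Queue()
        self.results = queue.Queue()
        self.pipes = {}  # pipe_id -> handler standing in for a DataPipe
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()

    def register(self, pipe_id, handler):
        self.pipes[pipe_id] = handler

    def _loop(self):
        while True:
            msg = self.inbox.get()
            if msg is None:  # shutdown sentinel
                break
            pipe_id, payload = msg
            self.results.put((pipe_id, self.pipes[pipe_id](payload)))

    def request(self, pipe_id, payload):
        self.inbox.put((pipe_id, payload))
        _, result = self.results.get()
        return result

    def shutdown(self):
        self.inbox.put(None)
        self.thread.join()

worker = ReactorWorker()
worker.register("train", lambda x: x * 2)   # stand-in for the train pipe
worker.register("eval", lambda x: x + 100)  # stand-in for the eval pipe

r1 = worker.request("train", 5)  # 10
r2 = worker.request("eval", 5)   # 105
worker.shutdown()
```

The point is that one long-lived worker (here, one thread) serves multiple pipelines, with the dispatch decided per-message rather than per-worker-pool.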

ejguan commented 1 year ago

In a usual DL training loop you would alternate between iterating over the training and evaluation dataset. So usually one would not iterate them concurrently. It always bugged me to have two sets of processes for this if one would be totally sufficient.

I understand we won't iterate them concurrently. When we enter the evaluation stage, the worker processes for training are not heavily executed (they just sleep until the request for a new epoch is sent). I wouldn't be worried about the perf.

I believe implementation-wise this isn't too complicated though (but not trivial either ofc):

I think there are more corner cases and caveats to be covered if we want to support it.

Due to limited bandwidth, if more users request the same feature, we might consider investigating it. Alternatively, you can always inherit from MPRS and extend it with this logic.

sehoffmann commented 1 year ago

Yes, this sounds reasonable!