pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License

Make MPRS reusable #1102

Open sehoffmann opened 1 year ago

sehoffmann commented 1 year ago

🚀 The feature

Sorry if this is already possible, but I didn't find anything about it in the documentation. Can the same MPRS, i.e. the same set of worker processes, be reused across multiple pipelines?

Motivation, pitch

Have the same set of worker processes for both train and eval datasets.

Alternatives

No response

Additional context

No response

ejguan commented 1 year ago

Yeah, the same RS can be reused multiple times, as DataLoader2 gets a copy of the RS during initialization.
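The copy semantics described above can be sketched in pure Python. Note that `ReadingService` and `DataLoader2Sketch` below are hypothetical stand-ins, not the actual torchdata classes; the point is only that each loader deep-copies the RS it is given, so each copy would end up with its own worker pool:

```python
import copy

class ReadingService:
    """Hypothetical stand-in for a torchdata reading service."""
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.workers = None  # populated when a loader initializes its copy

class DataLoader2Sketch:
    """Mimics the behavior described above: the loader takes a *copy*
    of the reading service, so state is never shared between loaders."""
    def __init__(self, reading_service):
        self.reading_service = copy.deepcopy(reading_service)
        # each copy would spawn its own worker pool on initialization
        self.reading_service.workers = [
            f"worker-{i}" for i in range(self.reading_service.num_workers)
        ]

rs = ReadingService(num_workers=2)
train = DataLoader2Sketch(rs)
evald = DataLoader2Sketch(rs)

assert train.reading_service is not evald.reading_service  # independent copies
assert rs.workers is None  # the original RS object is left untouched
```

This is why reusing one RS object is safe, but also why it does not share worker processes: every loader operates on its own copy.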

sehoffmann commented 1 year ago

Hey @ejguan , thanks a lot for your reply! But would that also reuse the same set of worker processes if the RS gets copied?

ejguan commented 1 year ago

But would that also reuse the same set of worker processes if the RS gets copied?

Sorry, this is not possible, and I don't know if it would be a good design. If you want to reuse worker processes, it means MPRS has to support changing the inner DataPipe graph during iteration, which makes it overly complicated.

Wondering why you want to share the same set of worker processes?

sehoffmann commented 1 year ago

In a usual DL training loop you alternate between iterating over the training and evaluation datasets, so you would not normally iterate them concurrently. It always bugged me to have two sets of processes for this when one would be totally sufficient.

The API design choice to have separate RS instances that can be passed to potentially multiple DataLoader2 instances seemed like it explicitly addressed this ownership problem (the RS owns the worker processes; the dataloader/pipe consumes and reuses them).

It's not something super important and I can completely understand your reasoning.

I believe implementation-wise this isn't too complicated though (but not trivial either ofc):

Alternatively, have one main queue per worker process and then, on the worker-process side, delegate the messages to the actual pipes. Essentially the good ol' reactor pattern.
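The reactor idea above can be sketched roughly like this. This is a minimal illustration with a thread standing in for a worker process (the real MPRS uses processes), and all names here (`ReactorWorker`, `register`, `request`) are hypothetical, not torchdata API:

```python
import queue
import threading

class ReactorWorker:
    """One worker with a single inbox queue; each message is tagged with a
    pipe id, and the worker dispatches it to the matching registered pipe
    (the reactor pattern)."""

    def __init__(self):
        self.inbox = queue.Queue()
        self.results = queue.Queue()
        self.pipes = {}  # pipe_id -> handler standing in for a DataPipe
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()

    def register(self, pipe_id, handler):
        self.pipes[pipe_id] = handler

    def _loop(self):
        while True:
            msg = self.inbox.get()
            if msg is None:  # shutdown sentinel
                break
            pipe_id, payload = msg
            self.results.put((pipe_id, self.pipes[pipe_id](payload)))

    def request(self, pipe_id, payload):
        self.inbox.put((pipe_id, payload))
        _, result = self.results.get()
        return result

    def shutdown(self):
        self.inbox.put(None)
        self.thread.join()

worker = ReactorWorker()
worker.register("train", lambda x: x * 2)   # stand-in for the train pipe
worker.register("eval", lambda x: x + 100)  # stand-in for the eval pipe

r1 = worker.request("train", 5)  # 10
r2 = worker.request("eval", 5)   # 105
worker.shutdown()
```

The point is that one long-lived worker (here, one thread) serves multiple pipelines, with the dispatch decided per-message rather than per-worker-pool.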

ejguan commented 1 year ago

In a usual DL training loop you would alternate between iterating over the training and evaluation dataset. So usually one would not iterate them concurrently. It always bugged me to have two sets of processes for this if one would be totally sufficient.

I understand we won't iterate them concurrently. When we enter the evaluation stage, the worker processes for training are not heavily executed (they just sleep until the request for a new epoch is sent). I wouldn't be worried about the perf.

I believe implementation-wise this isn't too complicated though (but not trivial either ofc):

I think there are more corner cases and caveats to be covered if we want to support it.

Due to limited bandwidth, if more users request the same feature, we might consider investigating it. Alternatively, you can always inherit from MPRS and extend it with this logic.

sehoffmann commented 1 year ago

Yes, this sounds reasonable!