sehoffmann opened this issue 1 year ago
Here are a few things in my mind to help users easily find this problem:

- `weakref` to wrap a `DataPipe` to prevent it from being presented in the graph.
- `to_graph` only visualizes the graph in the main process. We need to add a utility function to visualize the whole graph, including the `DataPipe`s in worker processes, when using DLv2 with MPRS (see the sketch below).
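For context, a minimal sketch of how the main-process graph can already be inspected today, assuming the `traverse_dps` helper from torch; the proposed utility would extend this view to the worker and dispatching processes:

```python
from torch.utils.data.graph import traverse_dps
from torchdata.datapipes.iter import IterableWrapper

# traverse_dps returns a nested dict mapping ids to (datapipe, parents),
# but only for the graph as seen from the main process.
pipe = IterableWrapper(range(4)).map(lambda x: x + 1)
print(traverse_dps(pipe))
```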
🐛 Describe the bug
This took me extremely long to figure out and is a super sneaky bug:
This pipe wants to do the following: it uses `sharding_round_robin_dispatch`, which makes the pipe `non_replicable()` from there onwards. A minimal sketch of the two cases I compared, together with the output I observe, is given below.
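A minimal sketch of the setup, with illustrative names (`Process`, `build`); the only relevant difference between the two cases is whether `Process` keeps a reference to `src`:

```python
import os

from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES
from torchdata.datapipes.iter import IterableWrapper, IterDataPipe
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService


class Process(IterDataPipe):
    """Illustrative downstream pipe; `keep_ref` controls the sneaky attribute."""

    def __init__(self, dp, src, keep_ref):
        self.dp = dp
        if keep_ref:
            self.src = src  # merely storing this reference changes the graph

    def __iter__(self):
        for x in self.dp:
            yield x * 2, os.getpid()  # the pid shows which process does the work


def build(keep_ref):
    src = IterableWrapper(range(8))
    dp = src.sharding_round_robin_dispatch(SHARDING_PRIORITIES.MULTIPROCESSING)
    return Process(dp, src, keep_ref)


if __name__ == "__main__":
    for keep_ref in (False, True):
        rs = MultiProcessingReadingService(num_workers=2)
        dl = DataLoader2(build(keep_ref), reading_service=rs)
        print(f"keep_ref={keep_ref}:", list(dl))
        dl.shutdown()
```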
As you can see, the first case works as expected, but in the second case the pipeline is completely sequentialized. In fact, it fully runs in the dispatch process of the MPRS.
Now, after having finally figured out the source of this bug, I believe that I understand why it behaves as it does: keeping the reference to `src` causes the whole graph and pipeline to become non-replicable.

I'm aware of the technical limitations, i.e. that torchdata uses pickle and object attributes to figure out the graph and does in fact not take into consideration which part of it is actually used in `__iter__`. However, at least from my perspective, this is a very severe issue. The behavior is extremely unexpected and it's very easy to cause it unconsciously. As you can take from my description so far, it also took me considerable time to even figure out its source. Moreover, it's hard to even recognize in the first place; in my specific case I only noticed it because a serialization operation failed due to a tensor being on the GPU (an operation I do at the very end of my pipeline). Had I not activated and tested that particular part of the pipeline, this issue would likely have eluded me completely until some benchmark revealed vastly reduced performance.

To make it short: I believe this definitely needs addressing in one form or another. The least one has to do is to put a very big disclaimer into the docs. Of course, it would be even better if this issue did not occur in the first place, without any intervention from the user. Should this not be possible, we definitely need a wrapper class akin to a "WeakReference" that allows us to mark an `IterDataPipe` attribute of an object as "not part of the graph", i.e. something like the wrapper sketched below.
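A minimal sketch of what such a marker could look like; `NonGraphRef` and `MyPipe` are illustrative names rather than existing torchdata APIs, and the graph traversal would of course have to be taught to skip the wrapper:

```python
from typing import Generic, TypeVar

from torchdata.datapipes.iter import IterDataPipe

T = TypeVar("T", bound=IterDataPipe)


class NonGraphRef(Generic[T]):
    """Holds a DataPipe without making it part of the graph (hypothetical)."""

    def __init__(self, datapipe: T) -> None:
        self._datapipe = datapipe

    def get(self) -> T:
        return self._datapipe


class MyPipe(IterDataPipe):
    def __init__(self, dp: IterDataPipe, src: IterDataPipe) -> None:
        self.dp = dp                 # real parent, part of the graph
        self.src = NonGraphRef(src)  # bookkeeping only, not a graph edge

    def __iter__(self):
        yield from self.dp
```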
What I really find concerning is that this can easily go unnoticed. Due to this, I would even suggest considering adding a `get_parents()` function that must explicitly list which pipes are parents. The behavior implemented by torchdata could then look like this (see the sketch after this list):

- If a pipe defines a `get_parents()` function, use the output of the `get_parents()` function.
- If a pipe does not define a `get_parents()` function, throw an error.
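Roughly, a sketch of that rule; `find_parents` and `MyPipe` are illustrative names, not existing torchdata functions:

```python
from torchdata.datapipes.iter import IterDataPipe


def find_parents(pipe: IterDataPipe) -> list:
    """Illustrative traversal rule implementing the proposal above."""
    get_parents = getattr(pipe, "get_parents", None)
    if get_parents is None:
        raise TypeError(
            f"{type(pipe).__name__} does not define get_parents(); "
            "refusing to guess the graph from its pickled attributes"
        )
    return list(get_parents())


class MyPipe(IterDataPipe):
    def __init__(self, dp, src):
        self.dp = dp    # actual upstream pipe
        self.src = src  # bookkeeping only, must not become a graph edge

    def get_parents(self):
        return [self.dp]  # explicitly: only `dp` is a parent

    def __iter__(self):
        yield from self.dp
```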
Versions
https://github.com/pytorch/data/commit/e78ab6c9ec94f05f0a350ced7fe571f6863c20ec