Open · olegsinavski opened 9 months ago
We rely on pickle to help us traverse the graph because we don't know which data field in your custom `DataPipe` contains a dependent `DataPipe`. It definitely can be optimized by skipping data fields that are not any of the following types: `DataPipe`, `list`, `tuple`, `dict`, `set`.
Thank you for the response @ejguan. Why would serializing something allow you to figure out the structure? I don't think pickle is designed for this purpose. Would a regular Python recursive field introspection work instead?
In theory, we injected logic into pickle so that only traversal happens (no serialization actually takes place). We did have plans to optimize traversal performance, but unfortunately we are not actively supporting this project anymore.
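To make the "traversal via pickle" idea concrete, here is a stdlib-only sketch (not torchdata's actual implementation; `Node`, `direct_children`, and `TraversalPickler` are made-up names). A custom `pickle.Pickler` overrides `persistent_id`: returning a non-`None` id for nested nodes makes pickle record the reference and skip serializing that subtree, so nested nodes are *located* without being serialized. Fields of the root object that are not nodes, however, still get fully pickled, which is exactly where the reported slowness comes from.

```python
import io
import pickle

class Node:
    """Stand-in for a custom DataPipe: arbitrary fields, some of which
    may reference other Nodes."""
    def __init__(self, name, **fields):
        self.name = name
        self.__dict__.update(fields)

def direct_children(root):
    """Dump `root` into a throwaway buffer with a Pickler whose
    persistent_id records every non-root Node and replaces its payload
    with an id, so nested Nodes are found but never serialized."""
    found = []

    class TraversalPickler(pickle.Pickler):
        def persistent_id(self, obj):
            if obj is root or not isinstance(obj, Node):
                return None               # pickle normally (and descend)
            if not any(obj is n for n in found):
                found.append(obj)
            return id(obj)                # short-circuit this subtree

    TraversalPickler(io.BytesIO()).dump(root)
    return found

# Build a small graph: source -> mapper -> shuffler
source = Node("source", data=list(range(5)))
mapper = Node("mapper", upstream=source)
shuffler = Node("shuffler", upstream=mapper)

print([n.name for n in direct_children(shuffler)])  # ['mapper']
```

Note that `source`'s `data` list is still serialized in full when traversing `source` itself; a type-based skip of non-container fields, as suggested above, would avoid that cost.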
🐛 Describe the bug
Hello, I found that a standard DataLoader takes unreasonably long to construct itself and to load the first batch if there is a field in the dataset that takes long to pickle (e.g. an in-memory dataset with a pandas frame and strings).
This happens only if `shuffle` is true and the datapipe is an `IterDataPipe`. The dataloader calls `apply_shuffle_settings`, which in turn calls `traverse_dps`, then `_list_connected_datapipes`, which eventually pickles all object fields in the dataset. I was not able to understand why one would need to pickle data fields to build a datapipe graph.

Here is a reproduction:
This code prints out:
In my case, I have an even slower datapipe that takes five minutes to pickle.
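The original snippet and its printout are not preserved above; as a stdlib-only sketch of the effect (the class name and sizes are made up, and `pickle.dumps` stands in for what `traverse_dps` effectively pays), compare the cost of pickling a heavy in-memory dataset against the cost of the work we actually wanted, fetching the first item:

```python
import pickle
import time

class HeavyDataset:
    """Stand-in for an in-memory IterDataPipe whose `rows` field is
    expensive to pickle (think a large pandas DataFrame of strings)."""
    def __init__(self, n=300_000):
        self.rows = ["row-%09d" % i for i in range(n)]
    def __iter__(self):
        return iter(self.rows)

ds = HeavyDataset()

t0 = time.perf_counter()
pickle.dumps(ds)              # roughly what pickle-based traversal costs
traversal_cost = time.perf_counter() - t0

t0 = time.perf_counter()
first = next(iter(ds))        # the work we actually wanted before batch 1
first_item_cost = time.perf_counter() - t0

print(f"pickling the dataset: {traversal_cost:.4f}s")
print(f"fetching first item:  {first_item_cost:.6f}s")
```

The gap between the two numbers grows with the size of the in-memory fields, which matches the multi-minute stalls reported here.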
A workaround
One can make a datapipe that doesn't take long to pickle (e.g. with lambdas):
In that case, the printout is
which is a 30x speedup in this case.
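The lambda-based snippet itself is not preserved above. A related way to keep a datapipe cheap to pickle, sketched here as an assumption rather than the author's exact code (`SlowPipe`/`FastPipe` are made-up names), is to drop the heavy field from the pickled state via `__getstate__`:

```python
import pickle
import time

class SlowPipe:
    """Pipe whose heavy in-memory table makes it slow to pickle."""
    def __init__(self):
        self.table = [str(i) * 20 for i in range(300_000)]
    def __iter__(self):
        return iter(self.table)

class FastPipe(SlowPipe):
    """Same data, but the heavy field is dropped from the pickled state,
    so pickle-based graph traversal never pays for serializing it.
    A pipe doing this must rebuild the field in __setstate__ if real
    (de)serialization is ever needed, e.g. for multiprocessing workers."""
    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("table", None)
        return state

def pickle_seconds(obj):
    t0 = time.perf_counter()
    pickle.dumps(obj)
    return time.perf_counter() - t0

slow_t, fast_t = pickle_seconds(SlowPipe()), pickle_seconds(FastPipe())
print(f"SlowPipe: {slow_t:.4f}s  FastPipe: {fast_t:.6f}s")
```

Like the lambda trick, this only hides the cost from traversal; the underlying issue of traversal serializing every field remains.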
Versions
cc @SsnL @VitalyFedyunin @ejguan @dzhulgakov