pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License

datapipe serialization support / cloudpickle / parallel support #113

Open d4l3k opened 2 years ago

d4l3k commented 2 years ago

I've been looking at how we might go about supporting torchdata within TorchX and with components. I was wondering what the serialization options were for transforms and what that might look like.

There are a couple of common patterns that would be nice to support:

  * general data transforms
  * data splitting into train/validation sets
  * summary statistic computation

For the general transforms and handling arbitrary user data, we were wondering how we might go about serializing the data pipes and transforms for use in a pipeline with TorchX.

There are a couple of options here:

  1. add serialization support to the transforms so you can serialize them (lambdas?)
  2. generate a .py file from a provided user function
  3. pickle the transform using something like cloudpickle/torch.package and load it in a trainer app
  4. ask the user to write a .py file that uses the datapipes as the transform and create a TorchX component (what we currently have)
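Option 3 could look roughly like the following sketch. This uses plain stdlib `pickle` as a stand-in for cloudpickle/torch.package, and `AddOffset` is a made-up transform, not a torchdata API:

```python
import pickle


class AddOffset:
    """A hypothetical transform: a callable object. Unlike a lambda,
    instances of module-level classes are pickle-friendly."""

    def __init__(self, offset):
        self.offset = offset

    def __call__(self, x):
        return x + self.offset


# Serialize the transform on the "component author" side...
payload = pickle.dumps(AddOffset(10))

# ...and reload it in a (simulated) trainer app.
transform = pickle.loads(payload)
print(transform(5))  # -> 15
```

A real implementation would swap `pickle` for cloudpickle (to also capture lambdas and closures) or torch.package (to bundle code alongside the state).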

Has there been any thought about how to support this well? Is there extra work that should be done here to make this better?

Are DataPipes guaranteed to be pickle safe and is there anything that needs to be done to support that?

I was also wondering whether there are multiprocessing-based datapipes and how those work, since this seems comparable. I did see https://github.com/pytorch/pytorch/blob/master/torch/utils/data/distributed.py but didn't see any examples of how to use that to achieve traditional PyTorch DataLoader-style workers.

P.S. Should this be on the PyTorch discussion forums instead? It's half feature request, half questions, so I wasn't sure where best to put it.

cc @kiukchung

ejguan commented 2 years ago

general data transforms

I am not a hundred percent sure about this. I would say we guarantee that a DataPipe graph (pipeline) is serializable together with user-provided functions. Our current approach is to pickle lambda functions using dill.
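The gap that dill papers over can be seen with stdlib `pickle` alone (a minimal illustration; `double` is a made-up function):

```python
import pickle

# Standard pickle stores functions by reference (module + qualified name),
# so a lambda, which has no importable name, fails to serialize:
try:
    pickle.dumps(lambda x: x * 2)
    lambda_ok = True
except (pickle.PicklingError, AttributeError):
    lambda_ok = False

print(lambda_ok)  # -> False


# A module-level named function pickles fine, since it can be looked up.
def double(x):
    return x * 2


fn = pickle.loads(pickle.dumps(double))
print(fn(21))  # -> 42
```

dill (and cloudpickle) instead serialize the lambda's bytecode and closure by value, which is why torchdata can accept lambdas in user-facing transforms.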

data splitting into train/validation sets

We provide a utility DataPipe that lets users split data into two separate pipelines. This may not be directly related, but I want to let you know: we will also provide dynamic sharding, which means users don't need to hardcode sharding settings in their Dataset.
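In spirit, the splitting utility routes each element to one of several output streams via a user-supplied classifier function, similar to torchdata's Demultiplexer. A plain-Python sketch (the `split` helper here is hypothetical, not the torchdata API):

```python
def split(source, classifier_fn):
    """Route each item into train (classifier returns 0) or valid (returns 1)."""
    train, valid = [], []
    for item in source:
        (train if classifier_fn(item) == 0 else valid).append(item)
    return train, valid


# e.g. hold out every 5th sample for validation
train, valid = split(range(10), lambda i: 0 if i % 5 else 1)
print(train)  # -> [1, 2, 3, 4, 6, 7, 8, 9]
print(valid)  # -> [0, 5]
```

The real DataPipe version is lazy (each branch is itself an iterable pipeline) rather than materializing lists up front.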

summary statistic computation

We currently have a way to retrieve the graph of a data pipeline, but better visualization is not done yet: https://github.com/pytorch/pytorch/blob/3202028ed1ca24c91dc7192ef69b305690db7abc/torch/utils/data/graph.py#L54
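Conceptually, that traversal walks the chain of wrapped pipes and records who feeds whom. A toy version in plain Python (the `Source`/`Filtered`/`Mapper` classes are hypothetical stand-ins for DataPipes, and this simplified `traverse` only follows a single `source_datapipe` attribute):

```python
class Source:
    source_datapipe = None  # a leaf pipe has no upstream


class Filtered:
    def __init__(self, source):
        self.source_datapipe = source


class Mapper:
    def __init__(self, source):
        self.source_datapipe = source


def traverse(pipe):
    """Return a nested {pipe_name: subgraph} dict, innermost pipe deepest."""
    if pipe is None:
        return {}
    return {type(pipe).__name__: traverse(getattr(pipe, "source_datapipe", None))}


graph = traverse(Mapper(Filtered(Source())))
print(graph)  # -> {'Mapper': {'Filtered': {'Source': {}}}}
```

The real `torch.utils.data.graph.traverse` additionally handles pipes with multiple inputs, which is what makes the result a graph rather than a chain.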

Are DataPipes guaranteed to be pickle safe and is there anything that needs to be done to support that?

Our provided DataPipes are guaranteed to be serializable, but we can't guarantee users' own DataPipe implementations. However, if users choose to use DataLoader2 with their datapipes, they will be notified about whether their DataPipe is serializable or not.

I was also wondering if there's multiprocessing based datapipes and how that works since this seems comparable

We will provide multiprocessing. The functionality is in place, but we are still working with internal teams to align on the API of DataLoaderV2.

should this be on the pytorch discussion forums instead?

I don't think this is the right timing, as we have not officially released yet. Also, the RFC is tracked in PyTorch Core, not in this repo.

ejguan commented 2 years ago

cc @VitalyFedyunin in case you want to add other comments.

kiukchung commented 1 year ago

@ejguan regarding builtin datapipes being pickle-safe... is this the way you'd recommend folks implement checkpointing for datapipes?

ejguan commented 1 year ago

regarding builtin datapipes being pickle-safe

IIRC, it's a requirement for both multiprocessing and checkpointing. @NivekT is working on checkpointing, so feel free to chime in.

NivekT commented 1 year ago

Yes, though you can write custom __getstate__ and __setstate__ methods to accomplish that.
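A hedged sketch of that pattern, using a hypothetical DataPipe-like iterator (plain Python, not the torchdata base class): custom `__getstate__`/`__setstate__` persist just enough to resume iteration after pickling mid-stream.

```python
import pickle


class CountingPipe:
    """Toy iterator that checkpoints its position via custom pickle hooks."""

    def __init__(self, limit):
        self.limit = limit
        self._pos = 0

    def __iter__(self):
        while self._pos < self.limit:
            self._pos += 1
            yield self._pos - 1

    def __getstate__(self):
        # Persist only what is needed to resume; drop anything unpicklable
        # (open file handles, sockets, etc. would be re-created on load).
        return {"limit": self.limit, "pos": self._pos}

    def __setstate__(self, state):
        self.limit = state["limit"]
        self._pos = state["pos"]


pipe = CountingPipe(5)
it = iter(pipe)
first = [next(it), next(it)]            # consume two elements
resumed = pickle.loads(pickle.dumps(pipe))
rest = list(resumed)                    # resumes at element 2
print(first, rest)  # -> [0, 1] [2, 3, 4]
```

The key design point is that `__getstate__` decides what "the state" of a pipe is, so a pipe wrapping an unpicklable resource can still checkpoint by storing a resumable description of it instead.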

kiukchung commented 1 year ago

IIUC, when num_workers > 1 the DataPipes are iterated on the dataloader workers (child processes). Therefore, the "state" of the datapipe is resident in the child process, not the parent (where the trainer loop runs). How exactly does one get the pickled state of the datapipe from the child process back to the parent for checkpointing?

NivekT commented 1 year ago

Good question! The plan is to use PrototypeMultiprocessingReadingService to pass request/response messages, where the response will be the pickled state of the DataPipe
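The request/response protocol described above can be sketched as follows. This is a thread-based illustration only: threads stand in for the worker processes a reading service would spawn, and `GetStateRequest`/`GetStateResponse` are made-up message names, not the torchdata protocol.

```python
import pickle
import queue
import threading


class FakePipe:
    """Stand-in for a DataPipe whose state lives in the worker."""

    def __init__(self):
        self.position = 0


def worker(req_q, res_q):
    pipe = FakePipe()
    pipe.position = 42  # pretend the worker advanced the pipe
    while True:
        msg = req_q.get()
        if msg == "GetStateRequest":
            # Reply to the parent with the pickled pipe state.
            res_q.put(("GetStateResponse", pickle.dumps(pipe.__dict__)))
        elif msg == "Terminate":
            return


req_q, res_q = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(req_q, res_q), daemon=True)
t.start()

req_q.put("GetStateRequest")   # parent asks for a checkpoint
kind, payload = res_q.get()    # worker replies with the pickled state
state = pickle.loads(payload)
req_q.put("Terminate")
t.join()
print(kind, state)  # -> GetStateResponse {'position': 42}
```

With real worker processes the queues become inter-process channels, but the shape is the same: the parent never touches the pipe directly, it only exchanges messages, and the pickled state crosses the process boundary as the response payload.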