uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Can it work with RDDs instead of DataFrames? #608

Open · tadas-subonis opened this issue 4 years ago

tadas-subonis commented 4 years ago

Is there a way to create PyTorch DataLoaders from RDDs directly, without converting to a DataFrame first?

This would allow me to pass numpy arrays directly, without trying to convert them to a list of lists of lists and so on when I have a high-dimensionality array.

selitvin commented 4 years ago

That's an interesting scenario. Petastorm's data reading mechanism is designed to work without any dependency on PySpark, so there is no way to access data directly from Spark RDDs. We do provide the make_spark_converter API, which helps automatically convert Spark data into PyTorch/TF-consumable types, but I am not sure if this is what you need.
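
A minimal usage sketch of make_spark_converter, assuming an existing SparkSession `spark` and DataFrame `df` (the cache path is illustrative):

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Required config: where Petastorm caches the intermediate parquet files.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               'file:///tmp/petastorm_cache')

# `df` is assumed to be an existing Spark DataFrame.
converter = make_spark_converter(df)

with converter.make_torch_dataloader(batch_size=32) as loader:
    for batch in loader:
        ...  # batch is a dict of torch tensors keyed by column name
```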

tadas-subonis commented 4 years ago

@selitvin Yeah. That's not exactly the thing, but for now I am using this workaround:

  1. create RDD pipeline
  2. convert all numpy arrays to bytes (.tobytes())
  3. convert RDD -> DataFrame
  4. call make_spark_converter
  5. create a transform function that converts the bytes inside pandas batches back into the original np arrays

This is far from ideal, but it seems to work :).
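
A rough sketch of this workaround, with an illustrative array shape and cache path, and `spark`/`rdd` assumed to already exist (`rdd` yielding float32 numpy arrays):

```python
import numpy as np

from petastorm import TransformSpec
from petastorm.spark import SparkDatasetConverter, make_spark_converter

ARRAY_SHAPE = (32, 32, 3)  # illustrative; adjust to the real data

# Petastorm needs a cache dir for the intermediate parquet files.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               'file:///tmp/petastorm_cache')

# Steps 1-3: numpy arrays -> bytes -> single-column DataFrame.
df = rdd.map(lambda arr: (bytearray(arr.astype(np.float32).tobytes()),)) \
        .toDF(['features'])

# Step 4: wrap the DataFrame in a converter.
converter = make_spark_converter(df)

# Step 5: decode the bytes back into arrays inside each pandas batch.
def decode_batch(pdf):
    pdf['features'] = pdf['features'].map(
        # .copy() yields a writable array for the torch conversion
        lambda b: np.frombuffer(b, dtype=np.float32)
                    .reshape(ARRAY_SHAPE).copy())
    return pdf

transform = TransformSpec(
    decode_batch,
    edit_fields=[('features', np.float32, ARRAY_SHAPE, False)])

with converter.make_torch_dataloader(transform_spec=transform,
                                     batch_size=32) as loader:
    for batch in loader:
        ...  # batch['features'] has shape (32,) + ARRAY_SHAPE
```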

The memory-based approach (all of the data already in memory) gives me about 109 iterations/s (the ideal case, I guess).

Using toLocalIterator() and my custom DataLoader, I can get somewhere between 8 it/s and 30 it/s (depending on how many workers I use). However, if I use too many workers, it generates lots of Spark requests, and that doesn't look too good.
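
For reference, a minimal sketch of what such a toLocalIterator()-based loader could look like (the actual custom DataLoader isn't shown in the thread; `rdd` yielding (features, label) numpy pairs is an assumption):

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class RddDataset(IterableDataset):
    """Streams records from a Spark RDD to the driver via toLocalIterator()."""

    def __init__(self, rdd):
        self.rdd = rdd

    def __iter__(self):
        # toLocalIterator() pulls partitions to the driver one at a time,
        # issuing separate Spark jobs as it goes (hence the many requests).
        for features, label in self.rdd.toLocalIterator():
            yield torch.from_numpy(features), torch.tensor(label)

loader = DataLoader(RddDataset(rdd), batch_size=32)
```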

Finally, with this conversion and Petastorm, I can get around 90 it/s.

selitvin commented 4 years ago

I think I understand your setup a little better now. make_spark_converter is syntactic sugar that simplifies working with a standard parquet file by saving the data as parquet in temporary storage and then creating a pytorch dataloader / tf dataset directly on top of it. However, Petastorm can take care of numpy serialization/deserialization on top of a parquet file directly. To do so, I would imagine your code could look like the sketch below.
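
A rough sketch of that direct approach, with an illustrative Unischema, shape, and output path, and `spark`/`rdd` assumed to exist (`rdd` yielding (id, array) pairs):

```python
import numpy as np
from pyspark.sql.types import IntegerType

from petastorm import make_reader
from petastorm.codecs import NdarrayCodec, ScalarCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.pytorch import DataLoader
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row

# Illustrative schema: an id plus a high-dimensional numpy array,
# stored with NdarrayCodec so Petastorm handles (de)serialization.
MySchema = Unischema('MySchema', [
    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('features', np.float32, (32, 32, 3), NdarrayCodec(), False),
])

output_url = 'file:///tmp/my_dataset'  # illustrative path

# Write the RDD out as a Petastorm-flavored parquet dataset.
with materialize_dataset(spark, output_url, MySchema, 256):
    rows = rdd.map(lambda x: dict_to_spark_row(
        MySchema, {'id': x[0], 'features': x[1].astype(np.float32)}))
    spark.createDataFrame(rows, MySchema.as_spark_schema()) \
         .write.mode('overwrite').parquet(output_url)

# Read it back as torch tensors; no manual byte decoding needed.
with DataLoader(make_reader(output_url), batch_size=16) as loader:
    for batch in loader:
        ...  # batch['features'] is a (16, 32, 32, 3) tensor
```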

I am referring to the process shown in this example.

tadas-subonis commented 4 years ago

Thanks. I'll give this a go too.