Open tadas-subonis opened 4 years ago
That's an interesting scenario. Petastorm's data reading mechanism is designed to work without any dependency on PySpark. As such, there is no way to access data directly from Spark RDDs. We do provide the make_spark_converter
API, which helps automatically convert Spark data into PyTorch/TF consumable types, but I am not sure if this is what you need.
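For reference, a minimal sketch of how make_spark_converter is typically wired up. This is not from the thread: the SparkSession `spark`, the DataFrame `df`, the cache path, and the batch size are all assumptions here.

```python
# Hedged sketch, assuming a running SparkSession `spark` and a DataFrame `df`.
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# The converter needs a parent cache directory for the intermediate parquet.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               'file:///tmp/petastorm_cache')  # assumed path

converter = make_spark_converter(df)  # materializes df as parquet in the cache

# Create a PyTorch DataLoader directly on top of the cached parquet.
with converter.make_torch_dataloader(batch_size=64) as dataloader:
    for batch in dataloader:
        pass  # feed `batch` to the model
```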
@selitvin Yeah. That's not exactly the thing, but now I am using this workaround:
This is far from ideal, but it seems to work :).
The memory-based (data is in memory) approach gives me 109 iterations/s (the perfect case, I guess).
Using toLocalIterator() and my custom DataLoader I can get somewhere between 8 it/s and 30 it/s (depending on how many workers I use). However, if I use too many workers, it generates a lot of Spark requests, and that doesn't look too good.
Finally, with this conversion and Petastorm I can get around 90 it/s.
I think I understand your setup a little better now.
make_spark_converter
is syntactic sugar that simplifies working with a standard parquet file by saving the data as parquet in temporary storage and then creating a PyTorch dataloader / TF dataset directly on top of it.
However, petastorm can take care of numpy serialization/deserialization on top of a parquet file directly.
To do so, I would imagine your flow could look like:
1. Use the materialize_dataset context manager to create a "petastorm" dataset, i.e. a parquet dataset with numpy arrays (https://github.com/uber/petastorm/blob/f4685af4882482b7288a3a04bb8cbf4f7a2ffd05/examples/hello_world/petastorm_dataset/generate_petastorm_dataset.py).
2. Use petastorm.make_reader to read data from that petastorm-parquet dataset.
I am referring to the process shown in this example.
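The two steps above can be sketched roughly as follows, closely following the linked hello_world example. The SparkSession `spark`, SparkContext `sc`, the schema fields, and the output path are assumptions for illustration, not part of the thread.

```python
import numpy as np
from pyspark.sql.types import IntegerType

from petastorm import make_reader
from petastorm.codecs import NdarrayCodec, ScalarCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.pytorch import DataLoader
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row

# A schema with a true numpy field -- no list-of-lists conversion needed.
MySchema = Unischema('MySchema', [
    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('features', np.float32, (128,), NdarrayCodec(), False),
])

output_url = 'file:///tmp/petastorm_dataset'  # assumed output location

# Step 1: write the dataset inside the materialize_dataset context manager,
# which records the petastorm metadata needed for numpy deserialization.
with materialize_dataset(spark, output_url, MySchema, rowgroup_size_mb=256):
    rows = sc.parallelize(range(100)) \
        .map(lambda i: {'id': i,
                        'features': np.random.rand(128).astype(np.float32)}) \
        .map(lambda r: dict_to_spark_row(MySchema, r))
    spark.createDataFrame(rows, MySchema.as_spark_schema()) \
        .write.mode('overwrite').parquet(output_url)

# Step 2: read it back with make_reader, wrapped in petastorm's DataLoader.
with DataLoader(make_reader(output_url), batch_size=16) as loader:
    for batch in loader:
        pass  # batch['features'] comes back as arrays, not nested lists
```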
Thanks. I'll give this a go too
Is there a way to create DataLoaders for PyTorch from RDDs directly without doing a conversion to the DataFrame?
This would allow me to pass numpy arrays directly, without trying to convert them to a list of lists of lists (and so on) in case I have a high-dimensionality array.
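To make the list-of-lists problem concrete: a plain Spark DataFrame has no native ndarray type, so a high-dimensional array has to be flattened into nested Python lists, one level per dimension, and the dtype is lost unless restated on the way back. A small self-contained illustration (plain numpy, no Spark):

```python
import numpy as np

# A 3-D array that a plain Spark DataFrame column would force into
# nested Python lists (there is no native ndarray column type).
arr = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

nested = arr.tolist()  # list of lists of lists -- one level per dimension
print(type(nested).__name__, len(nested), len(nested[0]), len(nested[0][0]))
# → list 2 3 4

# Round-tripping back loses the dtype unless it is restated explicitly.
restored = np.array(nested, dtype=np.float32)
print(np.array_equal(arr, restored))
# → True
```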