Any method to encode a spark's SparseVector and pass it to petastorm

uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.

Apache License 2.0

Hi, I have a derived DataFrame in Spark with a few highly sparse vector columns. I currently store them as SparseVector and save the DataFrame as a Parquet file. I tried reading this Parquet file inside a SageMaker notebook, but AWS SageMaker notebooks do not support PySpark 3 right now, and I got:

RuntimeError: Vector columns are only supported in pyspark>=3.0

Is there a way to encode this column, read it back on the client side (PyTorch), and convert it to a dense vector there before passing it to training? The idea is to avoid converting the column to DenseVector before writing, which saves disk space and network transfer time. Is there a sparse vector format that Petastorm supports? If not, would it be possible to add this feature? It seems like a highly useful feature, and I am willing to contribute if needed.

My last resort would be to split each sparse vector into three columns, namely size (a scalar), keys, and values, save them to Parquet as NumPy dtypes, and then combine them into a dense vector (a NumPy array) in a TransformSpec function. But this doesn't feel like an optimal approach. Any suggestions?
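
To make the three-column idea concrete, here is a minimal NumPy-only sketch of the round trip. The column names (features_size, features_indices, features_values) and both helper functions are hypothetical, chosen only for illustration; they are not part of Petastorm or Spark. The sketch assumes the transform function receives one row as a dict, and it does not show the Spark-side write.

```python
import numpy as np


def encode_row(dense, prefix="features"):
    """Split a 1-D vector into size / indices / values columns.

    Mirrors the fields of a sparse vector; the column names are hypothetical.
    """
    dense = np.asarray(dense, dtype=np.float32)
    indices = np.flatnonzero(dense).astype(np.int32)  # positions of non-zero entries
    return {
        prefix + "_size": np.int32(dense.size),
        prefix + "_indices": indices,
        prefix + "_values": dense[indices],
    }


def densify_row(row, prefix="features"):
    """Rebuild the dense vector from the three columns of a single row (a dict)."""
    size = int(row.pop(prefix + "_size"))
    indices = np.asarray(row.pop(prefix + "_indices"), dtype=np.int64)
    values = np.asarray(row.pop(prefix + "_values"), dtype=np.float32)
    dense = np.zeros(size, dtype=np.float32)
    dense[indices] = values  # scatter the stored values back into place
    row[prefix] = dense
    return row


if __name__ == "__main__":
    stored = encode_row([0.0, 0.0, 1.5, 0.0, 2.0])
    restored = densify_row(dict(stored))
    print(restored["features"])
```

A function like densify_row could presumably be passed as the func argument of a Petastorm TransformSpec. In that case the schema change (the new dense field and the three removed columns) would also have to be declared through the edit_fields and removed_fields arguments, and the dense vector would need a fixed, known size.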

Any method to encode a spark's SparseVector and pass it to petastorm #649