uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Any method to encode a spark's SparseVector and pass it to petastorm #649

Open kapilkd13 opened 3 years ago

kapilkd13 commented 3 years ago

Hi, I have a derived DataFrame in Spark with a few highly sparse vector columns. Currently I store them as SparseVector and save the DataFrame as a Parquet file. I tried reading this Parquet file inside a SageMaker notebook, but AWS SageMaker notebooks do not support PySpark 3 right now, and I got: RuntimeError: Vector columns are only supported in pyspark>=3.0. Is there a way to encode this column, read it back on the client side (PyTorch), and convert it to a dense vector there before passing it to training? The idea is to avoid converting the column to DenseVector before writing, which saves disk space and network transfer time. Is there a sparse vector format that Petastorm supports? If not, would it be possible to add this feature? It seems like it would be widely useful. I am willing to contribute if needed.

My last resort would be to split each sparse vector into three columns, namely size (a scalar), keys, and values, save them to Parquet as NumPy dtypes, and then combine them into a dense vector (NumPy array) in a TransformSpec function. But this doesn't feel like an optimal approach. Any suggestions?
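For illustration, the client-side half of that last-resort approach could look roughly like the sketch below. It uses only NumPy; the column names (`features_size`, `features_indices`, `features_values`) and the helper names `densify` and `transform_row` are hypothetical, and it assumes each row arrives as a dict-like object, which is how a per-row transform function would typically see it.

```python
import numpy as np


def densify(size, indices, values, dtype=np.float32):
    """Rebuild a dense 1-D array from the three sparse components."""
    size = int(size)
    indices = np.asarray(indices, dtype=np.int64)
    values = np.asarray(values, dtype=dtype)
    if indices.shape != values.shape:
        raise ValueError("indices and values must have the same length")
    if indices.size and (indices.min() < 0 or indices.max() >= size):
        raise ValueError("index out of range for the declared size")
    dense = np.zeros(size, dtype=dtype)
    dense[indices] = values  # positions not listed in indices stay zero
    return dense


def transform_row(row):
    """Replace the three sparse columns with one dense 'features' column.

    The column names are assumptions made for this sketch only.
    """
    row = dict(row)  # copy so the caller's row is not mutated
    row["features"] = densify(
        row.pop("features_size"),
        row.pop("features_indices"),
        row.pop("features_values"),
    )
    return row
```

A function like `transform_row` is the kind of callable one would hand to a TransformSpec, together with a declaration of the added and removed fields; that wiring depends on the dataset schema and is left out here. Only the three compact columns travel over the network, and the dense array exists only in the worker's memory.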

selitvin commented 3 years ago

That's an interesting question. Do you know how these sparse matrices are read by pyarrow? My guess is that they are read as a coordinate vector and a data vector. I would assume that these could be wired into petastorm somehow. I would recommend not converting them into dense matrices within petastorm if possible, since some applications would not be able to afford the memory/time overhead of densifying the matrices.
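As a rough sketch of what "not densifying" could mean on the client, the per-row coordinate and data vectors can be stacked into a single COO (coordinate) batch. The function name `batch_to_coo` and the input layout (one `(indices, values)` pair per sample) are assumptions for illustration, not an existing petastorm API. To my understanding, Spark ML stores a vector column in Parquet as a struct with `type`, `size`, `indices`, and `values` fields, so these two arrays would come from that struct, but this should be checked against the actual file schema.

```python
import numpy as np


def batch_to_coo(rows, size):
    """Stack per-sample sparse vectors into one COO batch.

    rows: sequence of (indices, values) pairs, one pair per sample.
    size: length of each (logical) dense vector.
    Returns (coords, data, shape), where coords has shape (2, nnz):
    coords[0] holds sample (row) ids, coords[1] holds column ids.
    """
    row_ids, col_ids, data = [], [], []
    for r, (indices, values) in enumerate(rows):
        indices = np.asarray(indices, dtype=np.int64)
        values = np.asarray(values, dtype=np.float32)
        if indices.shape != values.shape:
            raise ValueError("indices and values must have the same length")
        row_ids.append(np.full(indices.shape, r, dtype=np.int64))
        col_ids.append(indices)
        data.append(values)

    n = len(row_ids)
    if n == 0:
        # Empty batch: no coordinates and no data.
        return np.zeros((2, 0), dtype=np.int64), np.zeros(0, dtype=np.float32), (0, size)

    coords = np.stack([np.concatenate(row_ids), np.concatenate(col_ids)])
    return coords, np.concatenate(data), (n, size)
```

The `(2, nnz)` coordinate layout is the one that sparse COO tensor constructors such as `torch.sparse_coo_tensor` accept, so memory use stays proportional to the number of non-zero entries rather than to `n * size`.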

If you are feeling adventurous and want to take a stab at this petastorm extension, I would be happy to provide the support you'd need.