Open kapilkd13 opened 3 years ago
That's an interesting question. Do you know how these sparse matrices are read by pyarrow? My guess would be that they come back as a coordinate (indices) vector plus a data (values) vector. I would assume these could be wired into petastorm somehow. I would recommend not turning them into dense matrices within petastorm if possible, since some applications cannot afford the memory/time overhead of densifying the matrices.
If you are feeling adventurous and want to take a stab at this petastorm extension, I would be happy to provide the support you'd need.
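For reference, here is a minimal sketch (not petastorm code) of what a Spark ML vector column looks like when the Parquet file it was written to is read back with pyarrow. As far as I can tell, Spark's `VectorUDT` is serialized as a struct of roughly `{type, size, indices, values}`, so the coordinate and data vectors are already there without densifying. The file path `features.parquet` and the column name `features` are assumptions for illustration.

```python
import pyarrow.parquet as pq

table = pq.read_table("features.parquet")    # hypothetical path
print(table.schema)                          # expect a struct column for the vector

rows = table.column("features").to_pylist()  # 'features' is an assumed column name
first = rows[0]
# For a SparseVector, 'first' is a dict along the lines of:
#   {'type': 0, 'size': 10000, 'indices': [3, 17, ...], 'values': [0.5, 1.2, ...]}
# i.e. the coordinate vector ('indices') and data vector ('values') mentioned above,
# which downstream code could consume without ever building a dense array.
print(first["size"], len(first["indices"]), len(first["values"]))
```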
Hi, I have a derived dataframe in Spark that has a few vector columns with high sparsity. Currently I am storing them as `SparseVector` and saving to a Parquet file. I tried reading this Parquet file inside a SageMaker notebook, but AWS SageMaker notebooks do not support pyspark 3 right now, and I got `RuntimeError: Vector columns are only supported in pyspark>=3.0`.
Is there a way to encode this column, read it back on the client side (PyTorch), and convert it to a dense vector there before passing it in for training? The idea is to avoid converting the column to a DenseVector, to save disk space and network transfer time. Is there a sparse vector format that Petastorm supports, and if not, is it possible to add this feature? It seems like a highly useful feature, and I am willing to contribute if needed. My last resort would be to split a sparse vector into three columns, namely size (scalar), indices, and values, save them to Parquet as NumPy dtypes, and then combine them in a TransformSpec function into a dense vector (np array), roughly as sketched below. But this doesn't feel like an optimal approach. Any suggestions?
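Here is a rough sketch of that last-resort workaround, assuming the Spark job wrote three plain columns, `size` (int), `indices` (array of int), and `values` (array of double), instead of a `SparseVector`. The column names, the fixed vector length `VECTOR_SIZE`, and the dataset path are all assumptions for illustration; the densification happens only on the client, after the small sparse columns have crossed the network.

```python
import numpy as np
from petastorm import make_batch_reader
from petastorm.transform import TransformSpec

VECTOR_SIZE = 10000  # assumed fixed dimensionality of the sparse vectors


def _densify(pdf):
    # pdf is a pandas DataFrame batch; build one dense float32 row per record.
    dense = np.zeros((len(pdf), VECTOR_SIZE), dtype=np.float32)
    for i, (idx, vals) in enumerate(zip(pdf["indices"], pdf["values"])):
        dense[i, np.asarray(idx, dtype=np.int64)] = vals
    pdf["features"] = list(dense)
    return pdf.drop(columns=["size", "indices", "values"])


transform = TransformSpec(
    _densify,
    edit_fields=[("features", np.float32, (VECTOR_SIZE,), False)],
    removed_fields=["size", "indices", "values"],
)

with make_batch_reader("file:///path/to/sparse_dataset",  # hypothetical URL
                       transform_spec=transform) as reader:
    for batch in reader:
        arr = np.stack(batch.features)  # (rows_in_batch, VECTOR_SIZE)
        print(arr.shape)
        break
```

The same reader could presumably be wrapped in `petastorm.pytorch.DataLoader` to feed a PyTorch training loop, but the drawback the question points out remains: the data is only sparse at rest and in transit, and still has to be densified in the transform before training.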