uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

allow NdarrayCodec.decode to pass through ndarrays #428

Closed stedn closed 5 years ago

stedn commented 5 years ago

We have a workflow in which the dataset was not encoded with NdarrayCodec, but we were still decoding with it, so the ndarray was essentially being passed straight through. I was wondering whether it is worthwhile to allow this behavior. It's also possible that there is another way to do this that we missed.
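As a rough illustration of the requested behavior, here is a minimal sketch. `PassThroughNdarrayCodec` is a hypothetical stand-in written for this example, not Petastorm's actual `NdarrayCodec`; its method signatures and the `np.save`/`np.load` serialization are assumptions chosen to keep the example self-contained.

```python
from io import BytesIO

import numpy as np


class PassThroughNdarrayCodec:
    """Hypothetical stand-in codec (not Petastorm's NdarrayCodec)."""

    def encode(self, value):
        # Serialize the array into an in-memory .npy payload.
        buf = BytesIO()
        np.save(buf, value)
        return bytearray(buf.getvalue())

    def decode(self, value):
        # Requested behavior: a value that is already an ndarray is
        # returned unchanged instead of being deserialized.
        if isinstance(value, np.ndarray):
            return value
        return np.load(BytesIO(value))


codec = PassThroughNdarrayCodec()
original = np.arange(6, dtype=np.int32).reshape(2, 3)

round_tripped = codec.decode(codec.encode(original))  # normal encode/decode path
passed_through = codec.decode(original)               # pass-through path
```

With this sketch the pass-through path returns the very same array object, while the normal path returns an equal copy rebuilt from the serialized bytes.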

CLAassistant commented 5 years ago

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


stedn seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

stedn commented 5 years ago

Never mind, that CLA thing is too onerous for 2 lines of code.

selitvin commented 5 years ago

This is an interesting request. Can you please give a few more details about your setup? I am curious: are you reading from a Parquet store?

We saw an internal use case where we use a storage format that is different from Parquet and that is capable of storing numpy arrays natively. We would need similar functionality for bypassing the decoding. I wonder if your case is somewhat similar?

I was working on #421 (in that PR, codec=None can be set only for scalars for now). Do you think specifying codec=None for ndarrays would work for your case as well?
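To make the codec=None idea concrete, here is a minimal sketch of a decoding step that skips fields with no codec. `Field`, `JsonCodec`, and `decode_row` are hypothetical names invented for this example; they are not Petastorm's API and do not reflect how #421 is implemented.

```python
import json
from collections import namedtuple

# Hypothetical field description: a name plus an optional codec.
Field = namedtuple("Field", ["name", "codec"])


class JsonCodec:
    """Hypothetical codec that stores values as JSON text."""

    def encode(self, value):
        return json.dumps(value)

    def decode(self, value):
        return json.loads(value)


def decode_row(fields, row):
    """Decode each stored value; fields whose codec is None pass through."""
    decoded = {}
    for field in fields:
        raw = row[field.name]
        decoded[field.name] = raw if field.codec is None else field.codec.decode(raw)
    return decoded


# "image" has codec=None, so its natively stored value is returned as-is.
fields = [Field("label", JsonCodec()), Field("image", None)]
stored_row = {"label": '{"id": 7}', "image": [[0, 1], [2, 3]]}
decoded_row = decode_row(fields, stored_row)
```

The design choice here is that the schema, rather than the codec, decides whether decoding is bypassed, which would fit a storage format that already holds arrays natively.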