irmen / Pyrolite

Java and .NET client interface for Pyro5 protocol
MIT License

Support pickle/unpickle for numpy types #52

icexelloss closed this issue 7 years ago

icexelloss commented 7 years ago

Hello,

I am a PySpark user. PySpark currently uses Pyrolite to pickle/unpickle data in the JVM and send it to/from Python. Here is where it is used:

https://github.com/apache/spark/blob/6a5a7254dc37952505989e9e580a14543adb730c/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L25

It is a known issue that this doesn't work with numpy types in Python; that is, if the user tries to send a numpy type to the JVM, an exception is thrown inside:

https://github.com/irmen/Pyrolite/blob/master/java/src/main/java/net/razorvine/pickle/Unpickler.java#L154

The exception is something like:

Caused by: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
    at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
    at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
    at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
    at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
    at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1$$anonfun$apply$5.apply(BatchEvalPythonExec.scala:137)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1$$anonfun$apply$5.apply(BatchEvalPythonExec.scala:136)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more

I am wondering what your thoughts are on supporting numpy types?

irmen commented 7 years ago

Something related to this is written here: https://issues.apache.org/jira/browse/SPARK-12157

Also, you can register custom deserializers. Have you tried that? See net.razorvine.pickle.Unpickler.registerConstructor(...)
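For illustration, a registration could look something like the sketch below. This is not tested code: the JavaDtype holder is made up for the example, and it assumes numpy pickles a dtype roughly as numpy.dtype(typestr, align, copy) followed by a BUILD/__setstate__ step.

```java
import net.razorvine.pickle.PickleException;
import net.razorvine.pickle.Unpickler;
import net.razorvine.pickle.objects.IObjectConstructor;

// Illustrative holder for an unpickled numpy.dtype. The assumption is that
// the dtype's reduce args put the type string first (e.g. "f8" for float64).
class JavaDtype {
    public final String typestr;
    JavaDtype(String typestr) { this.typestr = typestr; }

    // Invoked by the unpickler for the BUILD opcode; this sketch simply
    // discards the dtype's state tuple (byte order, itemsize, ...).
    public void __setstate__(Object[] state) { }
}

class DtypeConstructor implements IObjectConstructor {
    public Object construct(Object[] args) {
        if (args.length == 0 || !(args[0] instanceof String))
            throw new PickleException("unexpected arguments for numpy.dtype");
        return new JavaDtype((String) args[0]);
    }
}

class RegisterExample {
    public static void main(String[] args) {
        // registration is static, so it applies to every Unpickler instance
        Unpickler.registerConstructor("numpy", "dtype", new DtypeConstructor());
    }
}
```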

My view on supporting numpy types has so far been to encourage conversion to a built-in type first. This avoids a lot of issues: http://pyro4.readthedocs.io/en/stable/tipstricks.html#pyro-and-numpy
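On the Python side, ndarray.tolist() or scalar.item() give you such built-in values, and those already round-trip cleanly through Pyrolite. A minimal sketch (the class name is mine, just for the example):

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import net.razorvine.pickle.PickleException;
import net.razorvine.pickle.Pickler;
import net.razorvine.pickle.Unpickler;

public class BuiltinRoundTrip {
    public static void main(String[] args) throws PickleException, IOException {
        // The kind of value you get after tolist()/item() on the Python side.
        List<Double> values = Arrays.asList(1.0, 2.5, 3.25);

        byte[] pickled = new Pickler().dumps(values);
        Object back = new Unpickler().loads(pickled);

        // Built-in sequences come back as a java.util.ArrayList.
        System.out.println(back.getClass().getName() + " -> " + back);
    }
}
```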

irmen commented 7 years ago

Have you looked at the links I provided? Also note that to support numpy types, Pyrolite would have to contain converters from all numpy machine types to the appropriate Java and .NET primitives. I prefer not to include that bloat in the Pyrolite library and instead rely on the existing type converters that are already present on the Python side.

icexelloss commented 7 years ago

Hi,

Thanks for the information. I have looked at all the links, but I am not sure what the best way to approach this is. I will ping https://issues.apache.org/jira/browse/SPARK-12157 and check out net.razorvine.pickle.Unpickler.registerConstructor(...)

I don't fully understand the complexity of supporting numpy types in Pyrolite, but if you think that's not the best way to approach the PySpark issue, feel free to close this issue.

Thanks again for your help.

irmen commented 7 years ago

To get an idea of what is involved, have a look at https://github.com/irmen/Pyrolite/blob/master/java/src/main/java/net/razorvine/pickle/objects/ArrayConstructor.java

This is the code needed to convert Python's built-in array type into its Java equivalents.

For numpy, something similar is required. I haven't had the time or felt the need to write this myself, and I'm not overly familiar with numpy. I hope someone else can do the job and open a pull request 😉

(A similar conversion class exists in the .NET code tree and would have to be updated as well, including unit tests for both Java and .NET.)
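To give a flavor of what the numpy equivalent would involve, here is a rough sketch of the byte-level conversion only. The class name and the simplified dtype strings are assumptions; a real implementation would additionally have to hook numpy.core.multiarray._reconstruct and unpack the array's __setstate__ data (shape, dtype, raw buffer).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import net.razorvine.pickle.PickleException;

// Illustrative only: turn a raw little-endian numpy data buffer into a Java
// primitive array, keyed by a (simplified) dtype string.
public class NumpyBufferConverter {
    public static Object convert(String dtype, byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        switch (dtype) {
            case "<i4": {
                int[] out = new int[data.length / 4];
                buf.asIntBuffer().get(out);
                return out;
            }
            case "<i8": {
                long[] out = new long[data.length / 8];
                buf.asLongBuffer().get(out);
                return out;
            }
            case "<f4": {
                float[] out = new float[data.length / 4];
                buf.asFloatBuffer().get(out);
                return out;
            }
            case "<f8": {
                double[] out = new double[data.length / 8];
                buf.asDoubleBuffer().get(out);
                return out;
            }
            default:
                throw new PickleException("unsupported numpy dtype: " + dtype);
        }
    }
}
```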

irmen commented 7 years ago

Closing this for now, because I'm not going to add this myself. Still accepting pull requests for it, though.