Something related to this is described here: https://issues.apache.org/jira/browse/SPARK-12157
Also, you can register custom deserializers. Have you tried that? See net.razorvine.pickle.Unpickler.registerConstructor(...)
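For reference, here is a minimal sketch of what such a registration could look like (Unpickler.registerConstructor and IObjectConstructor are the actual Pyrolite APIs; the numpy module/class names and the conversion logic inside construct are assumptions, meant only to illustrate the idea):

```java
import net.razorvine.pickle.IObjectConstructor;
import net.razorvine.pickle.PickleException;
import net.razorvine.pickle.Unpickler;

// Hypothetical constructor that turns a pickled numpy scalar into a plain Java object.
// How args is laid out depends on how numpy pickles the value, so treat the body
// as a placeholder for real conversion logic.
class NumpyScalarConstructor implements IObjectConstructor {
    @Override
    public Object construct(Object[] args) throws PickleException {
        if (args.length == 0) {
            throw new PickleException("expected arguments for a numpy scalar");
        }
        // A real implementation would inspect the dtype and rebuild the value;
        // this sketch just hands back the last argument unchanged.
        return args[args.length - 1];
    }
}

class NumpyRegistration {
    static void register() {
        // "numpy.core.multiarray"/"scalar" is the module/class pair numpy commonly
        // uses when pickling scalars (an assumption; verify against your pickle stream).
        Unpickler.registerConstructor("numpy.core.multiarray", "scalar",
                new NumpyScalarConstructor());
    }
}
```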
My view on supporting numpy types has so far been to encourage conversion to a built-in Python type first. That avoids a lot of issues: http://pyro4.readthedocs.io/en/stable/tipstricks.html#pyro-and-numpy
Have you looked at the links I provided? Also note that to support numpy types, pyrolite would have to contain converters for all numpy machine types to the appropriate java and .net primitives. I prefer not to include that bloat in the pyrolite library and instead to rely on the existing type converters that are already present on the python side.
Hi,
Thanks for the information. I have looked at all the links, but I'm not sure what the best way to approach this is. I will ping https://issues.apache.org/jira/browse/SPARK-12157 and check out net.razorvine.pickle.Unpickler.registerConstructor(...)
I don't fully understand the complexity of supporting numpy types in Pyrolite, but if you think that's not the best way to approach the PySpark issue, feel free to close this issue.
Thanks again for your help.
To get an idea of what is involved, have a look at https://github.com/irmen/Pyrolite/blob/master/java/src/main/java/net/razorvine/pickle/objects/ArrayConstructor.java
This is the code needed to convert Python's built-in array type into Java equivalents.
For numpy, something similar is required. I haven't had the time or felt the need to write this myself, and I'm not overly familiar with numpy. I hope someone else will do the job and open a pull request 😉
(A similar conversion class exists in the .NET code tree and would have to be updated as well, including unit tests for both Java and .NET.)
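To give a feel for the kind of mapping such a class would need, here is a rough sketch, modelled loosely on how ArrayConstructor maps array typecodes to Java arrays (the dtype names and target Java types below are assumptions; a complete implementation would have to cover every numpy machine type, plus the .NET counterparts):

```java
// Rough sketch of the numpy dtype -> Java mapping a numpy-aware constructor would need.
// The dtype names are assumptions; a full implementation would cover all machine types.
final class NumpyTypeMapping {
    static Object convertScalar(String dtypeName, Number raw) {
        switch (dtypeName) {
            case "int8":    return raw.byteValue();
            case "int16":   return raw.shortValue();
            case "int32":   return raw.intValue();
            case "int64":   return raw.longValue();
            case "float32": return raw.floatValue();
            case "float64": return raw.doubleValue();
            default:
                throw new IllegalArgumentException("unsupported numpy dtype: " + dtypeName);
        }
    }
}
```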
Closing this for now, because I'm not going to add this myself. Still accepting pull requests for it, though.
Hello,
I am a PySpark user. PySpark currently uses Pyrolite to pickle/unpickle data in the JVM and send it to/from Python. Here is where it gets used:
https://github.com/apache/spark/blob/6a5a7254dc37952505989e9e580a14543adb730c/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L25
It is a known issue that this doesn't work with numpy types in Python; that is, if the user tries to send a numpy type to the JVM, an exception is thrown inside:
https://github.com/irmen/Pyrolite/blob/master/java/src/main/java/net/razorvine/pickle/Unpickler.java#L154
The exception is something like:
I am wondering what your thoughts are on supporting numpy types?
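For context, here is a minimal sketch of the JVM-side call where this surfaces (Unpickler.loads is the real Pyrolite API; how the pickled bytes reach the JVM from the Python worker is assumed, e.g. pickle.dumps of a numpy value on the Python side):

```java
import java.io.IOException;
import net.razorvine.pickle.PickleException;
import net.razorvine.pickle.Unpickler;

// Minimal sketch: bytes pickled on the Python side from a numpy value fail to
// unpickle because Pyrolite has no constructor registered for numpy's classes.
public class NumpyUnpickleDemo {
    public static Object tryUnpickle(byte[] pickledFromPython) throws IOException {
        Unpickler unpickler = new Unpickler();
        try {
            return unpickler.loads(pickledFromPython);
        } catch (PickleException e) {
            // This is the failure path described above for numpy types.
            System.err.println("unpickling a numpy value failed: " + e.getMessage());
            return null;
        }
    }
}
```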