Open turtlemonvh opened 4 years ago
For another nice example, see: https://github.com/LucaCanali/sparkMeasure/blob/master/python/sparkmeasure/taskmetrics.py
Note that for most of these wrappers a separate python library is published, but the end user still needs to install the jar (containing the core scala code plus the java wrappers) on their Spark cluster before the python wrappers will work.
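As a rough sketch of what that looks like for the user (the jar name and path below are placeholders, not anything published yet):

```python
# Sketch only: the jar path is a placeholder for wherever the published
# jar ends up.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("transformer-wrapper-example")
    # Ship the jar (core scala code + java wrappers) with the application
    # so the python wrapper has JVM classes to call.
    .config("spark.jars", "/path/to/core-transformers.jar")
    .getOrCreate()
)
```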
Depending on the complexity of the transformer configuration added in #6, we will want the user to be able to pass in a python-specific version of that configuration. We will then translate those python objects into java objects before calling the java wrapper.
Py4J's automatic conversions should make this easier, e.g. https://stackoverflow.com/questions/15808017/py4j-dict-to-java-map
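A minimal sketch of that conversion, assuming a running PySpark session (the wrapper class name in the final comment is made up for illustration):

```python
from py4j.java_collections import MapConverter
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
gateway_client = spark.sparkContext._gateway._gateway_client

py_config = {"inputCol": "text", "outputCol": "tokens"}

# Explicit python dict -> java.util.Map conversion. Py4J can also do this
# automatically when the gateway is started with auto_convert=True.
java_config = MapConverter().convert(py_config, gateway_client)

# Hypothetical java wrapper class, for illustration only:
# spark._jvm.com.example.transformers.JavaTransformWrapper(java_config)
```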
We want to be able to use the same high-performance transform function from PySpark in addition to Java/Scala.
More on wrapping Scala for pyspark:
Per these posts, Py4J only runs on the driver (master), not on the worker nodes, so python can only orchestrate JVM calls from the driver; the per-row transform work has to live in the scala/java code shipped in the jar.
Wrapping the Scala code in a simple Java wrapper to make it easier to call from python is a common approach.
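For illustration, the python side of such a wrapper might look something like the sketch below; `com.example.transformers.JavaTransformWrapper` is a hypothetical class name, not anything in this repo:

```python
from py4j.java_collections import MapConverter
from pyspark.sql import DataFrame


def transform(df: DataFrame, config: dict) -> DataFrame:
    """Call the JVM-side transform through py4j and rewrap the result."""
    sc = df.sql_ctx._sc  # SparkContext behind the DataFrame
    jvm = sc._jvm
    java_config = MapConverter().convert(config, sc._gateway._gateway_client)

    # df._jdf is the underlying java DataFrame; the java wrapper returns
    # another java DataFrame, which we rewrap for the python caller.
    jdf = jvm.com.example.transformers.JavaTransformWrapper.transform(
        df._jdf, java_config
    )
    return DataFrame(jdf, df.sql_ctx)
```

This is the same pattern the sparkMeasure wrapper linked above uses: the python function only shuffles handles (`_jdf`, converted config) across the py4j gateway on the driver, while all the real work happens in the JVM.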