Open turtlemonvh opened 4 years ago
For another nice example, see: https://github.com/LucaCanali/sparkMeasure/blob/master/python/sparkmeasure/taskmetrics.py
Note that for most of these wrappers a separate python library is published, but the end user still needs to install the jar (containing the core scala code plus the java wrappers) on their Spark cluster before the python wrappers will work.
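As a rough sketch of what that looks like for the user (the jar name and path below are placeholders, not anything published yet):

```python
# Sketch only: the jar path is a placeholder for wherever the published
# jar ends up.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("transformer-wrapper-example")
    # Ship the jar (core scala code + java wrappers) with the application
    # so the python wrapper has JVM classes to call.
    .config("spark.jars", "/path/to/core-transformers.jar")
    .getOrCreate()
)
```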
Depending on the complexity of the transformer configuration added in #6, we will want the user to be able to pass in a python-specific version of that configuration. We will then translate those python objects into java objects before calling the java wrapper.
Py4J's automatic conversions should make this easier, e.g. https://stackoverflow.com/questions/15808017/py4j-dict-to-java-map
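A minimal sketch of that conversion, assuming a running PySpark session (the wrapper class name in the final comment is made up for illustration):

```python
from py4j.java_collections import MapConverter
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
gateway_client = spark.sparkContext._gateway._gateway_client

py_config = {"inputCol": "text", "outputCol": "tokens"}

# Explicit python dict -> java.util.Map conversion. Py4J can also do this
# automatically when the gateway is started with auto_convert=True.
java_config = MapConverter().convert(py_config, gateway_client)

# Hypothetical java wrapper class, for illustration only:
# spark._jvm.com.example.transformers.JavaTransformWrapper(java_config)
```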
We want to be able to use the same high-performance transform function from PySpark in addition to Java/Scala.
More on wrapping Scala for pyspark:
Per these posts, Py4J only runs on the driver (master), not on the worker nodes, so python can only orchestrate JVM calls from the driver; the per-row transform work has to live in the scala/java code shipped in the jar.
Wrapping the Scala code in a simple Java wrapper to make it easier to call from python is a common approach.
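For illustration, the python side of such a wrapper might look something like the sketch below; `com.example.transformers.JavaTransformWrapper` is a hypothetical class name, not anything in this repo:

```python
from py4j.java_collections import MapConverter
from pyspark.sql import DataFrame


def transform(df: DataFrame, config: dict) -> DataFrame:
    """Call the JVM-side transform through py4j and rewrap the result."""
    sc = df.sql_ctx._sc  # SparkContext behind the DataFrame
    jvm = sc._jvm
    java_config = MapConverter().convert(config, sc._gateway._gateway_client)

    # df._jdf is the underlying java DataFrame; the java wrapper returns
    # another java DataFrame, which we rewrap for the python caller.
    jdf = jvm.com.example.transformers.JavaTransformWrapper.transform(
        df._jdf, java_config
    )
    return DataFrame(jdf, df.sql_ctx)
```

This is the same pattern the sparkMeasure wrapper linked above uses: the python function only shuffles handles (`_jdf`, converted config) across the py4j gateway on the driver, while all the real work happens in the JVM.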