tweag / sparkle

Haskell on Apache Spark.
BSD 3-Clause "New" or "Revised" License
447 stars 30 forks

Anonymous classes in inline code are loaded too late for serialization #104

Closed facundominguez closed 7 years ago

facundominguez commented 7 years ago
[java| $rdd.map(new Function<Object,Object>() { Object call(Object x){ return x;} }) |]

doesn't work on multi-node setups.

The problem is that an executor receives the serialized object new Function<Object,Object>() { Object call(Object x){ return x;} }, and in order to deserialize it, it needs to load the anonymous class to which it belongs.

The executor then notices that no jar and no class in the classpath contains the class definition and therefore it fails. Where is the class then? Currently inline-java embeds the bytecode in the Haskell executable. The embedded bytecode is sent to the JVM at runtime by Language.Java.Inline.loadJavaWrappers. But this function is never called on executors.
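The failure mode can be reproduced without Spark. The sketch below (all names, such as `MissingClassDemo` and `Stub`, are hypothetical stand-ins for the class inline-java generates) serializes an object and then deserializes it through a stream that, like an executor without the embedded bytecode, cannot resolve the class by name:

```java
import java.io.*;

public class MissingClassDemo {
    // Stand-in for the anonymous class generated by inline-java.
    static class Stub implements Serializable {
        int x = 42;
    }

    // Deserialize with a stream that, like a Spark executor missing the
    // embedded bytecode, cannot resolve the stub class by name.
    static boolean failsWithoutClass(byte[] payload) throws IOException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(payload)) {
            @Override
            protected Class<?> resolveClass(ObjectStreamClass desc)
                    throws IOException, ClassNotFoundException {
                if (desc.getName().contains("Stub"))
                    throw new ClassNotFoundException(desc.getName());
                return super.resolveClass(desc);
            }
        }) {
            in.readObject();
            return false; // deserialization unexpectedly succeeded
        } catch (ClassNotFoundException e) {
            return true; // the same failure mode the executor hits
        }
    }

    public static void main(String[] args) throws Exception {
        // The driver serializes the closure...
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(new Stub());
        }
        // ...and the executor fails to deserialize it.
        System.out.println(failsWithoutClass(bos.toByteArray()));
    }
}
```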

The ideal fix would be for spark to provide some startup hooks, so Language.Java.Inline.loadJavaWrappers can be called when the executor starts. But this feature is not implemented.

Calling loadJavaWrappers when sparkle loads is no good, because upon receiving the serialized object, the executor has no clue that it needs to load sparkle in order to have the class defined.

The only workaround I've found so far is to dump the .class files that inline-java produces into a folder and add them to the sparkle application jar. https://github.com/tweag/inline-java/issues/62

Any preferences on how to better deal with this?

mboes commented 7 years ago

Note that this problem isn't specific to inline-java: it's a problem for any use of JNI's defineClass, which Spark currently provides no way of performing preemptively at initialization time. It sounds to me like this is an upstream issue, which we could work around in various ways in inline-java for that particular special case.

facundominguez commented 7 years ago

> Which we could work around in various ways in inline-java for that particular special case.

Indeed, any preference?

mboes commented 7 years ago

Since, as noted above, this is an upstream issue with Spark itself, I have a preference for keeping any workaround in sparkle. We could remove the workaround once the ticket you mention above is resolved. Inline code is just stubs, and these stubs are best kept in the executable itself. No one other than the executable should see these stubs, nor should they be able to call them. And that way we don't need to parameterize the java QQ with a gazillion (aka 1-3) options whose combinations are hard to test exhaustively.

The JIRA ticket you mention includes comments from several folks who successfully hooked into JavaSerializer. We could call loadJavaWrappers once (or all the time), from the serializer. Did the "epic struggle" you mention in inline-java#62 include that already?

facundominguez commented 7 years ago

I didn't try it. Apparently we need to:

  1. Extend org.apache.spark.serializer.Serializer with a wrapper that loads sparkle in a static block and forwards calls to the appropriate serializer.
  2. Set our instance with sparkConf.set("spark.serializer", "our.serializer.class.name").
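A minimal, runnable sketch of step 1 follows. The `Ser` interface below is only a stand-in for Spark's serializer API so the pattern can run here, and names like `SparkleSerializer` and `wrappersLoaded` are hypothetical; the point is that the static block runs when the executor first loads the wrapper class, i.e. before any deserialization goes through it:

```java
import java.io.*;

// Stand-in for Spark's serializer API, so the sketch is self-contained.
interface Ser {
    byte[] serialize(Object o) throws IOException;
    Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException;
}

// Plain Java serialization, standing in for Spark's JavaSerializer.
class PlainJavaSer implements Ser {
    public byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) { out.writeObject(o); }
        return bos.toByteArray();
    }
    public Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }
}

public class SparkleSerializer implements Ser {
    static volatile boolean wrappersLoaded = false;

    // The static block runs once, when the class is first loaded. In sparkle
    // this is where loadJavaWrappers would be invoked (via JNI); a flag
    // stands in for that side effect here.
    static {
        wrappersLoaded = true;
    }

    private final Ser delegate = new PlainJavaSer();

    public byte[] serialize(Object o) throws IOException {
        return delegate.serialize(o);
    }
    public Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        return delegate.deserialize(bytes);
    }

    public static void main(String[] args) throws Exception {
        SparkleSerializer s = new SparkleSerializer();
        System.out.println(wrappersLoaded + " " + s.deserialize(s.serialize("closure")));
    }
}
```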

I don't see how this can be parameterized by the serializer that spark currently uses. We might have to define a different wrapper for each serializer we ever want to use.
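One possible way around the one-wrapper-per-serializer problem, sketched under the assumption that the delegate has a reachable no-argument constructor (the real Spark serializers may require constructor arguments, so this is only the shape of the idea, and the config key named below is made up):

```java
// Hypothetical sketch: rather than writing one wrapper per serializer, the
// wrapper could pick its delegate reflectively from a class name, which a
// real version might read from a config key such as (made-up)
// "sparkle.delegate.serializer".
public class DelegateFactory {
    public static Object newDelegate(String className) throws Exception {
        // Requires the delegate class to have a public no-argument constructor.
        return Class.forName(className).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        Object d = newDelegate("java.util.ArrayList");
        System.out.println(d.getClass().getName());
    }
}
```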

facundominguez commented 7 years ago

This is not quite as usable as sparkle users would need: when the classes that loadJavaWrappers loads depend on classes in sparkle.jar, the class loader can't find them at the point where loadJavaWrappers is invoked in InlineJavaRegistrator.java.

mboes commented 7 years ago

Could we please not unearth old issues from the dead and instead create a new one?