
Easier serialization for Spark #38

Closed · rjagerman closed this issue 8 years ago

rjagerman commented 8 years ago

It is not immediately obvious how to use the implicit execution context and timeout together with Spark. In particular, the following piece of code will not run:

import scala.concurrent.ExecutionContext
import scala.concurrent.duration._
import akka.util.Timeout

implicit val ec = ExecutionContext.Implicits.global
implicit val timeout = new Timeout(30 seconds)

// Fails: Spark tries to serialize ec and timeout along with this closure
rdd.foreach {
    case value => vector.push(Array(0L), Array(value))
}

Spark attempts to serialize both the execution context and the timeout along with the closure, which causes errors because these objects were never meant to be serialized. Instead, one has to write something like this:

rdd.foreach {
    case value =>
      // Declared inside the closure, so these are constructed on each
      // executor instead of being serialized from the driver
      implicit val ec = ExecutionContext.Implicits.global
      implicit val timeout = new Timeout(30 seconds)
      vector.push(Array(0L), Array(value))
}

To make this easier, it might be a good idea to either remove the implicits and replace them with configurable defaults, or to add new methods that do not require the implicits at all.
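
For concreteness, here is a minimal sketch of what the second option could look like; BigVector, Defaults and pushWithDefaults are illustrative names rather than Glint's actual API:

import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.duration._
import akka.util.Timeout

// Library-level defaults that could be made configurable
object Defaults {
  var executionContext: ExecutionContext = ExecutionContext.Implicits.global
  var timeout: Timeout = new Timeout(30 seconds)
}

trait BigVector[V] {
  // Current style: implicits required at every call site
  def push(keys: Array[Long], values: Array[V])
          (implicit ec: ExecutionContext, timeout: Timeout): Future[Boolean]

  // Proposed variant: no implicits needed, falls back to the defaults
  def pushWithDefaults(keys: Array[Long], values: Array[V]): Future[Boolean] =
    push(keys, values)(Defaults.executionContext, Defaults.timeout)
}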

rjagerman commented 8 years ago

The pull and push methods return a Scala Future. To attach callbacks to these futures we need access to the execution context regardless. Therefore I think it is best to keep the execution context in the call.
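
For example, attaching a callback like the one below only works with an execution context in scope (assuming vector is a Glint vector as in the snippets above):

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success}
import akka.util.Timeout

implicit val timeout = new Timeout(30 seconds)

// onComplete runs the callback on the implicit execution context
vector.push(Array(0L), Array(0.5)).onComplete {
  case Success(_) => println("push succeeded")
  case Failure(e) => println(s"push failed: ${e.getMessage}")
}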

To simplify the code a little, we can change the implicit timeout to a configurable property and remove it from the call API. The most recent version of the code uses a handshake protocol for push requests, so the timeout property is largely ignored there anyway.
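
As a rough sketch of the configurable-property idea (the glint.client.timeout key is an assumption, not an existing configuration option):

import java.util.concurrent.TimeUnit
import scala.concurrent.duration._
import akka.util.Timeout
import com.typesafe.config.ConfigFactory

// Read the timeout once from configuration instead of requiring an
// implicit Timeout at every call site (the key name is hypothetical)
val config = ConfigFactory.load()
val defaultTimeout = Timeout(
  config.getDuration("glint.client.timeout", TimeUnit.MILLISECONDS).millis)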

In the meantime, I've written a very simple example of serialization with Spark in the documentation, which can be run easily in the spark-shell: http://rjagerman.github.io/glint/gettingstarted/spark/