gorillalabs / sparkling

A Clojure library for Apache Spark: fast, fully-featured, and developer friendly
https://gorillalabs.github.io/sparkling/
Eclipse Public License 1.0

Reloaded workflow #36

Open kovasb opened 8 years ago

kovasb commented 8 years ago

I have some ideas for setting up sparkling to work in a "reloaded" style but I don't have time to track down the issues right now. Maybe someone has the answers here.

The desired workflow is: edit your source code locally, and the new code is automatically reloaded on the driver and the workers. To me this is a much more sensible approach than trying to define things over nREPL on each worker.

From poking around, AOT is not strictly necessary for function serialization. If the Clojure namespaces are loaded, the corresponding classes are created and will serialize and deserialize as expected. (Currently this fails for records, but I think that just needs more Kryo definitions.) You can see this just by defining an eval fn, mapping it across quoted defns on the workers, eval'ing the defn, and using it as another spark op in the driver. I haven't tried pomegranate yet; hopefully we can even fetch dependencies dynamically.
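A minimal sketch of that experiment, assuming a local SparkContext built with sparkling and the usual Kryo registration already in place; `remote-inc` is a made-up example fn, not part of sparkling:

```clojure
(require '[sparkling.conf :as conf]
         '[sparkling.core :as spark])

;; assumed setup: any existing SparkContext works, local[*] here for illustration
(def sc (-> (conf/spark-conf)
            (conf/master "local[*]")
            (conf/app-name "reload-experiment")
            spark/spark-context))

;; a quoted defn, shipped to the workers as plain data
(def quoted-fn '(defn remote-inc [x] (inc x)))

;; eval it on the workers; no AOT of remote-inc involved, count just forces evaluation
(->> (spark/parallelize sc [quoted-fn])
     (spark/map eval)
     spark/count)

;; eval the same defn on the driver, then use it as a further spark op
;; (in a real cluster every executor would need to have run the eval step,
;;  and the namespace the defn lands in has to line up on both sides)
(eval quoted-fn)

(->> (spark/parallelize sc (range 10))
     (spark/map remote-inc)
     spark/collect)
```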

Given this works, it should be possible to achieve the reloaded workflow. The steps would be:

  1. Detect changes in the local file system and produce a jar of just the src files
  2. Put that jar in the distributed cache
  3. Invoke a reload fn on each executor using map-partitions and collect'ing (put some logic in so it only happens once per execution; see the sketch after this list)
  4. The reload fn can use tools.namespace to figure out what to reload (this might require some poking around in tools.namespace)
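A rough, hypothetical sketch of steps 3 and 4. Nothing here is sparkling API: `reload-namespaces!`, `reload-on-executors!`, the jar-id bookkeeping, and the explicit num-partitions arity of parallelize are all assumptions, and this namespace would itself need to be in the originally submitted jar so the guard atom lives once per executor JVM instead of being serialized along with the closure:

```clojure
(require '[sparkling.core :as spark])

;; per-JVM guard so each executor reloads a given jar at most once
(defonce reloaded (atom #{}))

(defn reload-namespaces!
  "Re-require the given namespaces from the freshly shipped sources."
  [jar-id namespaces]
  (when-not (@reloaded jar-id)
    (swap! reloaded conj jar-id)
    (doseq [n namespaces]
      (require n :reload))))

(defn reload-on-executors!
  "Spread the reload over many partitions and force evaluation, so that with
   high probability every executor runs at least one of the tasks."
  [sc jar-id namespaces num-partitions]
  (->> (spark/parallelize sc (range num-partitions) num-partitions)
       (spark/map (fn [_] (reload-namespaces! jar-id namespaces) jar-id))
       spark/collect))
```

Using tools.namespace's dependency tracker instead of a fixed namespace list would be the nicer version of `reload-namespaces!`, as step 4 suggests.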

Presumably one would want to have separate namespaces for the driver and the workers, and set up some kind of component at each one.
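On the driver side, a minimal sketch of such a component, assuming Stuart Sierra's com.stuartsierra/component library (not something sparkling provides), so that a tools.namespace stop/refresh/start cycle can tear the SparkContext down and bring it back up cleanly:

```clojure
(require '[com.stuartsierra.component :as component]
         '[sparkling.conf :as conf]
         '[sparkling.core :as spark])

(defrecord SparkDriver [master app-name sc]
  component/Lifecycle
  (start [this]
    (if sc
      this     ;; already running
      (assoc this :sc (-> (conf/spark-conf)
                          (conf/master master)
                          (conf/app-name app-name)
                          spark/spark-context))))
  (stop [this]
    (when sc
      (.stop ^org.apache.spark.api.java.JavaSparkContext sc))
    (assoc this :sc nil)))

(defn new-spark-driver [master app-name]
  (map->SparkDriver {:master master :app-name app-name}))
```

The worker-side counterpart would presumably wrap the reload machinery from the steps above rather than owning a context of its own.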

chrisbetz commented 8 years ago

see #26

bowbahdoe commented 5 years ago

Updates?