utdemir / distributed-dataset

A distributed data processing framework in Haskell.
BSD 3-Clause "New" or "Revised" License

YARN Backend #15

Open utdemir opened 5 years ago

utdemir commented 5 years ago

YARN is the most common way to schedule Spark & Hadoop on a cluster.

Supporting it as an executor will enable us to run side-by-side with existing data processing pipelines.

utdemir commented 5 years ago

I spent a bit of time experimenting today. My first idea was to use inline-java to interface directly with the JVM. However, it turns out that this adds considerable complexity to the build process.

I've decided on a simpler approach: a wrapper Java application responsible for interfacing with YARN and communicating with the Haskell executable. Since we only need one type of message (spawn an executor and return the result), I believe the interface between Java and Haskell will be quite small. Initially, I will probably create a simple protocol using UNIX pipes.
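To make the pipe idea a bit more concrete, here is a rough sketch of what the Haskell side of such a protocol might look like. Nothing here is decided: the framing (a 4-byte big-endian length followed by the payload) is only an assumption for illustration, and the names `encodeFrame`, `decodeFrame`, `sendFrame` and `recvFrame` are hypothetical, not part of distributed-dataset.

```haskell
module Main where

import Data.Bits (shiftL, (.|.))
import qualified Data.ByteString as BS
import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL
import System.IO (Handle, hFlush)

-- Assumed framing (not from the issue): 4-byte big-endian length, then payload.

-- | Interpret a header as a big-endian unsigned integer.
decodeLen :: BS.ByteString -> Int
decodeLen = BS.foldl' (\acc b -> (acc `shiftL` 8) .|. fromIntegral b) 0

-- | Prefix a payload with its length.
encodeFrame :: BS.ByteString -> BS.ByteString
encodeFrame payload =
  BL.toStrict . B.toLazyByteString $
    B.word32BE (fromIntegral (BS.length payload)) <> B.byteString payload

-- | Split one complete frame off the front of a buffer.
-- Returns the payload and the remaining bytes, or Nothing if incomplete.
decodeFrame :: BS.ByteString -> Maybe (BS.ByteString, BS.ByteString)
decodeFrame buf
  | BS.length buf < 4 = Nothing
  | BS.length body < len = Nothing
  | otherwise = Just (BS.splitAt len body)
  where
    (hdr, body) = BS.splitAt 4 buf
    len = decodeLen hdr

-- | Write one frame to a pipe handle and flush it.
sendFrame :: Handle -> BS.ByteString -> IO ()
sendFrame h payload = BS.hPut h (encodeFrame payload) >> hFlush h

-- | Read one frame from a pipe handle; Nothing on EOF or a truncated frame.
recvFrame :: Handle -> IO (Maybe BS.ByteString)
recvFrame h = do
  hdr <- BS.hGet h 4
  if BS.length hdr < 4
    then pure Nothing
    else do
      let len = decodeLen hdr
      body <- BS.hGet h len
      pure (if BS.length body == len then Just body else Nothing)

-- | Round-trip a placeholder payload through the pure encoder and decoder.
main :: IO ()
main = print (decodeFrame (encodeFrame (BC.pack "spawn-executor")))
```

With this framing, the Java wrapper could read requests with `DataInputStream.readInt` followed by `readFully`, since `readInt` also expects a big-endian 32-bit integer. The payload format itself (what "spawn an executor" and its result look like on the wire) is left open here.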