haskell-distributed / distributed-process

Cloud Haskell core libraries
http://haskell-distributed.github.io

Distinguish between Cloud Haskell and distributed haskell #339

Open hyperthunk opened 5 years ago

hyperthunk commented 5 years ago

See https://groups.google.com/forum/#!topic/parallel-haskell/Hr6vkpz3FB4 for background info.

The heart of this topic seems to be a question of whether we should distinguish between Haskell in the Cloud (e.g., running on some IaaS/PaaS engine and utilising orchestration technologies such as Kubernetes), and distributed Haskell (a set of libraries and primitives for building applications that communicate across process and/or machine boundaries).
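As a concrete illustration of the second sense, here is a minimal sketch using the existing distributed-process primitives. Transport and node setup are deliberately left abstract, since the network-transport-tcp constructor has changed between versions; the Process-level calls are the stable part.

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
import Control.Distributed.Process
import Control.Distributed.Process.Node (LocalNode, runProcess)

-- Two lightweight processes exchanging a message via send/expect.
pingPong :: Process ()
pingPong = do
  self <- getSelfPid
  -- Spawn a worker on the same node that echoes one message back.
  echo <- spawnLocal $ do
    (from, msg :: String) <- expect
    send from ("echo: " ++ msg)
  send echo (self, "hello" :: String)
  reply <- expect :: Process String
  say reply

-- Given a LocalNode built from any network-transport backend:
runOn :: LocalNode -> IO ()
runOn node = runProcess node pingPong
```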

hyperthunk commented 5 years ago

Some comments by Alexander Kjeldaas:

I'd like to also point out some counter-points to the idea that "Spark, but in Haskell" is a great idea and that it requires code shipping.

There is one glaring weakness in the Spark model of execution that is solved by separating code shipping from execution. Security is THE big issue with Spark.

Let's say you have a petabyte in your data lake and you give your engineers Spark as the tool to accomplish their tasks. You want to be able to optimize resource usage, so having a cluster to execute your analysis, jobs and what-have-you makes sense. The alternative is to use, for example, Amazon EMR with one cluster per employee - extremely inefficient resource-wise, as you can expect very low cluster utilization.

However, enter GDPR and other laws - you need to be able to document that you don't give out access to data needlessly. Now Spark starts to fail, because the computational model is "ship code". It is hard/slow/difficult to create a cluster from scratch, but that's what you need in order to enforce the security requirement that individual engineers have different access restrictions imposed on them.

Spark has no business enforcing access restrictions on external data. That's re-creating yet another security layer that needs to be managed. Rather, if creating a cluster-wide computation from fresh processes, allocated across a cluster by an existing cluster manager, is cheap, you get what you want - secure isolation of jobs.

The combination of not being able to enforce security policies on shared clusters, and not being able to dynamically increase/decrease the cluster size on a real cloud leads to very inefficient use of resources in Spark. What a sorting benchmark looks like in Spark doesn't matter at all compared to all the idle CPU cycles the wrong computational model entails.

You can say that these are orthogonal issues, but a good cluster agent library - and maybe a build system that integrates easily with it - able to quickly bootstrap a virtual cluster, scale it up and down based on what the app wants, and ship binaries, is a lot more useful in the real world, where security must be managed and audited.

In this kind of setup, shipping code becomes mostly irrelevant: there is always an existing agent that can create a VM, a job, a process or similar, and that is the agent we need to interact with seamlessly, rather than building a competing agent inside our own executable that receives and executes closures.
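As a purely illustrative sketch (the names and types here are hypothetical, not an existing library), the surface we would need to drive in such a setup could be as small as:

```haskell
-- Hypothetical interface, illustrative only; not an existing library API.
module ClusterAgent where

-- | A request to the external orchestrator: run this prebuilt image.
data JobSpec = JobSpec
  { jobImage    :: String   -- ^ container image holding the compiled binary
  , jobReplicas :: Int      -- ^ desired number of instances
  , jobArgs     :: [String] -- ^ arguments passed to the binary
  } deriving Show

-- | Opaque handle the orchestrator returns for a running job.
newtype JobHandle = JobHandle String deriving Show

-- | Operations a backend (e.g. a thin Kubernetes client) would provide.
-- The point is that the agent already exists; our executable only drives it,
-- instead of receiving and executing closures itself.
data ClusterAgent = ClusterAgent
  { launch   :: JobSpec -> IO JobHandle
  , scale    :: JobHandle -> Int -> IO ()
  , tearDown :: JobHandle -> IO ()
  }
```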

And further:

I feel that I might be derailing the discussion, and I am not presenting strong arguments against shipping closures. However, you have been discussing the "developer experience", and as a product feature, I don't think it's as important as great integration with orchestration systems. Assuming a great orchestration system is already available opens up new points in the design space, and then it might be possible to innovate instead of imitate.

Btw, another huge issue with Spark is the caching layer. As it doesn't provide proper security, the default mode of operation on Amazon, for example, is to ship data to and from S3, which is enormously inefficient compared to a secure caching layer. This is another performance hit taken because the architecture (IMO) is wrong.

This feels like a rather contrived situation, but I can appreciate what you're saying.

Some context - my experience is from an organization with maybe 40 subsidiaries, wildly varying in size, technical competencies and preferences, and thus with very different requirements - and those are always changing. There are legally multiple controllers (in a GDPR sense) in this environment, so enforcing compliance through policy and procedures is not an option. In this environment we chose Spark as our workhorse around 4 years back, and we still use it. It's not bad, but what looked very attractive initially turned out to require quite a lot of work to get basic "tenant" security and to optimize resource usage. Also, the turn-around time for ad-hoc use is slow enough to be out-competed by alternatives, basically because it assumes an existing cluster.

Well yes, you could just use https://github.com/weaveworks/weave and spin things up in a fresh environment, sizing as you please. Or a million other approaches.

Yes, and today container orchestration is part of the stack, just like TCP/IP or VMs. So it's beneficial to take a stand and make this explicit. Unlike when Akka, Spark, etc. were designed, today we can assume that some set of orchestration APIs similar to what Kubernetes provides is available. So, for example, instead of:

"Given a cluster of servers, our wonderful system can utilize them to the max by .. " it's more like: "Given an orchestration system, our wonderful system will automatically express priority, deployment preferences, latency requirements, security config and emit data and code containers that works well with your existing orchestration system using.. "

We can assume a lot more, so we should be able to simplify and simultaneously improve on the Spark model. Part of the assumption is that if the system is able to produce a container, then deployment and scaling are trivially done for us. We can assume that we can encapsulate data and have it be distributed along with our code, but also securely shared with other tenants as a caching layer. We can assume that the orchestration layer can atomically switch code for us so we don't need to be binary compatible, or not switch atomically but just ensure that we can discover all the components we need. For a REPL use-case (like a Spark REPL), we could think in terms of creating a container, deploying it, and tearing it down for every expression typed into the REPL (it should be under 3s of overhead at the 99th percentile on Kubernetes, for example). We could encapsulate data (like RDDs in Spark) as separate objects and gain superior caching and security, and use pod abstractions and local communication with simpler failure modes (I think this is part of what you suggested).

We can simplify by assuming an orchestration API. Don't distribute code - that's the job of the orchestration system; just produce containers quickly. Don't try to utilize the cluster to the fullest; try to express the intent and priority of the current task and let the external scheduler figure it out. Don't assume any minimum or maximum number of nodes, but make it possible to express desires and handle whatever allocation is given to you. Don't try to handle multiple tenants or security, but make it possible to express / export this info to the orchestrator. For caching, let the orchestration system handle access controls between the caching layer and the executing code.
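To make that division of labour concrete, a hypothetical (purely illustrative) shape for the information the application would export, rather than act on itself, might look like:

```haskell
-- Hypothetical and illustrative only: the application declares intent,
-- while the external scheduler decides placement, scaling and access control.
module DeploymentIntent where

data Priority = Batch | Interactive | Critical
  deriving (Show, Eq)

data Intent = Intent
  { priority       :: Priority     -- ^ how urgent this work is
  , minNodes       :: Maybe Int    -- ^ desires, not guarantees
  , maxNodes       :: Maybe Int
  , maxLatencyMs   :: Maybe Int    -- ^ latency requirement, if any
  , securityLabels :: [String]     -- ^ tenant/ACL info for the orchestrator
  } deriving Show

-- | Export the intent in a form the orchestration layer understands
-- (e.g. as labels or annotations); the app never enforces it itself.
exportIntent :: Intent -> [(String, String)]
exportIntent i =
  [ ("priority", show (priority i))
  , ("security", unwords (securityLabels i))
  ]
```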

We can actually assume something more specific than just any orchestration API, since Kubernetes won (https://twitter.com/redmonk/status/1000066226482221064). This also simplifies a lot.

This isn't just limited to the Spark use-case. If there's a solid framework for this sort of stuff, normal cabal or stack builds should likewise be accelerated using Cloud Haskell. I do most of my Haskell builds on a cloud VM, and I have a spare Kubernetes cluster lying around. If I don't, then I can build one in a minute or so on all major clouds. Given the rate at which Kubernetes offerings are being launched, in one more year there won't be anyone without some spare capacity in their Kubernetes cluster that could be useful for their day-to-day builds.

Again, this is a bit orthogonal to code shipping, but I think it's core to the update of Cloud Haskell.