A distributed data processing framework in pure Haskell. Inspired by Apache Spark.
This package provides a Dataset type which lets you express and execute transformations on a distributed multiset. Its API is highly inspired by Apache Spark.
It uses pluggable Backends for spawning executors and ShuffleStores for exchanging information. See 'distributed-dataset-aws' for an implementation using AWS Lambda and S3.
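To make the idea of transformations on a partitioned multiset more concrete, here is a minimal, purely local sketch. It is not this package's API: Partitioned, pMap, pFilter, shuffleBy and pCollect are hypothetical names invented for illustration, and everything runs in a single process using only base. Each inner list stands in for a partition that an executor would process, and shuffleBy stands in for the exchange that a shuffle store would carry.

```haskell
module Main (main) where

import Data.List (sort)

-- | A purely local stand-in for a distributed multiset: each inner list
-- plays the role of one partition that an executor would process.
-- (Hypothetical type; not the library's Dataset.)
newtype Partitioned a = Partitioned [[a]]

-- | Element-wise transformation: each partition is handled independently,
-- so no data has to move between partitions.
pMap :: (a -> b) -> Partitioned a -> Partitioned b
pMap f (Partitioned ps) = Partitioned (map (map f) ps)

-- | Filtering is also partition-local.
pFilter :: (a -> Bool) -> Partitioned a -> Partitioned a
pFilter p (Partitioned ps) = Partitioned (map (filter p) ps)

-- | A shuffle: route every element to the partition chosen by its key, so
-- that equal keys end up in the same partition. The partition count n is
-- assumed to be positive.
shuffleBy :: Int -> (a -> Int) -> Partitioned a -> Partitioned a
shuffleBy n key (Partitioned ps) =
  Partitioned [[x | x <- concat ps, key x `mod` n == i] | i <- [0 .. n - 1]]

-- | Collect everything back as a multiset; element order carries no meaning.
pCollect :: Partitioned a -> [a]
pCollect (Partitioned ps) = concat ps

main :: IO ()
main = do
  let input = Partitioned [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10 :: Int]]
      evens = pFilter even (pMap (* 3) input)
      Partitioned shuffled = shuffleBy 2 (`div` 6) evens
  print (map sort shuffled)     -- prints [[12,24],[6,18,30]]
  print (sort (pCollect evens)) -- prints [6,12,18,24,30]
```

The map and filter steps never look outside their own partition, which is why they can run on separate executors without coordination; only the shuffle needs data to cross partition boundaries.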
It also exposes a more primitive Control.Distributed.Fork module which lets you run IO actions remotely. It is especially useful when your task is embarrassingly parallel.
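As a rough local analogue of running independent IO actions and collecting their results, the sketch below uses lightweight threads from base instead of remote executors. It does not use Control.Distributed.Fork: spawn, runAll and sumOfSquares are hypothetical helpers made up for this illustration, and the work stays inside one process.

```haskell
module Main (main) where

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, evaluate, try)

-- | Start an IO action on its own thread and return a handle that will
-- eventually hold either its result or the exception it raised.
-- (Hypothetical helper; a local stand-in for launching a remote executor.)
spawn :: IO a -> IO (MVar (Either SomeException a))
spawn action = do
  result <- newEmptyMVar
  _ <- forkIO (try action >>= putMVar result)
  pure result

-- | Run independent actions concurrently and wait for all of them.
-- Results come back in the order the actions were given.
runAll :: [IO a] -> IO [Either SomeException a]
runAll actions = mapM spawn actions >>= mapM takeMVar

-- | A made-up unit of work: no task needs anything from another task.
-- evaluate forces the sum on the worker thread instead of returning a thunk.
sumOfSquares :: Int -> IO Int
sumOfSquares n = evaluate (sum [k * k | k <- [1 .. n]])

main :: IO ()
main = do
  results <- runAll (map sumOfSquares [10, 100, 1000])
  print [r | Right r <- results] -- prints [385,338350,333833500]
```

Because the tasks share nothing, each one can fail or finish on its own; that independence is what makes a workload embarrassingly parallel and a good fit for remote execution.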
This package provides a backend for 'distributed-dataset' using AWS services. Currently it supports running functions on AWS Lambda and using an S3 bucket as a shuffle store.
Provides Datasets that read from public open datasets. Currently it can fetch GitHub event data from GH Archive.
Clone the repository.
$ git clone https://github.com/utdemir/distributed-dataset
$ cd distributed-dataset
Make sure that you have AWS credentials set up. The easiest way is to install the AWS command line interface and run:
$ aws configure
Create an S3 bucket to put the deployment artifact in. You can use the console or the CLI:
$ aws s3api create-bucket --bucket my-s3-bucket
Build and run the example:
If you use Nix on Linux:
(Recommended) Use my binary cache on Cachix to reduce compilation times:
$ nix-env -i cachix # or your preferred installation method
$ cachix use utdemir
Then:
$ nix run -f ./default.nix example-gh -c example-gh my-s3-bucket
If you use stack (requires Docker, works on Linux and macOS):
$ stack run --docker-mount $HOME/.aws/ --docker-env HOME=$HOME example-gh my-s3-bucket
Experimental. Expect lots of missing features, bugs, instability and API changes. You will probably need to modify the source if you want to do anything serious. See issues.
I am open to contributions; any issue, PR or opinion is more than welcome.
To develop distributed-dataset, you can use:
Nix, cabal-install or stack.
stack with docker.
nix-shell will drop you into a shell with ormolu, cabal-install and steeloverseer, alongside all required Haskell and system dependencies. You can use cabal new-* commands there.
Run sos at the top level directory inside of a nix-shell.
Make sure that you have Docker installed.
Use stack as usual; it will automatically use a Docker image.
Run ./make.sh stack-build before you send a PR to test different resolvers.