utdemir / distributed-dataset

A distributed data processing framework in Haskell.
BSD 3-Clause "New" or "Revised" License
114 stars 5 forks source link
aws-lambda data-processing distributed haskell spark

distributed-dataset

CI Status

A distributed data processing framework in pure Haskell. Inspired by Apache Spark.

Packages

distributed-dataset

This package provides a Dataset type which lets you express and execute transformations on a distributed multiset. Its API is highly inspired by Apache Spark.

It uses pluggable Backends for spawning executors and ShuffleStores for exchanging information. See 'distributed-dataset-aws' for an implementation using AWS Lambda and S3.

It also exposes a more primitive Control.Distributed.Fork module which lets you run IO actions remotely. It is especially useful when your task is embarrassingly parallel.

distributed-dataset-aws

This package provides a backend for 'distributed-dataset' using AWS services. Currently it supports running functions on AWS Lambda and using an S3 bucket as a shuffle store.

distributed-dataset-opendatasets

Provides Dataset's reading from public open datasets. Currently it can fetch GitHub event data from GH Archive.

Running the example

Stability

Experimental. Expect lots of missing features, bugs, instability and API changes. You will probably need to modify the source if you want to do anything serious. See issues.

Contributing

I am open to contributions; any issue, PR or opinion is more than welcome.

Nix

Stack

Related Work

Papers

Projects