Replace Spark with Dask

The current version of Drudge uses Spark (PySpark) for parallelism. Despite the computational speed-up it brings, the dependence on Spark adds an extra layer of complexity for developing, maintaining, and using Drudge. Typically encountered during Drudge development is that the program crashes inside some Spark code while throwing some Scala/Java error message, which is hard to understand for a Python developer. Our current workaround is to use dummy_spark for debugging and pyspark for production. However, this two-step approach turns out to be difficult for large-scale problems. Moreover, a non-Python library complicates the deployment.

An alternative to Spark is the Dask library. Dask is implemented in pure Python and integrates well with other scientific/numeric/HPC Python libraries. It works on single workstations as well as clusters. Having a Dask backend makes it easier to debug and profile Drudge codes without sacrificing performance. For implementation, Dask collections such as dask.bag (as a replacement for Spark RDD) or dask.delayed may be a good place to start.

Gaurav and I have talked about switching from Spark to Dask. We may start experimenting when time allows.

tschijnmo / drudge

Replace Spark with Dask #15