tschijnmo / drudge

CAS based on sympy focusing on tensor and noncommutative algebras
https://tschijnmo.github.io/drudge
MIT License

Replace Spark with Dask #15

Open chenpeizhi opened 4 years ago

chenpeizhi commented 4 years ago

The current version of Drudge uses Spark (PySpark) for parallelism. Despite the computational speed-up it brings, the dependence on Spark adds an extra layer of complexity to developing, maintaining, and using Drudge. A typical problem during Drudge development is that the program crashes inside Spark code and throws a Scala/Java error message, which is hard for a Python developer to understand. Our current workaround is to use dummy_spark for debugging and pyspark for production. However, this two-step approach turns out to be difficult for large-scale problems. Moreover, depending on a non-Python library complicates deployment.

An alternative to Spark is the Dask library. Dask is implemented in pure Python and integrates well with other scientific/numeric/HPC Python libraries. It works on single workstations as well as on clusters. A Dask backend would make it easier to debug and profile Drudge code without sacrificing performance. For the implementation, Dask collections such as dask.bag (as a replacement for the Spark RDD) or dask.delayed may be a good place to start.
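To make the dask.bag idea concrete, here is a minimal sketch of an RDD-style pipeline written against dask.bag. It is not Drudge code: `expand_term`, `run`, and the `(coefficient, label)` tuples are hypothetical stand-ins for whatever per-term work Drudge currently does on an RDD. The point it illustrates is that the same pipeline can run on the single-threaded `"synchronous"` scheduler for debugging and on a parallel scheduler otherwise, which might replace the dummy_spark/pyspark split.

```python
import dask.bag as db


def expand_term(term):
    """Hypothetical per-term operation: turn one term into two.

    A term is modeled as a (coefficient, label) tuple purely for
    illustration; real Drudge terms are richer objects.
    """
    coeff, label = term
    return [(coeff, label + "a"), (-coeff, label + "b")]


def run(terms, scheduler):
    """Run the same pipeline under the given Dask scheduler."""
    bag = db.from_sequence(terms, npartitions=2)
    # map + flatten plays the role of rdd.flatMap in Spark.
    expanded = bag.map(expand_term).flatten()
    # Drop terms whose coefficient vanished, like rdd.filter.
    nonzero = expanded.filter(lambda t: t[0] != 0)
    return nonzero.compute(scheduler=scheduler)


terms = [(1, "x"), (0, "y"), (2, "z")]

# Single-threaded, in-process execution: exceptions surface as ordinary
# Python tracebacks and pdb works as usual.
debug_result = run(terms, "synchronous")

# Same graph on a thread pool; no code changes besides the scheduler name.
parallel_result = run(terms, "threads")

print(sorted(debug_result))
```

Because the scheduler is chosen at `compute` time, the debugging and production paths share one code base rather than two libraries.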

Gaurav and I have talked about switching from Spark to Dask. We may start experimenting when time allows.