Push aggregations down to Spark

vegas-viz / Vegas

The missing MatPlotLib for Scala + Spark

MIT License


When using withDataFrame, Vegas collects all the data to the driver, and falls back to sampling once the data exceeds a threshold.

But when the plot itself does aggregations, this means all the data is fetched to the driver (and potentially sampled) and then pushed in full to vega-lite, where the aggregation happens in JavaScript in the browser. This is probably never what you want.

It would be totally possible to map AggOps to Spark aggregations, and push the aggregation itself down to Spark. This would reduce the cardinality of the data dramatically, since only one row per group has to leave the cluster, and would probably eliminate the need to sample in most cases.
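As a rough illustration of the idea, here is a minimal sketch of what such a mapping could look like. It is not Vegas code: `Agg`, `PushDown`, `toSparkColumn` and `preAggregate` are hypothetical names made up for this example, with `Agg` standing in for a handful of AggOps values. Only the Spark calls (`groupBy`, `agg`, `avg`, `sum`, `min`, `max`, `count`) are real API.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, count, max, min, sum}

// Hypothetical stand-in for a few of Vegas' AggOps values (not the library's own type).
sealed trait Agg
object Agg {
  case object Mean  extends Agg
  case object Sum   extends Agg
  case object Min   extends Agg
  case object Max   extends Agg
  case object Count extends Agg
}

// Hypothetical helper object; none of this exists in Vegas.
object PushDown {
  // Translate an aggregate op on `field` into the equivalent Spark column expression.
  def toSparkColumn(op: Agg, field: String): Column = op match {
    case Agg.Mean  => avg(col(field))
    case Agg.Sum   => sum(col(field))
    case Agg.Min   => min(col(field))
    case Agg.Max   => max(col(field))
    case Agg.Count => count(col(field)) // counts non-null values of `field`
  }

  // Run the aggregation in Spark. The result has one row per distinct value of
  // `groupField`, and the aggregated column keeps the name `valueField` so that
  // existing encodings can still refer to it.
  def preAggregate(df: DataFrame, groupField: String, valueField: String, op: Agg): DataFrame =
    df.groupBy(col(groupField)).agg(toSparkColumn(op, valueField).as(valueField))
}
```

With something like this in place, the pre-aggregated DataFrame could be handed to withDataFrame and the aggregate dropped from the encoding, so the browser only receives one row per group. A real implementation would also have to cover the remaining AggOps and plots that group by more than one field.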

Push aggregations down to Spark #117