vegas-viz / Vegas

The missing MatPlotLib for Scala + Spark
MIT License

Push aggregations down to Spark #117

Open jeremyrsmith opened 7 years ago

jeremyrsmith commented 7 years ago

When using withDataFrame, Vegas collects all the data to the driver, and falls back to sampling once the data exceeds a size threshold.

But when doing aggregations in your plot, this means it will fetch all the data to the driver – potentially sampling it – and push all of it to vega-lite, where the aggregation will happen in JavaScript in the browser. This is probably never what you want.

It would be totally possible to map AggOps to Spark aggregations, and push the aggregation itself down to Spark. This would reduce the cardinality of the data dramatically, and would probably eliminate the need to sample in most cases.
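To illustrate the mapping idea, here is a minimal sketch that translates an aggregation op into a Spark SQL expression string. The `AggOp` type, the `AggPushdown` object and its helper names are hypothetical and made up for this example; they are not Vegas's actual `AggOps` API, and only a handful of ops are covered. The sketch deliberately builds plain strings so it has no dependencies.

```scala
// Hypothetical sketch: AggOp and AggPushdown are illustrative names,
// not part of Vegas. Only a few common aggregations are covered.
object AggPushdown {
  sealed trait AggOp
  case object Mean  extends AggOp
  case object Sum   extends AggOp
  case object Min   extends AggOp
  case object Max   extends AggOp
  case object Count extends AggOp

  // Quote an identifier for Spark SQL; embedded backticks are doubled.
  def quote(name: String): String = "`" + name.replace("`", "``") + "`"

  // Translate an aggregation op on a field into a Spark SQL expression.
  def sparkSqlExpr(op: AggOp, field: String): String = {
    val fn = op match {
      case Mean  => "avg"
      case Sum   => "sum"
      case Min   => "min"
      case Max   => "max"
      case Count => "count"
    }
    s"$fn(${quote(field)})"
  }

  // Build a GROUP BY query over a registered view. The aggregate keeps the
  // original field name, so the plot can encode it as a plain field.
  def aggregationQuery(view: String, groupBy: String, op: AggOp, field: String): String =
    s"SELECT ${quote(groupBy)}, ${sparkSqlExpr(op, field)} AS ${quote(field)} " +
      s"FROM ${quote(view)} GROUP BY ${quote(groupBy)}"
}
```

Assuming the DataFrame has been registered with `df.createOrReplaceTempView("sales")`, the string returned by `AggPushdown.aggregationQuery("sales", "country", AggPushdown.Mean, "price")` could be passed to `spark.sql(...)`, and only the aggregated rows would need to be collected. The plot spec would then have to drop its own aggregate on that field, since the values arrive pre-aggregated.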

oshikiri commented 6 years ago

Thanks, @jeremyrsmith

> This is probably never what you want.

I agree. I think the default behaviour should be changed: it would be better to pass all the data to vega-lite by default. https://github.com/vegas-viz/Vegas/blob/1496432875f80e9e579cc584fa8fd299f34a71a6/spark/src/main/scala/vegas/sparkExt/package.scala#L8-L17