Open jeremyrsmith opened 7 years ago
Thanks, @jeremyrsmith
> This is probably never what you want.
I agree with that. I think the default behaviour should be changed: it would be better to pass all the data to vega-lite by default. https://github.com/vegas-viz/Vegas/blob/1496432875f80e9e579cc584fa8fd299f34a71a6/spark/src/main/scala/vegas/sparkExt/package.scala#L8-L17
When using `withDataFrame`, Vegas collects all the data, or samples it instead once it exceeds a threshold. But when you do aggregations in your plot, this means it will fetch all the data to the driver (potentially sampling it) and push all of it to vega-lite, where the aggregation will happen in JavaScript in the browser. This is probably never what you want.
It would be totally possible to map `AggOps` to Spark aggregations, and push the aggregation itself down to Spark. This would reduce the cardinality of the data dramatically, and would probably eliminate the need to sample in most cases.
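
To make the idea concrete, here is a rough sketch of what such a mapping could look like. It is not Vegas code: the `Agg` type is a hypothetical stand-in for `AggOps`, and `AggPushdown`, `toSpark` and `aggregate` are names made up for illustration. Only the Spark calls (`groupBy`, `agg`, and the functions in `org.apache.spark.sql.functions`) are real API.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{avg, col, count, max, min, sum}

// Hypothetical stand-in for Vegas's AggOps; not the library's actual type.
sealed trait Agg
object Agg {
  case object Count extends Agg
  case object Sum   extends Agg
  case object Mean  extends Agg
  case object Min   extends Agg
  case object Max   extends Agg
}

object AggPushdown {
  // Translate an aggregate op applied to `field` into a Spark Column expression.
  def toSpark(op: Agg, field: String): Column = op match {
    case Agg.Count => count(col(field))
    case Agg.Sum   => sum(col(field))
    case Agg.Mean  => avg(col(field))
    case Agg.Min   => min(col(field))
    case Agg.Max   => max(col(field))
  }

  // Group by the dimension and aggregate the measure inside Spark,
  // so only one row per group has to reach the driver.
  // The aggregated column keeps the name of the original field.
  def aggregate(df: DataFrame, groupField: String, op: Agg, field: String): DataFrame =
    df.groupBy(col(groupField)).agg(toSpark(op, field).as(field))
}
```

With something like this, a call such as `AggPushdown.aggregate(df, "country", Agg.Mean, "population")` (column names assumed) would return one row per country, and that small result could then be handed to `withDataFrame` with a plain, non-aggregated encoding. Aggregate ops without a direct Spark counterpart would need separate handling.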