This PR updates the planner to combine (fuse) parent datasets into their children in certain situations.
The motivation is to ensure that the root datasets in the server spec carry more of the transform pipeline than they did previously.
A simple motivating example is that Vega-Lite can sometimes generate Vega specs where the root dataset has no transforms and then a child dataset has a series of transforms. For example:
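(The PR's original spec is not reproduced here; the snippet below is an illustrative reconstruction of that shape, with placeholder dataset names, URL, and aggregate fields.)

```json
{
  "data": [
    {"name": "source_0", "url": "data/movies.json"},
    {
      "name": "data_0",
      "source": "source_0",
      "transform": [
        {
          "type": "aggregate",
          "groupby": ["Major_Genre"],
          "fields": ["IMDB_Rating"],
          "ops": ["mean"],
          "as": ["mean_rating"]
        }
      ]
    }
  ]
}
```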
Without these changes to the planner, VegaFusion will pull the entire source_0 dataset into the runtime's cache. Then the data_0 transforms are applied to the cached version of source_0. This breaks down when source_0 is larger than memory (e.g. it's a Snowflake table). In this case, it's much better to perform the aggregation before caching the result. This PR will convert the example above into:
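(Again illustrative rather than copied from the PR: after fusion, the child dataset inherits the parent's data source and the `source_0` entry disappears.)

```json
{
  "data": [
    {
      "name": "data_0",
      "url": "data/movies.json",
      "transform": [
        {
          "type": "aggregate",
          "groupby": ["Major_Genre"],
          "fields": ["IMDB_Rating"],
          "ops": ["mean"],
          "as": ["mean_rating"]
        }
      ]
    }
  ]
}
```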
The data_0 dataset now loads the source data and applies the aggregation immediately. This will work with larger-than-memory data (as long as the aggregated result fits in memory).
The current heuristic is to fuse everything up to and including datasets that have an aggregate transform. The rationale is that aggregate results are usually smaller than source data and will be reasonable to cache. This is an area that could be made more sophisticated in the future, but it should help a lot in the majority of cases.
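The actual planner lives in VegaFusion's Rust crates; the following is only a standalone Python sketch of the stopping rule, using plain dicts that mirror Vega `data` entries (this representation, and the helper name `fuse_chain`, are assumptions for illustration):

```python
def fuse_chain(datasets):
    """Fuse each parent dataset in a parent-to-child chain into its child
    until a dataset whose combined transform pipeline contains an
    "aggregate" has been absorbed, then stop fusing.

    Each dataset is a dict mirroring a Vega "data" entry: "name",
    optional "url"/"values", optional "source", optional "transform".
    """
    fused = []
    carry = None   # parent dataset waiting to be fused into its child
    stop = False   # set once an aggregate has been absorbed
    for ds in datasets:
        if stop:
            # Past the aggregate: leave remaining datasets untouched.
            fused.append(ds)
            continue
        if carry is not None:
            # Inherit the parent's data source and prepend its transforms;
            # the "source" reference is no longer needed after fusion.
            ds = {
                **{k: v for k, v in carry.items() if k != "name"},
                **ds,
                "transform": carry.get("transform", []) + ds.get("transform", []),
            }
            ds.pop("source", None)
            carry = None
        if any(t.get("type") == "aggregate" for t in ds.get("transform", [])):
            fused.append(ds)
            stop = True
        else:
            carry = ds
    if carry is not None and not stop:
        fused.append(carry)
    return fused
```

On the motivating example, `fuse_chain` collapses the transform-free root into the aggregating child, so only the (small) aggregate result would be cached.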
To make it possible to inspect the planner solution in Python, this PR adds a vf.runtime.build_pre_transform_spec_plan(vega_spec) method that returns the plan as a Python dict.