vega / vegafusion

Serverside scaling for Vega and Altair visualizations
https://vegafusion.io
BSD 3-Clause "New" or "Revised" License

Add fuse_datasets planner configuration for combining datasets during planning #407

Closed jonmmease closed 11 months ago

jonmmease commented 11 months ago

This PR updates the planner to combine (fuse) parent datasets into their children in certain situations.

The motivation is to ensure that the root datasets in the server spec include more of the transform pipeline than before.

A simple motivating example is that Vega-Lite can sometimes generate Vega specs where the root dataset has no transforms and then a child dataset has a series of transforms. For example:

  "data": [
    {"name": "source_0", "url": "vegafusion+dataset://source"},
    {
      "name": "data_0",
      "source": "source_0",
      "transform": [
        {
          "type": "aggregate",
          "groupby": ["A"],
          "ops": ["average"],
          "fields": ["B"]
        }
      ]
    }
  ]

Without these changes to the planner, VegaFusion pulls the entire source_0 dataset into the runtime's cache and then applies the data_0 transforms to the cached copy. This breaks down when source_0 is larger than memory (e.g. it's a Snowflake table). In that case, it's much better to perform the aggregation before caching the result. This PR converts the example above into:

  "data": [
    {"name": "source_0"},
    {
      "name": "data_0",
      "url": "vegafusion+dataset://source",
      "transform": [
        {
          "type": "aggregate",
          "groupby": ["A"],
          "ops": ["average"],
          "fields": ["B"]
        }
      ]
    }
  ]

The data_0 dataset now loads the source data and applies the aggregation immediately. This will work with larger-than-memory data (as long as the aggregated result fits in memory).

The current heuristic is to fuse everything up to and including datasets that have an aggregate transform. The rationale is that aggregate results are usually smaller than the source data and will be reasonable to cache. This is an area that could be made more sophisticated in the future, but it should help a lot in the majority of cases.
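
To make the heuristic concrete, here is a minimal Python sketch of the idea, operating on the "data" array of a Vega spec. It is an illustration only, not the planner code from this PR: the function names `has_aggregate` and `fuse_datasets` are hypothetical, only url-backed parents are handled, and it assumes nothing else in the spec reads the absorbed parent directly. If a parent has several children, this sketch simply copies its pipeline into each of them.

```python
from copy import deepcopy


def has_aggregate(dataset):
    """True if the dataset's own pipeline contains an aggregate transform."""
    return any(t.get("type") == "aggregate" for t in dataset.get("transform", []))


def fuse_datasets(datasets):
    """Fold url-backed parents into their children, stopping after an aggregate.

    Hypothetical helper for illustration; datasets are assumed to be listed
    with each parent before the children that reference it.
    """
    fused = {}        # name -> dataset with its parent's pipeline folded in
    absorbed = set()  # names of parents whose pipeline moved into a child
    for original in datasets:
        ds = deepcopy(original)
        source = ds.get("source")
        parent = fused.get(source) if isinstance(source, str) else None
        # Fuse only while the parent has not aggregated yet: a dataset with an
        # aggregate is the last one fused, and its result is what gets cached.
        if parent is not None and "url" in parent and not has_aggregate(parent):
            del ds["source"]
            ds["url"] = parent["url"]
            transforms = deepcopy(parent.get("transform", [])) + ds.get("transform", [])
            if transforms:
                ds["transform"] = transforms
            absorbed.add(parent["name"])
        fused[ds["name"]] = ds
    # Absorbed parents are left behind as empty stubs, as in the example above.
    return [{"name": name} if name in absorbed else ds for name, ds in fused.items()]


example = [
    {"name": "source_0", "url": "vegafusion+dataset://source"},
    {
        "name": "data_0",
        "source": "source_0",
        "transform": [
            {"type": "aggregate", "groupby": ["A"], "ops": ["average"], "fields": ["B"]}
        ],
    },
    {
        "name": "data_1",
        "source": "data_0",
        "transform": [{"type": "filter", "expr": "datum.average_B > 0"}],
    },
]
planned = fuse_datasets(example)
```

In this sketch, source_0 is fused into data_0 because source_0 has no aggregate, while data_1 keeps its "source" reference because data_0 already aggregates, so its (smaller) result is the one left to cache.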

To make it possible to inspect the planner solution in Python, this PR adds a vf.runtime.build_pre_transform_spec_plan(vega_spec) method that returns the plan as a Python dict.
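
A small sketch of how the returned plan might be inspected follows. The helper name `dump_plan` is hypothetical, and because this description only states that the plan is a Python dict, the sketch makes no assumptions about its keys: it just pretty-prints whatever comes back. The planner entry point is passed in as a callable so the helper can be exercised without VegaFusion installed.

```python
import json
from typing import Callable


def dump_plan(build_plan: Callable[[dict], dict], vega_spec: dict) -> str:
    """Build a plan for vega_spec and return it as indented JSON text.

    build_plan is expected to be something like
    vf.runtime.build_pre_transform_spec_plan (the method added in this PR).
    """
    plan = build_plan(vega_spec)
    # default=str keeps the dump from failing on any non-JSON value in the plan.
    return json.dumps(plan, indent=2, default=str)


# With VegaFusion installed, usage would presumably look like:
#   import vegafusion as vf
#   print(dump_plan(vf.runtime.build_pre_transform_spec_plan, vega_spec))
```

Reading the dumped plan is a quick way to confirm that the root datasets in the server spec now carry the fused transforms.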