vega / vegafusion

Serverside scaling for Vega and Altair visualizations
https://vegafusion.io
BSD 3-Clause "New" or "Revised" License

Announcement blog posts #12

jonmmease closed 2 years ago

jonmmease commented 2 years ago

We'll want to publish a pretty thorough blog post (or more likely a series of blog posts) announcing the project.

Section 1 - tl;dr

There should be a tl;dr section with instructions for how to install and activate VegaFusion in JupyterLab, along with a pretty GIF. This is targeting a user who just wants to copy and paste and see if it works before investing any more effort into understanding why they should care about the project.

Section 2 - Why a Python Data Scientist should care

Then there should be a section explaining why a Python Data Scientist should care about VegaFusion. This would include some brief background on Vega/Vega-Lite and Altair. This would emphasize the inclusion of transforms in the grammar, and how these enable automatic interactive workflows like linked brushing on histograms.

Then include some good diagrams explaining that, with Vega.js, all of the transforms are performed in the browser, so all of the raw data must be sent to the browser (either inline in the spec or loaded from a URL).
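To make the data-transfer cost concrete, here is a rough Python sketch using only the standard library. The field name `delay`, the row count, and the bin width are made up for illustration. It compares the JSON size of shipping raw rows to the browser with the size of shipping pre-binned counts, which is roughly what a bin plus count-aggregate transform produces.

```python
import json
import random

random.seed(0)

# Hypothetical raw dataset: one numeric field, 100,000 rows.
rows = [{"delay": random.randint(0, 199)} for _ in range(100_000)]


def binned_counts(rows, field, step):
    """Histogram the rows the way a bin + aggregate(count) transform would."""
    counts = {}
    for row in rows:
        start = (row[field] // step) * step
        counts[start] = counts.get(start, 0) + 1
    return [
        {"bin_start": start, "bin_end": start + step, "count": n}
        for start, n in sorted(counts.items())
    ]


hist = binned_counts(rows, "delay", step=20)

# What the browser has to receive in each case.
raw_bytes = len(json.dumps(rows))
hist_bytes = len(json.dumps(hist))

print(f"raw rows:  {len(rows):>7} records, {raw_bytes:>9} bytes")
print(f"histogram: {len(hist):>7} records, {hist_bytes:>9} bytes")
```

The raw payload grows with the number of rows, while the histogram payload depends only on the number of bins.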

The magic of VegaFusion is that it automatically extracts as much data processing work as possible and performs it on the server, while still supporting full interactivity.

To use VegaFusion, this is all you need to know. But read on to learn more about how it all works.

Section 3 - How the current system works

Explain that VegaFusion is built in Rust on top of Arrow and DataFusion. The Vega expression language is compiled into the DataFusion expression language, and Vega transforms are compiled into DataFusion queries. The extensibility of DataFusion is used to add support for custom Vega expression and aggregation functions.
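As a toy illustration of the expression-compilation idea, the Python sketch below walks an ESTree-style AST (the shape a JavaScript-style expression parser produces) and emits a SQL-like string. This is not the actual compiler, which is written in Rust and targets DataFusion's expression types. The `to_sql` and `datum` helpers, the operator table, and the field names are all hypothetical, and only a small subset of the language is handled.

```python
# Map of JavaScript-style binary/logical operators to SQL-like operators.
BINARY_OPS = {
    "==": "=", "===": "=", "!=": "<>", "!==": "<>",
    "<": "<", "<=": "<=", ">": ">", ">=": ">=",
    "+": "+", "-": "-", "*": "*", "/": "/",
    "&&": "AND", "||": "OR",
}


def to_sql(node):
    """Translate a small ESTree-style expression AST into a SQL-like string."""
    kind = node["type"]
    if kind == "Literal":
        value = node["value"]
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            return "TRUE" if value else "FALSE"
        if isinstance(value, str):
            return "'" + value.replace("'", "''") + "'"
        return str(value)
    if kind == "MemberExpression":
        # Only `datum.<field>` is handled: a reference to a column.
        if node["object"].get("name") != "datum":
            raise ValueError("only datum.<field> member access is supported")
        return '"' + node["property"]["name"] + '"'
    if kind in ("BinaryExpression", "LogicalExpression"):
        op = BINARY_OPS[node["operator"]]
        return f"({to_sql(node['left'])} {op} {to_sql(node['right'])})"
    raise ValueError(f"unsupported node type: {kind}")


def datum(field):
    """Build the AST node for `datum.<field>`."""
    return {
        "type": "MemberExpression",
        "object": {"type": "Identifier", "name": "datum"},
        "property": {"type": "Identifier", "name": field},
    }


# AST for the expression: datum.delay > 30 && datum.origin == 'SEA'
ast = {
    "type": "LogicalExpression",
    "operator": "&&",
    "left": {
        "type": "BinaryExpression",
        "operator": ">",
        "left": datum("delay"),
        "right": {"type": "Literal", "value": 30},
    },
    "right": {
        "type": "BinaryExpression",
        "operator": "==",
        "left": datum("origin"),
        "right": {"type": "Literal", "value": "SEA"},
    },
}

sql = to_sql(ast)
print(sql)  # (("delay" > 30) AND ("origin" = 'SEA'))
```

A real translation also has to deal with the places where JavaScript and SQL semantics differ (for example `+` on strings, or null handling), which this sketch ignores.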

A planning stage parses the original Vega spec and identifies the signal and data dependencies. The spec is then split into two valid specs: one that runs on the server using the runtime built on DataFusion, and one that runs in the browser using the standard Vega.js library. The planning phase also produces a communication plan, which includes all of the signals and datasets that need to be transferred between the two specifications.
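The splitting step can be sketched in a few lines of Python. This is a simplified illustration, not the real planner: the `SERVER_SUPPORTED` set, the flattened dataset format, and the assumption that every signal originates in the browser are all hypothetical. A dataset stays on the server only if all of its transforms are supported there and its source is also on the server.

```python
# Hypothetical set of transform types the server runtime can execute.
SERVER_SUPPORTED = {"filter", "bin", "aggregate", "formula"}


def plan(datasets, mark_sources):
    """Split datasets into server/client groups and derive a communication plan.

    datasets: list of dicts with "name", optional "source", and "transforms"
              (each transform is {"type": ..., "signals": [...]}), listed in
              dependency order.
    mark_sources: names of datasets that marks read from (always in the browser).
    """
    server, client = [], []
    for ds in datasets:
        source = ds.get("source")
        supported = all(t["type"] in SERVER_SUPPORTED for t in ds["transforms"])
        source_on_server = source is None or source in server
        (server if supported and source_on_server else client).append(ds["name"])

    by_name = {ds["name"]: ds for ds in datasets}
    # Server datasets read by a client dataset or by a mark go to the browser.
    client_reads = set(mark_sources) | {by_name[n].get("source") for n in client}
    server_to_client = [n for n in server if n in client_reads]
    # Signals referenced by server-side transforms go to the server
    # (assumes all signals are produced by browser interactions).
    client_to_server = sorted({
        sig
        for n in server
        for t in by_name[n]["transforms"]
        for sig in t.get("signals", [])
    })
    return {
        "server": server,
        "client": client,
        "server_to_client": server_to_client,
        "client_to_server": client_to_server,
    }


datasets = [
    {"name": "flights", "transforms": []},
    {"name": "filtered", "source": "flights",
     "transforms": [{"type": "filter", "signals": ["brush"]}]},
    {"name": "binned", "source": "filtered",
     "transforms": [{"type": "bin"}, {"type": "aggregate"}]},
    {"name": "labels", "source": "binned",
     "transforms": [{"type": "wordcloud"}]},
]

result = plan(datasets, mark_sources=["binned", "labels"])
print(result)
```

In this example only the small `binned` dataset crosses from server to client, and only the `brush` signal crosses from client to server.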

The spec parsing and planning logic is compiled to WebAssembly and executed in the browser. The Jupyter Widget protocol is used to transfer data between server and client.

Section 4 - How the design will enable additional use cases

The system is designed to be embedded in a variety of web contexts. The initial focus is on the Jupyter Widget use case, but only a relatively small amount of the code is Jupyter-specific. The roadmap includes Dash, Streamlit, and Panel support.

The server uses an efficient caching scheme to avoid duplicate calculations and to support many simultaneous visualizations of the same dataset without increased memory usage.
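A minimal Python sketch of the sharing idea, assuming a cache keyed by the dataset identifier plus the transform pipeline. The `ResultCache` class and its key format are hypothetical and say nothing about eviction or concurrency, which a real runtime has to handle.

```python
import json


class ResultCache:
    """Hypothetical cache holding one result per distinct (dataset, transforms) key."""

    def __init__(self):
        self._values = {}
        self.computations = 0  # how many times work was actually done

    def get_or_compute(self, dataset_id, transforms, compute):
        # Serialize the pipeline so equal pipelines map to the same key.
        key = (dataset_id, json.dumps(transforms, sort_keys=True))
        if key not in self._values:
            self.computations += 1
            self._values[key] = compute()
        return self._values[key]


cache = ResultCache()
pipeline = [{"type": "bin", "field": "delay", "step": 20}, {"type": "aggregate"}]


def expensive_histogram():
    # Stand-in for running the pipeline over a large dataset.
    return [{"bin_start": 0, "count": 3}, {"bin_start": 20, "count": 5}]


# Two visualizations of the same dataset request the same result.
chart_a = cache.get_or_compute("flights", pipeline, expensive_histogram)
chart_b = cache.get_or_compute("flights", pipeline, expensive_histogram)

print(cache.computations)   # 1
print(chart_a is chart_b)   # True
```

Both charts hold a reference to the same result object, so the second chart adds neither computation nor memory.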

In the initial Jupyter case, the runtime is embedded in the Python process for convenience. But the runtime could also run in a server configuration, allowing multiple processes to connect to it. This would be the preferred configuration for serving a dashboard to many users using Voila, Dash, Streamlit, or Panel.

Protocol buffers were chosen as the binary serialization format with the intention of hosting the VegaFusion runtime as a gRPC service.

Discussion of the state model.

Like Dash, each client maintains the full user state, so the server is not required to keep an active record of connected clients or to maintain the session state of each user. The difference from Dash is that the client isn't required to store the current value of every node in the task graph, only a unique fingerprint for that state. The result is that the client can maintain the state of a task graph that includes very large datasets without having to store the datasets itself.
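One way to picture the fingerprint idea is a Merkle-style hash, sketched below in Python. This is an assumption about how such fingerprints could be built, not a description of the actual implementation: each node's fingerprint hashes its own definition together with the fingerprints of its inputs, so the client can describe the whole graph state without holding any data. The task names, the `flights.parquet` URL, and the filter expression are made up.

```python
import hashlib
import json


def fingerprint(definition, input_fingerprints):
    """Hash a task definition together with the fingerprints of its inputs."""
    payload = json.dumps(
        {"def": definition, "inputs": input_fingerprints}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


def graph_fingerprints(tasks):
    """tasks: dict of name -> {"def": ..., "inputs": [names]}, in dependency order."""
    result = {}
    for name, task in tasks.items():
        result[name] = fingerprint(task["def"], [result[i] for i in task["inputs"]])
    return result


def make_tasks(brush):
    """Hypothetical three-node task graph parameterized by a brush selection."""
    return {
        "flights": {"def": {"url": "flights.parquet"}, "inputs": []},
        "brush": {"def": {"value": brush}, "inputs": []},
        "filtered": {
            "def": {"filter": "inrange(datum.delay, brush)"},
            "inputs": ["flights", "brush"],
        },
    }


before = graph_fingerprints(make_tasks([0, 60]))
after = graph_fingerprints(make_tasks([30, 90]))

for name in before:
    status = "unchanged" if before[name] == after[name] else "changed"
    print(f"{name:<9} {status}")
```

Moving the brush changes the fingerprints of `brush` and `filtered` but not of `flights`, so a server that keys cached values by fingerprint can reuse the loaded dataset and recompute only the filter.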

Section 5 - Additional feature roadmap

Support compiling the runtime itself to WebAssembly, making it possible to use DataFusion to accelerate calculations in the browser and to mix where calculations take place.

Support scales. Update the planner to run encoding logic on the server.

Support rendering.

Section 6

Explanation of the initial plan to license as AGPL3 with a CLA.

jonmmease commented 2 years ago

Done in https://medium.com/@jonmmease/announcing-vegafusion-570f62207ba7