vega / vegafusion

Serverside scaling for Vega and Altair visualizations
https://vegafusion.io
BSD 3-Clause "New" or "Revised" License
317 stars 17 forks source link

Add DataFusion datasource implementation in Python for pandas and DataFrame Interchange #438

Closed jonmmease closed 9 months ago

jonmmease commented 9 months ago

Closes https://github.com/hex-inc/vegafusion/issues/386

This PR adds custom DataFusion datasources, written in Python, for pandas DataFrames and objects that adhere to the DataFrame Interchange Protocol (i.e. have a __dataframe__ method). Using this approach, we no longer convert these objects directly to arrow before passing them into VegaFusion. Instead, they are converted to arrow dynamically during the DataFusion query. This makes it possible to down select to the required columns before converting to Arrow, which can be much much faster for DataFrames that include lots of columns.

Example of 10 million row histogram with pandas:

import altair as alt
import pandas as pd
alt.data_transformers.enable("vegafusion")

movies = pd.read_json("https://raw.githubusercontent.com/vega/vega-datasets/main/data/movies.json")
movies = pd.concat([movies]*3200, axis=0).reset_index(drop=False)
print(len(movies))
source = movies
chart = alt.Chart(source).mark_bar().encode(
    alt.X("IMDB Rating:Q", bin=True),
    y='count()',
)
chart

10243200

visualization

Test performing with chart.to_dict()

%%timeit
d = chart.to_dict(format="vega")

Before: 7.04 s ± 81.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) After: 336 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That's 20x faster!

This is because we're only converting 1 out of 17 columns to arrow, and skipping several string columns which are particularly slow to convert.

The duckdb connection against pandas is still a bit faster, but the DataFusion connection is now within a factor of 2 for this example.

DuckDB: 199 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The performance results for DataFrame Interchange Protocol objects will vary depending on how expensive it is to convert their contents to arrow using PyArrow. In this case we're using dfi.select_columns_by_name(columns) to filter down the source columns before converting to Arrow.

cc @ivirshup