Performance question about Voyager.

Please:

[X] Check for duplicate issues
[X] Describe how to reproduce the bug / the goal of the new feature request
[NA] Provide an example spec in JSON and, if applicable, screenshots or GIF videos (e.g., using https://www.cockos.com/licecap/)

Hi all! We are beginning to work on integration between Voyager and JupyterLab. Thanks to the work of @saulshanabrook we have an initial MVP of the Voyager UI in JupyterLab:

https://github.com/altair-viz/jupyterlab_voyager

@saulshanabrook and a Master's student of mine @zzhangjii are going to be working on this project. We have opened an issue on that repo (https://github.com/altair-viz/jupyterlab_voyager/issues/5) to begin discussing performance issues of Voyager in JupyterLab. I am guessing that most of them are general Voyager performance issues rather than JupyterLab-specific ones, so I wanted to raise them here. To get things started, a few questions:

Does each Vega-Lite view in Voyager (Specified and all Related Views) have a copy (or reference, shallow copy) of the data? What are the performance issues of that?
If a copy, does each Vega-Lite view have only the columns used by that view, or all of the columns?

@zzhangjii has started to do a performance analysis of using Voyager with different sized datasets and I have asked him to post that in the above linked issue.

A more fundamental UI/UX design question that is related to performance:

Would it make sense to render the related views one by one, and as each one is rendered, replace the full render with its png so there is only one live at a time?

Hi @ellisonbg

Sorry for the delayed reply. I've been busy preparing my talk about TensorFlow Graph Visualization for the VAST conference next week.

Here are some replies.

Does each Vega-Lite view in Voyager (Specified and all Related Views) have a copy (or reference, shallow copy) of the data? What are the performance issues of that?

In Voyager, we use a named data source and bind the filtered data (if the Voyager user has applied a filter) to each plot. (This filters rows only.)

Vega-Lite currently creates two data sources in the underlying Vega spec for each named data source (shown below). So it definitely involves some copying.

If I remember correctly:

Upon data ingest, Vega directly assigns each data object a unique ID.
For performance reasons, Vega does not always create a copy of the data bound via the API.
The second data source from VL's named data causes Vega to create a new copy of the data, which runs through the transforms. This prevents Vega from modifying the data that we bind.

(cc: @jheer and @domoritz -- please verify if I remember correctly and feel free to add more details)
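To make the distinction in the list above concrete, here is a minimal Python sketch of binding data by reference versus deriving a copied source that transforms can safely modify. This is not Vega's actual code: `bind_by_reference`, `derive_copy`, and `add_doubled` are hypothetical names used only for illustration.

```python
import copy

def bind_by_reference(rows):
    # Hypothetical: the view keeps a reference to the caller's list; no copy is made.
    return rows

def derive_copy(rows, transform):
    # Hypothetical second data source: shallow-copy each row, then run the
    # transform on the copies so the bound rows are never mutated.
    return [transform(copy.copy(row)) for row in rows]

def add_doubled(row):
    # Example transform that writes a new field onto the row it receives.
    row["doubled"] = row["value"] * 2
    return row

bound = [{"value": 1}, {"value": 2}]
source = bind_by_reference(bound)
derived = derive_copy(source, add_doubled)
```

In this sketch the first source costs nothing extra, while the derived source costs one row copy per plot, which is where the per-view memory overhead would come from.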

The only optimization we have done so far is in the filtering step: the data is filtered once and all plots share the same filtered data.

At a basic level, one obvious optimization is to separate data aggregation from the Vega-Lite specs in Voyager and use vega-dataflow-api to aggregate instead.

Plots that use exactly the same aggregated data, such as those in the alternative encoding pane, could then share a single aggregated copy instead of each repeating the aggregation.
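A minimal Python sketch of that idea, assuming aggregates can be keyed by group-by field, measure field, and operation. `AggregateCache` is a hypothetical helper, not part of Voyager or vega-dataflow-api.

```python
from collections import defaultdict

class AggregateCache:
    """Hypothetical cache: one aggregated table per (groupby, field, op) key."""

    def __init__(self, rows):
        self.rows = rows
        self.cache = {}
        self.computations = 0  # how many times an aggregation was actually run

    def get(self, groupby, field, op):
        key = (groupby, field, op)
        if key not in self.cache:
            self.cache[key] = self._compute(groupby, field, op)
            self.computations += 1
        return self.cache[key]

    def _compute(self, groupby, field, op):
        # Group the measure values by the group-by field, then reduce each group.
        groups = defaultdict(list)
        for row in self.rows:
            groups[row[groupby]].append(row[field])
        reducers = {"sum": sum, "mean": lambda v: sum(v) / len(v), "count": len}
        reduce_values = reducers[op]
        return [{groupby: k, f"{op}_{field}": reduce_values(v)} for k, v in groups.items()]

rows = [
    {"origin": "USA", "mpg": 20},
    {"origin": "USA", "mpg": 30},
    {"origin": "Japan", "mpg": 35},
]
cache = AggregateCache(rows)
bar_data = cache.get("origin", "mpg", "mean")    # first plot: computes the aggregate
point_data = cache.get("origin", "mpg", "mean")  # second plot: reuses the same table
```

Two plots that differ only in mark or channel assignment would hit the same cache key here, so the aggregation runs once.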

@leibatt has worked on some ways to optimize data processing in Voyager too.

If a copy, does each Vega-Lite view have only the columns used by that view, or all of the columns?

Great question! Currently, all columns.

That said, now that Vega has the new project transform, it would be possible to add a feature so that Vega-Lite includes only the required fields (columns). This is definitely useful when the data is copied for raw plots, but I'm not sure it is useful for aggregate plots (especially if data copying only happens for the second data source).
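As a rough illustration of what including only the required fields would mean, here is a Python sketch that collects the fields an encoding references and copies only those columns. `fields_used` and `project` are hypothetical helpers, not Vega or Vega-Lite APIs, and the encoding dict only imitates the shape of a Vega-Lite encoding.

```python
def fields_used(encoding):
    # Collect the field names referenced by a Vega-Lite-style encoding dict.
    return sorted({channel["field"] for channel in encoding.values() if "field" in channel})

def project(rows, fields):
    # Copy only the listed fields from each row; the input rows are left untouched.
    return [{field: row[field] for field in fields} for row in rows]

rows = [
    {"name": "a", "mpg": 20, "hp": 130, "year": 1970},
    {"name": "b", "mpg": 30, "hp": 90, "year": 1972},
]
encoding = {
    "x": {"field": "hp", "type": "quantitative"},
    "y": {"field": "mpg", "type": "quantitative"},
}
needed = fields_used(encoding)
projected = project(rows, needed)
```

For a wide dataset, the per-plot copy would then grow with the number of encoded fields rather than with the total number of columns.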

@saulshanabrook and a Master's student of mine @zzhangjii are going to be working on this project.

Nice. @saulshanabrook @zzhangjii -- Nice to meet you and thank you for helping.

Meanwhile, I was about to let you know that @felixcodes, an undergrad student who has worked with me on Voyager for the past 2 years, and his friends @ssharif6, @abanh206, and @fionc also want to help improve Voyager's UX for real-world use, including integration with JupyterLab (and possibly other platforms like PowerBI), as part of their capstone project.

I'm sure there is enough work for all of us to collaborate on. I think it would be good to meet, brainstorm the important issues, and divide responsibilities soon after I come back from VIS (I'll be back 10/10).

Given that you will be at UW on 16-17 for the Altair hackathon, it would be good for the UW capstone team to meet you in person too.

vega / voyager

Performance question about Voyager. #728