Load data for initial views from cache

domoritz commented 2 months ago

Right now, loading initial views can be slow. If the user never interact with the charts, we loaded the full data unnecessarily and could have just loaded the data for the initial view. See for example https://idl.uw.edu/mosaic-framework-example/gaia-star-catalog which right now downloads the full 5m Gaia dataset on load.

The idea is to load the data for initial views into the cache and then serve requests for the initial views from this cache. To achieve this, we could launch a mosaic view and cache queries as part of the compile step and then at runtime load the cache into duckdb-wasm. A few pieces are missing but I think there is a lot of potential in having instant initial views.

This comment from https://observablehq.com/blog/observable-2-0 makes the point about initial loading but the current setup doesn’t fulfill that vision.

Most importantly, Framework’s data architecture practically forces your app to be fast because data is precomputed. Performance is critical for dashboards: users don’t like to wait, and dashboards only create value if users look at them. Slow dashboards waste time.

To get started on this project, make an observable Framework dashboard, simulate a slow network (using the browser network tools throttling feature, and observe the slow load times).

alexkreidler commented 1 month ago

I extended some work from https://github.com/uwdata/mosaic/issues/383#issuecomment-2381096465 that gets Mosaic running in NodeJS using JSDom to cache the initial queries by wrapping the connector.query function, and writing the DuckDB results to a folder:

prebaked/
├── 2tP90IpN.arrow
├── AiIxuiwq.arrow
├── S70gkjFk.arrow
├── lnAhYbTN.arrow
├── out.html (static SVG output, not used by frontend right now)
└── queries.json (maps the SQL queries to the right arrow files)

I then tested running Mosaic from the cached data in Observable Framework: https://github.com/alexkreidler/mosaic-framework-example/tree/cache-initial-view-queries

The connector on the frontend (in voting-cached.md) loads the query results directly from arrow files and doesn't even load DuckDB -- this significantly improves the initial loading times (when throttling to 3g the cached version loads in about 3 seconds compared to 57 seconds with the non-cached version). The tooltips work too! But this is pretty hacky and we'd need to load DuckDB eventually/in the background.

Screenshot 2024-09-29 at 15-07-20 Cached_pre-baked voter registration statistics Mosaic Framework

I can imagine this with an API like

const coordinator = new vg.Coordinator();
const baseConnector = vg.wasmConnector(); // or socket or rest
const staticDir = "/_file/data/prebaked" // for Observable Framework, would be /public or similar for Next, Vite etc
const connector = wrapConnectorWithCache(staticDir)(baseConnector);
coordinator.databaseConnector(connector);

You may also want to have this caching be somewhere else rather than at the connector level of the API. But it is promising that we can get Mosaic running in NodeJS so we can record the SQL queries and save their results to a cache somewhere when building a static site.

domoritz commented 1 month ago

Very cool that you figured out how to create precomputed results using jsdom. I think that faster load times are a really big improvement for Mosaic and would love to have some functionality in the core. I like that we can use the cache and don't even need DuckDB. What we will need to figure out is clear feedback for users when they start interacting with charts and things are not fully loaded yet.

In terms of implementation, I think we should combine this with the bundling logic (e.g. https://github.com/uwdata/mosaic/blob/9324986c84d8269e668e5314660dd6a752aba023/packages/duckdb-server/pkg/bundle.py#L61). What do you think?

alexkreidler commented 1 month ago

I think that's a great idea!

I updated my observable-framework code to use the mosaic NodeJS server, send a create-bundle request, copy the resulting bundle to the Observable Framework public dir, and then serve queries from the static bundle files, and it works well! As long as we can load the bundle into the WASM connector or socket/rest connectors, then we can serve the initial view queries from the cache and lazy load the bundle in the background.

I ran into a small issue where the type of the file saved in the bundle doesn't match the type requested by Mosaic (instead it is based on a regex in the create-bundle code), but it was an easy enough fix to find the right file based on the SHA256 hash of the query. This is fine because the code receiving the results from the connector can accept JSON objects or arrow tables for any query type. I also noticed the typings for Connection.query aren't right for the create or load bundle commands. E.g. while the correct format is await conn.query({ type: "create-bundle", name: "line-bundle", queries }); my editor says I'm missing the sql field.

We need a way to load a bundle into the WASM connector, which currently doesn't support them. We probably want a way to load "remote bundles" into the Python, Rust and Node servers so they can load bundles that someone else published or were pushed to object storage by a build process. A small concern is that the create-bundle / load-bundle code is duplicated in three languages and it may be tough to keep it in sync.

I agree we should have a way to show users whether the plot is interactable (or give them an explanation if they interact but nothing happens). We could add a new function to Connector like status(): "cache" | "loaded" that indicates that only pre-cached queries are supported or all queries are supported, respectively. This could maybe be combined with Plot status value/events (#157).

Another question: should we have a way to configure if the source data is included in the bundle or if it should be fetched at runtime? For example if it is a large parquet file that updates fairly frequently. Also, how would we handle initial view caches that are out of date? One idea is for the connector to save all the queries that are served from the cache, re-run them against the newly loaded source data, check if the results are different, and then indicate that the data is stale, or call Plot.render.

domoritz commented 1 month ago

Very nice!

I also noticed the typings for Connection.query aren't right for the create or load bundle commands.

Can you point me to that?

A small concern is that the create-bundle / load-bundle code is duplicated in three languages and it may be tough to keep it in sync.

I agree. One idea would be to simplify the bundles to make it easier to maintain them across languages or have the mosaic coordinator do most of the logic (rather than the server). Another could be to figure out a way to put the cache into duckdb itself so that loading a bundle would be as easy as calling an attaching a specific database. Re-considering how we do bundling is definitely on the table.

alexkreidler commented 1 month ago

Yeah, currently the typings are JSDoc comments that Typescript interprets as having a required sql field (also I realize I wrote Connection.query instead of Connector.query): https://github.com/uwdata/mosaic/blob/main/packages/core/src/connectors/rest.js#L12 https://github.com/uwdata/mosaic/blob/main/packages/core/src/connectors/socket.js#L92 Now that you merged #545, we could type this as a discriminated union.

I like the idea of putting the cache logic in the coordinator. It seems like it would be easy to run the queries to create a bundle. But it does seem harder to load the bundle because it needs to read those files into memory and update the cache, which is in the browser for WASM but on the server for the other connectors right now. Loading from a .duckdb file could be great because it supports loading from remote files over HTTP and all the tables would be automatically created. But I don't think we should get rid of the static bundle idea altogether because I think it is super useful for serving Mosaic visualizations if DuckDB is still loading or is not available on a specific device. One option could be that a bundle has both the static arrow files for quick rendering without DuckDB, and the DuckDB file for easy loading into the database. Once DuckDB is loaded the cache would actually query a pre-loaded DuckDB table for a given SQL query, which would hopefully have similar performance to the in-memory cache.

domoritz commented 1 month ago

Good point. Does https://github.com/uwdata/mosaic/pull/549 look right to you?

domoritz commented 1 month ago

But it does seem harder to load the bundle because it needs to read those files into memory and update the cache, which is in the browser for WASM but on the server for the other connectors right now.

There is also a cache in the browser. We have multiple levels of caching (and might want to reconsider some as we probably duplicate the cache in wasm).

alexkreidler commented 1 month ago

@domoritz Yeah #549 looks good!

jheer commented 1 month ago

What we will need to figure out is clear feedback for users when they start interacting with charts and things are not fully loaded yet.

Right now this is left as a responsibility of MosaicClient implementations. The coordinator will call "queryPending" and the client can do with it what it will. I think this is the right abstraction, as clients can be arbitrary (they need not even correspond to visual UI components). But at minimum I think the reference components in vgplot and inputs could make better use of that. So I think adding some kind of pending indicator in the UI for those components could be a great start.

domoritz commented 1 month ago

Filed https://github.com/uwdata/mosaic/issues/550

uwdata / mosaic

Load data for initial views from cache #506