observablehq / stdlib

The Observable standard library.
https://observablehq.com/@observablehq/standard-library
ISC License

When re-executing a cell with DuckDBClient.of, stdlib downloads duckdb-eh.wasm every time #368

Closed (lfkpoa closed 1 year ago)

lfkpoa commented 1 year ago

It's great to use DuckDBClient in Observable, but duckdb-eh.wasm is a 3.7 MB download and it can take a little while. The wait at the start of the notebook is OK, but my use case fetches different data depending on selections in other cells, and Observable downloads duckdb-eh.wasm again every time it re-executes the cell that calls DuckDBClient.of, which takes a couple of seconds. It would be nice if stdlib could detect that duckdb-wasm has already been downloaded and reuse it. An alternative would be a way for the notebook to interact with the existing DuckDB instance to add or update tables.

Thank you.

mootari commented 1 year ago

Do you have your browser cache disabled? The wasm should get served from the browser's cache on subsequent requests.

lfkpoa commented 1 year ago

The cache is enabled.
In the network tab of Developer Tools in Chrome and Edge, duckdb-browser-eh.worker.js was cached but duckdb-eh.wasm was not. I found some posts saying that some browsers do not cache large binary files. The file is not that large (3.3 MB), but it is not being cached. I also tested Firefox, and Firefox does cache duckdb-eh.wasm. Unfortunately, many users here use Chrome or Edge.
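For anyone who wants to check this without reading the network tab, the Resource Timing API exposes size fields that hint at whether a response came from the cache. The following is only a sketch and not part of stdlib: `classifyEntry` is a hypothetical helper, and it reads plain objects so that it can be tried outside a browser.

```javascript
// Hypothetical helper: classify a PerformanceResourceTiming-like entry.
// Only transferSize, encodedBodySize and decodedBodySize are read.
function classifyEntry(entry) {
  const { transferSize, encodedBodySize, decodedBodySize } = entry;
  // Nothing went over the wire, but a body exists: served from local cache.
  if (transferSize === 0 && decodedBodySize > 0) return "cache";
  // All sizes hidden, e.g. cross-origin without Timing-Allow-Origin.
  if (transferSize === 0) return "unknown";
  // Only headers were transferred: the cached copy was revalidated (304).
  if (encodedBodySize > 0 && transferSize < encodedBodySize) return "revalidated";
  return "network";
}

// In a browser, entries can be listed roughly like this. Requests made from
// inside a web worker are recorded in that worker's own timeline, so this
// may have to run there to see the wasm request:
// performance.getEntriesByType("resource")
//   .filter((e) => e.name.endsWith(".wasm"))
//   .map((e) => [e.name, classifyEntry(e)]);
```

An "unknown" result says nothing either way; in that case the network tab remains the more reliable source.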

lfkpoa commented 1 year ago

I also found out that when a Parquet file is passed to DuckDBClient as a FileAttachment, it is registered as a file and a view is created on it. The file is not imported; DuckDB uses range requests to load parts of the file depending on the queries. That is nice when the file is large, because only parts of it are fetched. But depending on the queries and how often they are repeated, it can end up fetching much more than the file size. Maybe there should be a way to choose whether the file is imported or referenced.
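To make the difference concrete, here is a rough sketch of the two SQL statements involved. `parquetStatement` and its `mode` argument are hypothetical and not part of stdlib; the sketch assumes the file has already been registered with the database under the given name, and it relies only on DuckDB's `parquet_scan` table function.

```javascript
// Hypothetical helper, not part of stdlib: build the SQL that exposes a
// registered Parquet file either as a view (read lazily on each query) or
// as a table (read once and materialized inside DuckDB).
function parquetStatement(name, file, mode = "view") {
  if (mode !== "view" && mode !== "table") {
    throw new Error(`unknown mode: ${mode}`);
  }
  const ident = `"${name.replace(/"/g, '""')}"`; // quote the identifier
  const path = `'${file.replace(/'/g, "''")}'`;  // quote the string literal
  const kind = mode === "view" ? "VIEW" : "TABLE";
  return `CREATE ${kind} ${ident} AS SELECT * FROM parquet_scan(${path})`;
}
```

The resulting string could then be passed to the client's `query` method. With "view", every query goes back to the file; with "table", the file is read once up front and later queries run against the imported copy.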

lfkpoa commented 1 year ago

I created a notebook to try out some possible alternatives based on the code of duckdb.js. For example, I added a static function createDB that creates the db in a separate notebook cell. This way the duckdb wasm file is not reloaded when I change the tables. I also added the function include, which adds tables to the database exactly like DuckDBClient.of does, so tables can be added after the client is created. Since the database is not recreated every time, I had to drop the tables if they exist before inserting them. https://observablehq.com/d/4bf4650566fccbb3
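The idea can be illustrated in plain JavaScript. This is a simplified sketch and not the notebook's actual code: `once` is a hypothetical helper, and `createDB` and `include` here are stand-ins that use a fake in-memory database instead of duckdb-wasm. In a notebook, putting the database in its own cell gives the same effect as `once`, because that cell does not re-run when only the tables change.

```javascript
// Hypothetical helper: run an expensive async initializer at most once and
// hand every caller the same promise.
function once(init) {
  let promise = null;
  return () => {
    if (promise === null) promise = init();
    return promise;
  };
}

// Stand-in for the real work (downloading and instantiating duckdb-wasm).
let downloads = 0;
const createDB = once(async () => {
  downloads += 1;
  return { tables: new Map() }; // fake database, for illustration only
});

// Stand-in for include: replace tables instead of recreating the database.
async function include(dbPromise, sources) {
  const db = await dbPromise;
  for (const [name, rows] of Object.entries(sources)) {
    db.tables.delete(name);    // plays the role of DROP TABLE IF EXISTS
    db.tables.set(name, rows); // plays the role of the insert
  }
  return db;
}
```

Calling `include(createDB(), {penguins: rows})` repeatedly with different data reuses the one database, so the expensive initialization happens only on the first call.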

mootari commented 1 year ago

I still can't reproduce what you're describing. Can you point to a source that explains the caching limitation that you mentioned?

I tried the following:

  1. create a new notebook
  2. add a JS cell with the contents:
    DuckDBClient.of({penguins})
  3. rerun the cell several times

Both in Chrome and Firefox (macOS) the wasm was served from cache in all subsequent requests.

lfkpoa commented 1 year ago

Yeah, on macOS it is served from cache. I just checked. But on Windows 10 it is not.

lfkpoa commented 1 year ago

I'm sorry. I guess this is related to some sort of group policy that is being applied to our computers. I couldn't find any information about it, but a friend tested this for me and duckdb-eh.wasm was cached. Thank you.