duckdb 1.29.0; self-host extensions

Fil commented 1 month ago

🎉 1.29.0! new version of DuckDB-wasm 🎉

https://github.com/duckdb/duckdb-wasm/releases/tag/v1.29.0

The repo has had 296 commits since the last stable release a year ago, not counting the commits to DuckDB itself, which is now at version 1.1.1.

See https://duckdb.org/2024/09/09/announcing-duckdb-110 for the new features and changes in DuckDB. For example, the nice HISTOGRAM() function:

[image: duckdb-histogram]

The most notable new feature in duckdb-wasm is the support for extensions, in particular the "spatial" extension, which includes the whole of GDAL. It enables geographic computation (projections, areas, etc.) and adds compatibility with dozens of new formats (Shapefiles, Excel sheets, etc.).

Other extensions: autocomplete, fts, icu, inet, json, parquet, spatial, sqlite_scanner, substrait, tpcds, tpch, vss.

1070
1508
1598 (maybe)
1009
1380
https://github.com/observablehq/framework/discussions/834

Fil commented 1 month ago

About the package size

Along with the new features comes a much heavier release: the base files have doubled in size, and with the extensions added the binaries now take 153 MB of disk space on the server.

Fortunately this is not what the user has to download. First, depending on the browser used, they will download only one build of the wasm files: the "mvp" version (older browsers) or the "eh" version, which is slightly more performant. Second, they will not load all extensions (and only "spatial" is quite big). Third, the wasm files are gzip'ed when transmitted to the browser.

But it still doubles the (compressed) size of the base files, from 4MB to ~8MB (depending on the extensions needed; here I compare 1.28.0 with 1.29.0+parquet). Is there a case to be made for letting users stay on 1.28 because of that? I'd prefer not to support that, since it would add a lot of complexity.

About self-hosting extensions

A key feature of Framework is self-hosting. I didn't want to support 1.29 without self-hosting at least the extensions that used to be part of the monolithic 1.28 ("parquet", "json"). The status of extensions is however still a bit unclear to me. Some of the core extensions are built-in (such as httpfs). It's unclear how that list will change in the future (httpfs changed status during development, I think). So instead of linking to duckdb-wasm@latest, I thought it better to continue pinning the version, so as to prevent unexpected changes.

Moreover, only the core extensions are self-hosted for now, and we might want a path also for people who want to self-host community extensions (such as "h3") or custom extensions.

Maybe self-hosting "all the core extensions" is too much, and we could have a smaller list of extensions we self-host. However judging from the sizes of extensions, "spatial" dwarfs all the others—so it might not make sense to try and optimize disk space if we keep "spatial". Another option would be to make a configurable list of self-hosted extensions (including community and custom extensions). We would then have to pass that list to client/duckdb.js to install. More configuration means more complexity, though.

Fil commented 1 month ago

An alternative approach could be to publish a package on jsr or npm with the extensions we want to self-host.

mbostock commented 1 month ago

I guess my inclination is to have users explicitly list which DuckDB extensions they want, and where they come from. And then Framework can download them for self-hosting. So maybe in the config you would say something like:

export default {
  duckdb: {
    extensions: {
      json: "https://extensions.duckdb.org/v1.1.1/wasm_eh/json.duckdb_extension.wasm",
      parquet: "https://extensions.duckdb.org/v1.1.1/wasm_eh/parquet.duckdb_extension.wasm"
    }
  }
};

If we wanted to have shorthand, we could also allow something like:

export default {
  duckdb: {
    extensions: {
      json: true,
      parquet: true
    }
  }
};

Or even shorter:

export default {
  duckdb: {
    extensions: ["json", "parquet"]
  }
};
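For illustration, the three forms above could be normalized into a single name-to-URL map before anything is downloaded. This is only a rough sketch: `normalizeExtensions`, `defaultUrl` and `DEFAULT_REPO` are hypothetical names, and the default location is simply the one used in the first example, not a decided default.

```javascript
// Hypothetical helper, not Framework's actual implementation.
// Assumed default location, copied from the URLs in the first config example.
const DEFAULT_REPO = "https://extensions.duckdb.org/v1.1.1/wasm_eh";

function defaultUrl(name) {
  return `${DEFAULT_REPO}/${name}.duckdb_extension.wasm`;
}

// Accepts an array of names, or an object mapping names to `true` or to a URL,
// and returns a plain object mapping each extension name to a URL.
function normalizeExtensions(extensions = {}) {
  const entries = Array.isArray(extensions)
    ? extensions.map((name) => [name, true])
    : Object.entries(extensions);
  const resolved = {};
  for (const [name, source] of entries) {
    if (source === false || source == null) continue; // explicitly disabled
    if (source === true) resolved[name] = defaultUrl(name);
    else if (typeof source === "string") resolved[name] = source;
    else throw new Error(`unsupported source for extension ${name}`);
  }
  return resolved;
}
```

With this, `["json", "parquet"]` and `{json: true, parquet: true}` resolve to the same map as the fully spelled-out URLs.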

So Framework would self-host the specified files. And internally we’d have some resolution magic so that DuckDBClient knows where to find the self-hosted extensions. And if we’re allowing arbitrary URLs for extensions we’d need to use content hashing so that if the content of the extension changes it’s still immutably cached.
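A minimal sketch of what content hashing could look like here, assuming Node's built-in `node:crypto` module is available at build time. The `hashedName` function, the file naming scheme and the 8-character hash length are all assumptions made for illustration.

```javascript
// Hypothetical helper: derives a file name from the extension's bytes, so that
// a change in content produces a new name and the old one can stay cached forever.
async function hashedName(name, bytes) {
  // Dynamic import keeps this sketch loadable from both CommonJS and ES modules.
  const {createHash} = await import("node:crypto");
  const hash = createHash("sha256").update(bytes).digest("hex").slice(0, 8);
  return `${name}.${hash}.duckdb_extension.wasm`;
}
```

The same bytes always give the same name, so re-downloading an unchanged extension would not invalidate anything.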

Fil commented 1 month ago

About the LOAD statements

Currently, if several sql blocks use spatial functions (for example), you have to remember to type LOAD spatial in all of them; otherwise it's hard to predict which queries will run after the extension is loaded and which will run (and fail) before it is. Only the first block to run actually loads the extension into the DuckDB instance, so the repeated statements are a bit of a waste.

To avoid this issue we could maybe hoist any LOAD statement, so that you can have LOAD spatial in just one of your sql code blocks instead of having to repeat it in every block that needs this extension. This means static analysis of the sql code, but it's probably not too bad(?).
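To give an idea of the static analysis involved, here is a rough sketch that collects the extension names from LOAD statements across a page's sql blocks. `findLoads` is a hypothetical name, and the regular expression is a deliberate simplification: it does not understand SQL comments or string literals, which a real implementation would have to handle.

```javascript
// Hypothetical static analysis: find `LOAD name` statements at the start of a
// block or right after a semicolon, and return the distinct extension names.
function findLoads(sqlBlocks) {
  const names = new Set();
  const pattern =
    /(?:^|;)\s*LOAD\s+(?:"([^"]+)"|'([^']+)'|([A-Za-z_][A-Za-z0-9_]*))\s*(?=;|$)/gi;
  for (const sql of sqlBlocks) {
    for (const match of sql.matchAll(pattern)) {
      // Exactly one of the three capture groups is set, depending on quoting.
      names.add(match[1] ?? match[2] ?? match[3]);
    }
  }
  return [...names];
}
```

The resulting list could then be loaded once per page, before any of the blocks run.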

Or, maybe simpler, we could add a top-level config in front-matter. Something like:

sql:
  - load: [spatial, h3]
  -

Or keep the sql key for tables, and add a new key for duckdb options:

duckdb:
  - load: [spatial, h3]
  -

(we could also make it possible to reference an Excel or Shapefile dataset in front-matter, since spatial’s ST_Read function supports so many formats?)

mbostock commented 1 month ago

@Fil In my previous comment I meant that this could be specified in the project config. But we could also let it be specified in the page front matter, overriding the project config if different pages want different extensions.

Fil commented 1 month ago

The config option would indicate which extensions are self-hosted and where they're sourced from. Thus, they would be INSTALLed from the self-hosted version. But INSTALL only tells duckdb where to find the wasm binary; it doesn't actually load it in the browser.

For many core extensions this happens implicitly, when duckdb recognizes that one of the functions or file formats used belongs to a given extension (the extension is then said to be “auto-loaded”). The documentation in lib/duckdb.md shows this with the "inet" example. For other extensions, such as "spatial", you have to give an explicit LOAD statement before you can use any of the features.

Currently, when an extension needs to be loaded explicitly, it has to be mentioned in every sql code block, because their order of execution is not guaranteed. That's a bit too much, and I think the correct level to define these LOAD statements is the DuckDBClient instance—or more simply, the page.

I hadn't thought about loading all the configured/self-hosted extensions on all the pages, thinking that it should depend on what the page needs (e.g., for better performance on pages that don't need "spatial"). But I reckon this would make it easier to use, and maybe I'm overcomplicating things for the sake of the hypothetical project that might need an extension on a given page and not on another one. Maybe we should opt for simplicity.

(I'll play with the various possibilities to see how it feels.)

mbostock commented 1 month ago

Right, so the config could say whether to load the extension explicitly or to let it autoload if desired. But in either case the installing (and optionally loading) of any desired extensions would happen prior to the sql literal resolving so that downstream code can rely on the extensions being available.
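As a rough sketch of that ordering, the statements could be generated from the config and awaited before any query runs. Everything here is assumed for illustration: the `{name, repository, load}` entry shape, the function names, and a `connection` object exposing an async `query(sql)` method. It relies on DuckDB's `INSTALL name FROM 'repository'` form, where `repository` is taken to be the base URL of the self-hosted extension repository.

```javascript
// Hypothetical sketch: build INSTALL (and optional LOAD) statements from config entries.
function extensionStatements(extensions) {
  const statements = [];
  for (const {name, repository, load = false} of extensions) {
    // Names are interpolated unquoted, so only accept plain identifiers.
    if (!/^\w+$/.test(name)) throw new Error(`invalid extension name: ${name}`);
    const from = repository.replace(/'/g, "''"); // escape quotes for the SQL literal
    statements.push(`INSTALL ${name} FROM '${from}'`);
    if (load) statements.push(`LOAD ${name}`);
  }
  return statements;
}

// Runs the statements one at a time, so that each LOAD follows its INSTALL.
// Callers would await this before resolving the sql literal.
async function registerExtensions(connection, extensions) {
  for (const sql of extensionStatements(extensions)) {
    await connection.query(sql);
  }
}
```

An entry with `load: false` is only installed, which leaves autoloading to DuckDB for extensions such as "json" and "parquet".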

Having equivalent extension registration for the front matter as for the project config makes sense.

Fil commented 1 month ago

Getting closer.

TODO:

[x] configuration to allow the "core" and "community" keywords
[x] decide which extensions are loaded and which aren't (typically, "json" and "parquet" don't need to be loaded since they're autoloaded)
[x] find a different way to pass the hash manifest (so that scripts can also use it)
[x] support mvp in extensions, or drop support for mvp globally
[x] allow per-client configuration (via DuckDBClient, not front-matter for now)
[x] bake the extensions manifest in the client js

mbostock commented 1 month ago

Lots here! Excited about this. I’ll try to help this week.

fabito commented 1 week ago

How do we use/enable the extensions support? Do we need to wait for an official release, or can we install the prerelease version?

Fil commented 1 week ago

It's possible but difficult; my recommendation is to wait (a few days max) for the next release of Framework.

observablehq / framework

duckdb 1.29.0; self-host extensions #1734
