nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
105.01k stars 28.43k forks source link

Expose SQLite? #53264

Open benjamingr opened 3 weeks ago

benjamingr commented 3 weeks ago

We agreed (the TSC) to allow SQLite into the project for the localStorage API at https://github.com/nodejs/node/pull/52435

It would be neat if we exposed that to userland similraly to how Bun does, there is a lot of prior art of languages and runtimes embracing SQLite for a local configuration database like python, php, Tcl and others.

This issue is to discuss it and raise concerns about whether or not it makes sense to expose SQLite bindings given we already bundle it.

cjihrig commented 3 weeks ago

I imagine this idea will get a lot of support and a lot of opposition. If someone else handles the consensus building aspect, I'd be happy to work on the implementation side.

gabrielschulhof commented 3 weeks ago

Shall we use the same interface as https://github.com/TryGhost/node-sqlite3?

tniessen commented 2 weeks ago

@gabrielschulhof On the rare occasion that I voluntarily write JavaScript code, I personally prefer the better-sqlite3 API, which also tends to be significantly faster.

That being said, the non-blocking behavior of node-sqlite3 better matches the Node.js I/O model. If you have a slow storage medium or if you use a slow VFS, e.g., with a network resource as its backend, then the asynchronous node-sqlite3 API is a better choice. Similarly, applications that do not utilize SQLite transactions efficiently will spend a lot of time waiting for each change to be committed, which is slow even if the underlying storage is fast, so using synchronous APIs in those cases unnecessarily blocks the event loop. Applications that use well-designed transactions and that run on a very fast storage backend, on the other hand, will see a lot of unnecessary overhead from asynchronous APIs.

Perhaps Node.js could allow native addons to link against the bundled SQLite dependency instead and leave the API to third-party packages.

GeoffreyBooth commented 2 weeks ago

I think at the very least let's land the localStorage PR which would get SQLite into the codebase. After that, my concern would be semver: if we expose SQLite, then we can't update it any faster than our release cycle, unless we make some kind of exception for it.

cjihrig commented 2 weeks ago

FWIW, SQLite has followed semver since version 3.9.0 in 2015, and the current major version is still 3:

The current value for X is 3, and the SQLite developers plan to support the current SQLite database file format, SQL syntax, and C interface through at least the year 2050. Hence, one can expect that all future versions of SQLite for the next several decades will begin with "3.".

KhafraDev commented 2 weeks ago

I would like to bring attention to esqlite by @mscdex, which outperform(s/ed) better-sqlite3 in a number of cases while being asynchronous. https://github.com/mscdex/esqlite

benjamingr commented 2 weeks ago

It looks like we have consensus around experimenting with this (i.e. prototyping a sqlite binding in core), just for further visibility I'll also ping the TSC in chat.


As for the API I personally really prefer a synchronous one over an asynchronous one since SQLite is built to run in-process. I think in the common use case this is significantly simpler and likely faster. That said, an asynchronous promise based API should also be fine.

Funnily enough 2 people asked me about this feature today - a good luck omen it's something users find interesting.

tniessen commented 2 weeks ago

We already see different opinions on the API. If there is no "one size fits all" API, should we have one in core? If we make it possible for native addons to link against the SQLite dependency that's embedded in Node.js, then that should let third-party packages define whatever API they want without having to ship SQLite.

H4ad commented 2 weeks ago

We already see different opinions on the API. If there is no "one size fits all" API

Currently the different opinions are basically sync vs async.

The basic API could be something like this

class Database {
  // path can be :memory: or the file path
  constructor(path: string, options?: { readonly?: boolean, create?: boolean }) {}

  query(statement: string): Statement; // works like prepare, like bun is doing right know
  open(): void; // maybe we can just open automatically? and add a option to disable auto-open
  close(): void;
}

type AllResult = Array<any>;
type RunResult = void; // maybe boolean?
type GetResult = any;

class Statement {
  all(...params: any[]): Promise<AllResult>;
  run(...params: any[]): Promise<RunResult>;
  get(...params: any[]): Promise<GetResult>;

  // we can also introduce sync methods, like fs
  allSync(...params: any[]): AllResult;
  runSync(...params: any[]): RunResult;
  getSync(...params: any[]): GetResult;
}

Sync has advantages, and async too, so ship both and be happy.

In the first iteration, we can stick with the easiest/simplest implementation, which is probably the sync version. So we can iterate and add the asynchronous versions and then add more methods (pluck, iterate, values, etc...)

EDIT: I didn't talk about transactions because it didn't even matter in this first iteration, is better to discuss this method when we have at least the basic methods, then we will have more knowledge on what could be the best API for exposing transactions.

If we make it possible for native addons to link against the SQLite dependency that's embedded in Node.js, then that should let third-party packages define whatever API they want without having to ship SQLite.

The point is that you don't need to use third-party packages to be able to use sqlite, but you can still do that even if we have our own API.

benjamingr commented 2 weeks ago

We already see different opinions on the API

We've had very little api discussion so far, I think it's fine to talk about stuff and see if we can reach consensus?

GeoffreyBooth commented 2 weeks ago

For reference:

benjamingr commented 1 week ago

@tniessen looping back to your comment here: https://github.com/nodejs/node/issues/53264#issuecomment-2156020758 I tend to think the better-sqlite3 sync API is better for most cases and given SQLite's execution model and the motivation of users asking to add it to Node.js (not as a DB server replacement but for local configuration, at least that's what I've been hearing). I see your point though.

There is interesting discussion here: https://sqlite.org/forum/forumpost/7c90893579 as well.

I think it may be a good idea to reach out to the SQLite team and ask, so I went ahead and opened https://sqlite.org/forum/forumpost/96f9f78fc1

cjihrig commented 1 week ago

I think it makes the most sense to do something similar to the fs module here and have synchronous and asynchronous (Promise based, no callbacks) APIs. For the majority of cases, synchronous APIs are good enough, but there are almost certainly cases where async will be preferred (and it is more Node-like). I have started writing some code for the DB connection already.

EDIT: Very early, but here is the branch.

cjihrig commented 1 week ago

Another thing I think is worth discussing is whether or not it makes sense to expose this as its own module (node:sqlite) or as part of another module such as node:util.

asg017 commented 1 week ago

One note: if you do decide to include SQLite in Node.js, please statically compile an up-to-date SQLite library instead of relying on the host's SQLite library!

This is a problem with Python's SQLite library and Bun's SQLite library: By default they use whatever SQLite version the OS has installed, which leads to headaches. Some default SQLite builds omit features like SQLite extensions, JSON, or math functions. Or they are stuck on old version of SQLite which lack window functions, JSONB, datetime utils, or STRICT tables. This is even documented in Bun's docs and Python's docs.

So if you statically compile a recent SQLite version, say 3.46, you'd have full control of what SQLite features are available across different platforms. And if someone wanted to use a different/more up-to-date SQLite version, they can always reach for better-sqlite3 or another 3rd party package. This is what we do for Datasette, which is in Python - we use the standard Python SQLite library and tell people to use the 3rd party pysqlite3 package if their builtin SQLite is too old.

benjamingr commented 1 week ago

@asg017 Node bundles 3.46.0 for web storage support already so presumably that's the one that will be used in this case.

asg017 commented 1 week ago

Also, regarding sync vs async: the better-sqlite3 README says:

node-sqlite3 uses asynchronous APIs for tasks that are either CPU-bound or serialized. That's not only bad design, but it wastes tons of resources. It also causes mutex thrashing which has devastating effects on performance.

I've never benchmarked it myself, but this is the sole reason why I always use better-sqlite3 over node-sqlite in my projects.

I know a lot of Node.js developers would expect an async API (like node:fs/promises), but if they knew of potential performance problems, I'd imagine they'd let it slide.

glommer commented 1 week ago

Just to add my 2-cents, sync makes sense for simple queries. But analytical queries, things like computeing averages and sum can take a fair amount of time.

Given that some users will want to use this for things like dashboards, that can issue a bunch of aggregation queries, an async API would make more sense.

cjihrig commented 1 week ago

I agree with that point from better-sqlite3. A counterpoint is that sqlite might not be the only thing the application is doing. But if a sqlite call is blocking, then nothing else can happen. This is why both sync and async APIs should be provided.

benjamingr commented 1 week ago

This is also my conclusion based on https://sqlite.org/forum/forumpost/96f9f78fc1

benjamingr commented 1 week ago

I'll also escalate the questions to the TSC chat to make sure we have consensus.

legendecas commented 1 week ago

Another thing I think is worth discussing is whether or not it makes sense to expose this as its own module (node:sqlite) or as part of another module such as node:util.

Any reason to put it in node:util?

tniessen commented 1 week ago

This is also my conclusion based on https://sqlite.org/forum/forumpost/96f9f78fc1

FWIW, that discussion also raised this point:

it's our collective opinion that there neither can be, nor should be, One True Wrapper (...) Having multiple wrappers is, in our considered opinions, healthy for the ecosystem and provides the most all-around flexibility when selecting which set of trade-offs (compared to the C API) a project wants to use.

This reinforces my opinion that we should make it easy for third-party packages to link against the bundled version of SQLite, assuming we are committing to shipping SQLite forever anyway.

tniessen commented 1 week ago

With asynchronous APIs, I think it's important to very carefully design a good transaction interface. better-sqlite3 and Bun's SQLite API use wrapped functions for that, which is somewhat reasonable in a purely synchronous setting. node-sqlite3, on the other hand, relies on the user manually distinguishing between calls that may run in parallel and call that may not. Other bindings, such as Rusqlite have created much better interfaces in my opinion, but those don't tend to work that well in a messy language like JavaScript.

benjamingr commented 1 week ago

This reinforces my opinion that we should make it easy for third-party packages to link against the bundled version of SQLite, assuming we are committing to shipping SQLite forever anyway.

Yes and that sounds like a good idea. That does not preclude us from shipping our own wrapper as well. I think a major theme in recent Node features (the test runner, task runner, watch mode, util.parseArgs etc) isn't to be the best or have the richest functionality in those spaces - it's to provide sensible and safe defaults for most projects.

I think no matter what API we ship, userland will still provide alternatives and projects using sqlite heavily for more than configuration are likely to still use those wrappers. (Similarly to projects needing browser testing, advanced task runner features, rich argument parsing, more fs convenience methods etc).

So I am not that concerned about our API not being "feature complete"

Other bindings, such as Rusqlite have created much better interfaces in my opinion, but those don't tend to work that well in a messy language like JavaScript.

Assuming you had no messy language constraints whatsoever - what API would you pick?

I personally think we might want to stick with disposers (Symbol.dispose) and not the disposer pattern - but honestly even if we have very rudimentary transaction support and leave richer APIs to userland I'm fine with that.

cjihrig commented 1 week ago

Any reason to put it in node:util?

Not particularly. A new core module is a bigger undertaking - particularly since the decision to require the node: scheme - so I thought it was at least worth raising the point. I'm fine either way.

mcollina commented 1 week ago

I personally recommend using most of the amazing API of https://github.com/WiseLibs/better-sqlite3, i.e. use synchronous APIs. They generically offer better performance with SQLite on modern HW. I'm not opposed to have sync and async version of those.

cc @JoshuaWise @mceachen for visiblity.

mceachen commented 1 week ago

Proper SQLite support within Node.js would be splendid!

A couple thoughts:

  1. I’d suggest not skipping a Transaction API in the first iteration: many use cases require ACID behavior.
  2. PRAGMA support is another possibly-surprising feature of SQLite that is a hard prerequisite for reasonable performance and stability for many use cases.
  3. A codebase that uses both sync and async calls concurrently could be prone to deadlock/busy timeouts, especially when mixing different transaction flavors (like BEGIN DEFERRED in the sync call, and BEGIN IMMEDIATE in the async call). A “solution” to this issue could simply be a warning in the documentation to not do that.

I like the idea of using better-sqlite3’s API wholecloth, as that could mean switching between the “native” node api, bun, and better-sqlite3 would be a simple import twiddle. Async API support would break this, of course. Edit: as @cjihrig mentions below, this already isn't the case: I think better-sqlite3 API's main value-add for this effort should be informative with respect to what SQLite functions should be considered. Also know that better-sqlite3's API is not a comprehensive set of sqlite3's API.

One suggestion from the peanut gallery: I’d suggest this API live in a node:sqlite package (like bun:sqlite) rather than node:util, just to avoid the “util is a dumping ground” trope.

benjamingr commented 1 week ago

One suggestion from the peanut gallery: I’d suggest this API live in a node:sqlite package (like bun:sqlite) rather than node:util, just to avoid the “util is a dumping ground” trope.

This seems to be the consensus so far.

cjihrig commented 1 week ago

I like the idea of using better-sqlite3’s API wholecloth, as that could mean switching between the “native” node api, bun, and better-sqlite3 would be a simple import twiddle.

I'm not trying to knock the better-sqlite3 API at all, but my understanding is that this is already not the case. The Bun docs mention that their API was inspired by better-sqlite3, but they already seem to have diverged from it some (and probably will continue to do so). That approach also would not offer any compatibility with other sqlite modules like Deno's or node-sqlite3 (which has roughly twice the weekly downloads according to npm).

cjihrig commented 1 week ago

Another question regarding the API - if we were to start off following the better-sqlite3 API and wanted to add async support in the future, what would that look like? Node has generally had async APIs, and appended Sync to the name for synchronous equivalents. The better-sqlite3 API would sort of flip that naming pattern around.

benjamingr commented 6 days ago

@cjihrig honestly since naming is very bikesheddy in this case I think you should just pick one. Here is an example color for our bike shed:

But I think that's not one of the things that will prevent a PR from landing so I wouldn't worry too much about it.

JoshuaWise commented 5 days ago

I have a few thoughts on this.

Thoughts on an appropriate API

I designed better-sqlite3 to be easy to use for people who have never used SQLite's C API. I tried to capture the higher-level concepts into the simplest API I could think of, while still exposing most of what advanced users would want via options and utility methods. That said, I don't necessarily think that philosophy aligns with existing APIs within Node.js core. For example, node:fs is actually pretty hard to understand unless the user is already familiar with the C API of Unix/Linux-style filesystems (with the exception of a few high-level convenience methods like fs.readFile). Of course, for something built into the runtime, that low-level exposure makes sense. Therefore, I do think it would be appropriate for the API to (as much as possible) be a 1-to-1 mapping with SQLite's low-level C API, but with a few convenience methods built-in. I see no reason to depart from the method/methodSync naming convention used by Node.js's core libraries.

However, it should be noted that many asynchronous APIs would simply break or result in undefined behavior or unsafe memory accesses when combined with much of the behavior provided by SQLite. For example, SQLite allows you to register user-defined SQL functions in the host language (i.e., JavaScript in this case). These function would be invoked during certain SQL queries, but if those SQL queries were executed in a thread pool, it will result in unsafe cross-thread usage of JavaScript contexts (which is very, very bad). Similarly, transactions become prone to error and terrible performance when performed in an asynchronous manner, since transactions in SQLite lock the entire database. Therefore, any asynchronous API for SQLite in Node.js would need to be highly limited, exposing only the most basic functionality that SQLite offers.

To safely separate the synchronous and asynchronous worlds, I recommend exposing two different classes, SQLiteDatabaseAsync and SQLiteDatabaseSync, each of which represents a connection to a SQLite database, with SQLiteDatabaseSync providing access to the entire (or almost the entire) SQLite C API, while SQLiteDatabaseAsync would only provide basic functionality like making queries. The goal here is to make it impossible for a single database connection to use both async/non-blocking methods AND advanced SQLite APIs that would have detrimental effects when invoked from the libuv thread pool.

I believe an approach like this would align with the existing patterns/philosophy of Node.js's core libraries, and would satisfy the memory safety that is expected of the Node.js runtime. Sprinkle in a few high-level convenience methods, and let user-land do the rest.

Thoughts on SQLite versions

Perhaps most importantly (much more important than the API, in my opinion), is how to deal with the issue of SQLite versions. Besides the fact that the SQLite developers release new versions many times per year, SQLite is designed to be customized at compile-time, giving the user the power to enable certain advanced features, disable deprecated features that may harm performance or be unsafe, or even change the behavior of basic SQL expressions (like case-sensitivity, unicode, etc.). In better-sqlite3, it's very common for users to need a different build of SQLite than the one provided by default. I do provide a way for users to compile a custom version of SQLite, but the process is relatively painful. If Node.js exposed SQLite to users, I think it would be crucial to solve this need. Perhaps it can be solved with dynamic linking or something. Just be aware that you will inevitably run into this.

punkish commented 5 days ago

it's very common for users to need a different build of SQLite than the one provided by default

I second this assertion of @JoshuaWise and think it is very important to build this capability in the node's integrated version of SQLite. I want the same (possibly customized) version of SQLite powering my node application as well as my CLI so there are no surprises when I am building my application or analyzing my data. For this reason itself, it is very important that not only can I customize my SQLite with the extensions I need, but do it in multiple locations, and as easily as possible because I am not a C programmer. Even the most obvious steps for a C programmer can be very confusing for me.

asg017 commented 5 days ago

I have a slightly different view: node:sqlite should be a no-nonsense "just works" SQLite package that application developers use when they want to read/write to a SQLite database. They would mostly care about:

  1. Having an up-to-date SQLite version
  2. Ability change a few settings (like PRAGMA and runtime configuration options)
  3. Using SQLite extensions to add new functionality (either as application defined functions or dynamically loadable extensions)

If someone wanted something outside of that, they should use a 3rd party package like better-sqlite3. This would include:

This is kindof how Python works: the builtin sqlite3 library is a "just works" version of SQLite, while if someone wanted a customized build, they'd need to reach for a 3rd party package like pysqlite3.

That means that the SQLite library offered in node:sqlite needs to precisely define which compilation arguments to include and default settings, which is difficult. But I think offering an API that can dynamically handle multiple SQLite libraries would be more difficult

tniessen commented 5 days ago

If someone wanted something outside of that, they should use a 3rd party package like better-sqlite3. This would include:

  • A different SQLite version
  • Custom compilation arguments
  • A completely different SQLite library (ie sqlciper or libsql)

FWIW, the SQLite developers strongly recommend against having more than a single copy of SQLite per process, see How To Corrupt An SQLite Database File.

cjihrig commented 5 days ago

FWIW, the SQLite developers strongly recommend against having more than a single copy of SQLite per process

That seems like it would raise a more immediate question about WebStorage and its impact on code already using a userland sqlite module. How do Deno/Bun/Python etc. deal with that?

asg017 commented 5 days ago

@tniessen ooh thanks for sharing, hadn't seen that before!

But in this case, if Node.js statically links a "golden" SQLite library for node:sqlite, would other dynamically linked SQLite libraries from 3rd party npm packages (better-sqlite3, libsql, sqlcipher, etc) be effected as well? It's hard to tell from that link, they only say "linked" but don't specify if this would happen in dynamically linked version of SQLite with a statically linked version.

My hunch is that there could be only 1 statically linked copy of SQLite, but other dynamically linked SQLite copies would be fine? Especially since this pattern is used in Python/Deno (which uses SQLite for localStorage + KV), but I'm no expert here

GeoffreyBooth commented 5 days ago

would other dynamically linked SQLite libraries from 3rd party npm packages (better-sqlite3, libsql, sqlcipher, etc) be effected as well?

I would think that Node’s internal SQLite would use different files on disk for storing the database than the files on disk used by these various packages? And if that’s the case, then I don’t see why they would interfere with each other.

punkish commented 5 days ago

right, as long as I continue to use the SQLite driver with which I created my db, I would not expect any problem. So, if I use node:sqlite to create my files, I continue to use it with those files. And, if I use a custom version of SQLite with, say, better-sqlite3, I would continue to happily use that combo. Using the correct driver with the files they created would be my responsibility