Meta VFS for adding storage options

rhashimoto commented 3 years ago

UPDATE: Code links in this comment are stale. Check the thread for the latest info.

I have an experimental sql.js branch with changes that enable defining a SQLite VFS in Javascript. It's a platform for creating a VFS, so I've been calling it a meta-VFS. The idea is to enable more storage options in the browser, beyond the current in-memory filesystem, as has been suggested a number of times, e.g. #302.

The branch doesn't define any VFS itself, but in a separate repo I have examples of creating an in-memory VFS and an IndexedDB VFS (using it as a block device) with only Javascript. Neither is a huge amount of code, less than 350 lines for IndexedDB and no function over a page, so I think not difficult to understand.

None of this is production-ready code (mainly because of lack of testing), and the branch is not currently suitable for a pull request. But I wanted to start a discussion on whether adding something like this makes sense, and if it does how to incorporate it.

I know that the maintainers have limited time and adding multiple storage options to the code base isn't going to help with that. So my idea is to provide only the capability for storage extensions, i.e. just the meta-VFS, to allow developers to create the extensions themselves outside the scope of the project.

There are some drawbacks to this idea, among them:

Some browser storage options have only asynchronous APIs (e.g. IndexedDB), which require using Asyncify and its signficant space and time costs. It is likely that both Asyncify and non-Asyncify builds would be needed, so at least one more build than provided now.
SQLite is sort of a worst case for Asyncify. The .wasm file is nearly twice the size and the execution cost is said to be ~5x (plus any cost for slower storage). This may not be acceptable for many applications, and so may not be worth the effort to provide it. The future might bring native coroutines/continuations to WebAssembly, which we could just wait for.
The current sql.js application API is synchronous. An asynchronous application API is needed to use asynchronous storage. This could be a substantial effort all by itself (and one I would rather leave to someone else).
Splitting responsibility for a platform and its extensions can lead to support issues, like finger pointing when problems occur.

Any thoughts, especially from maintainers?

(As an aside the above mentioned branch uses Emscripten modularization to build ES6 modules, so perhaps of interest for #284 and #438.)

phiresky commented 3 years ago

Can you say how you think using the SQLite VFS Api compares to using the Emscripten filesystem api?

I ask because I just built a VFS that fetches chunks of the database using HTTP range requests using just the emscripten filesystem api and it works fine with only a few changes to sql.js: https://github.com/sql-js/sql.js/commit/837d25db2fa248033016f08cd395989a0794ad18

The emscripten FS api is very similar to the sqlite one, it's just one level lower in providing a POSIX api for sqlite instead of the sqlite-specific api.

demo: https://phiresky.github.io/youtube-sponsorship-stats/?uploader=Adam+Ragusea

It seems like Emscripten also provides an IndexedDB file storage backend that could maybe be used in a similar fashion? https://emscripten.org/docs/api_reference/Filesystem-API.html#filesystem-api-idbfs

rhashimoto commented 3 years ago

@phiresky I haven't written an Emscripten FS, so this is based on my understanding of it.

As you note, the SQLite VFS interface is higher level than the Emscripten FS interface, so I think it's a little easier to understand and implement. A more fundamental difference, however, is there is no way I know of to use an Emscripten FS to bridge synchronous calls to an asynchronous API, whereas SQLite VFS can use Asyncify.

I assume that to satisfy this synchronous requirement you use synchronous XMLHttpRequest, is that right? Sync XHR can only be used in a Worker. That's not a horrible restriction as SQLite should probably only be deployed for production in a Worker anyway, but it is an extra step for a developer who is just experimenting.

It's great that you can use sync XHR; that's fine for HTTP requests. One might think that we could use sync XHR as a bridge to any asynchronous API by installing a Service Worker. That would be wonderful, but unfortunately at least Chrome and Safari don't support this.

Emscripten does provide IDBFS, but because of the synchronous restriction, it is actually MEMFS with manual two-way mirroring to IndexedDB. Or viewed another way, it's IndexedDB with everything cached in memory. So it won't support a filesystem larger than available memory, even if no files are actively being used. In addition, I believe that the current implementation of IDBFS stores the entire file as an object, so sync will slow down as the files grow even if the changes are small (this behavior could be changed with a reimplementation).

I don't believe one can implement a blocking lock with an Emscripten FS (other than with sync XHR to an external server). That means that if you want to support concurrent access to an IDBFS database, e.g. a user has two tabs open on your page, you have to arbitrate above the filesystem level. You can hack together a locking scheme at the library level, but how do you know whether you need a shared read lock or an exclusive write lock? It's a lot easier and more elegant to handle locks exactly when SQLite tells you it needs them and why.

My toy IndexedDB VFS will support a database up to the browser storage limit, which could be larger than available memory. It doesn't support locking either but that's a design choice - the framework itself doesn't exclude it. I think adding support for concurrent access from same-origin contexts is quite doable.

Many applications can live within the constraints of Emscripten FS - that is not in dispute - but extending via the SQLite VFS opens potentially useful possibilities outside those constraints. If you want to mount over a WebSocket or encrypt with WebCrypto, for example, I don't know how you do that with Emscripten FS. I also think that the SQLite VFS is conceptually simpler, and so is more accessible (e.g. easier to read existing implementations) because it is higher level and only includes what SQLite uses.

phiresky commented 3 years ago

Thanks for the detailed response. By the way, did you try using Atomics.wait() with a separate worker for communication instead of asyncify?

rhashimoto commented 3 years ago

By the way, did you try using Atomics.wait() with a separate worker for communication instead of asyncify?

I have not tried Atomics at all. I generally try to avoid features not supported across the evergreen browsers, but because you mentioned it I see that the holdout Safari has it in preview behind a flag so hopefully that will be coming soon.

It seems like there might be a nice development path in initially creating a VFS with Asyncify, debugging and testing in a single context, then moving the VFS class to a separate worker and replacing it with a proxy that no longer needs Asyncify (or never moving it if performance isn't an issue).

phiresky commented 3 years ago

I just tried it and using Atomics for async stuff (without asyncify) works!

In my case I'm trying to write a virtual table to access the DOM right now, which has to be async since my SQLite is running in a Worker but the DOM is only accessible in the main thread.

Here's some (shitty) code:

sqlite.worker.ts

// sends a request to the main thread via postMessage,
// then synchronously waits for the result via a SharedArrayBuffer
function doAsyncRequestToMainThread(request: MainThreadRequest) {
  // todo: dynamically adjust this for response size
  const sab = new SharedArrayBuffer(1024 * 1024);
  // first element is for atomic synchronisation, second element is the length of the response
  const metaArray = new Int32Array(sab, 0, 2);
  metaArray[0] = 1;
  // send message to main thread
  self.postMessage({
    action: "eval",
    notify: sab,
    request,
  });
  Atomics.wait(metaArray, 0, 1); // wait until first element is not =1
  const dataLength = metaArray[1];
  console.log("got response of length", dataLength);
  // needs to be copied because textdecoder and encoder is not supported on sharedarraybuffers (for now)
  const dataArray = new Uint8Array(sab, 2 * 4, dataLength).slice();
  const res = new TextDecoder().decode(dataArray);
  return JSON.parse(res);
}

main.ts

worker.addEventListener("message", async (e) => {
if (e.data && e.data.action === "eval") {
    const metaArray = new Int32Array(e.data.notify, 0, 2);
    const dataArray = new Uint8Array(e.data.notify, 2 * 4);
    const response = JSON.stringify(await handleRequest(e.data.request));
    const text = new TextEncoder().encode(response);
    dataArray.set(text, 0); // need to copy here because:
    // sadly TextEncoder.encodeInto: Argument 2 can't be a SharedArrayBuffer or an ArrayBufferView backed by a SharedArrayBuffer [AllowShared]
    // otherwise this would be better:
    /*const encodeRes = new TextEncoder().encodeInto(response, data);
    if (encodeRes.read !== response.length) {
    console.log(encodeRes, response.length)
    throw Error(`not enough space for response: ${response.length} > ${data.byteLength}`);
    }*/
    metaArray[1] = text.length;
    Atomics.notify(metaArray, 0);
}
});

There's some restrictions:

SharedArrayBuffer only works on https and with some weird headers set
kinda hard to implement robustly
needs two threads, e.g. main thread + worker thread

Now I can do this:

So useful 🤣

lovasoa commented 3 years ago

Hello ! First, thank you all for the great work ! I wasn't up to date on SharedArrayBuffers and Atomics, and your last message blew my mind. I didn't think there was a way to synchronously wait for an asynchronous event. This opens up so many possibilities.

@rhashimoto I am willing to merge a pull request with your VFS interface once it's fully developed, tested, and documented. And I do agree that the interface and its various implementations for different storage backends should be maintained separately.

Of course, having a synchronous IndexedDB API available in web workers would make everything infinitely easier. MDN states that

[IndexedDB's] synchronous API was intended for use only with Web Workers but was removed from the spec because it was unclear whether it was needed. However, the synchronous API may be reintroduced if there is enough demand from web developers.

I am not sure to who this interest should be formulated, but I think we should at least make ourselves known from the standard committee in charge of that.

And I do agree that using Emscripten's IndexedDB backend doesn't make much sense. Its behavior can easily be replicated by library users without any change in sql.js itself.

rhashimoto commented 3 years ago

Here are some timings for various VFS combinations. For each combination I did some simple operations:

Create a database
Create a table
Insert 1300 rows
Sum over one of the columns
Close the database
Open the database again
Sum over one of the columns again
Close the database again

I used three VFS implementations. "unix" is the default in-memory filesystem that SQL.js already uses. "mem" is a synchronous in-memory VFS I wrote, and "async mem" is an asynchronous in-memory VFS I wrote.

I tested with two builds, one regular build that only permits synchronous VFS, and one with Asyncify that allows either synchronous or asynchronous VFS.

I noticed that a lot of the filesystem traffic is for the journal file, so I also tested with PRAGMA journal_mode settings DELETE (default) and MEMORY.

Here are average times over 100 iterations:

VFS	Build	Journal Mode	Time (ms)
unix	std	DELETE	359
		MEMORY	260
	Asyncify	DELETE	537
		MEMORY	408
mem	std	DELETE	232
		MEMORY	205
	Asyncify	DELETE	394
		MEMORY	347
async mem	Asyncify	DELETE	2884
		MEMORY	1281

I see these interesting things in the results:

The Asyncify build is slower than the standard build with any VFS (roughly 1.6x for a synchronous VFS), but the asynchronous VFS (which will only run with Asyncify) was vastly slower than anything else.
Everyone should probably use PRAGMA journal_mode = MEMORY (or OFF if your usage allows it) with any in-memory database.
My synchronous in-memory VFS turned out to be faster than the standard VFS. I didn't expect that; I thought it would be superfluous except as an example. I'm not sure whether the extra cycles are on the SQLite side or the Emscripten side or both.

rhashimoto commented 3 years ago

I created a demo page with several SQLite VFS combinations of filesystems (default, memory, IndexedDB) and builds (standard, Asyncify). It seems to work on Chrome, Firefox, and Safari (haven't tried Edge). Note that it restores your SQL from the previous session and you can execute a portion of your SQL by highlighting it first. This is all just to get some idea of what can be done, what things might cost, and what is the best way to implement them.

The interesting VFS is obviously IndexedDB - the others are mainly for comparison with it. IDB is used like a block device and is limited by the origin storage limits imposed by the browser, instead of by memory.

Unlike the other VFS implementations, when using IDB after you reload the page the database is still there. If multiple tabs are open, completed transactions in one tab are visible across all tabs. I implemented a crude locking scheme that should make this safe (more on this below).

IDB does feel quite a bit slower. Of course, it's always going to be slow compared with RAM; it takes a performance hit from using Asyncify and has IDB on top of that. I didn't implement any caching other than for checking file size, so all reads and writes use an IDB transaction. Looking at the access pattern, caching the first block of a database file might be a win because SQLite keeps checking a change counter there. I'm not sure how much benefit a general LRU-type cache would be.

A lot of the slowness (most, I think) comes from the default journal_mode setting, which journals to the same VFS as the database itself. The access pattern for a journal file is quite bad when storage is slow and you don't have a cache (adding a 1-block cache for journal files is probably a big win). Using PRAGMA journal_mode = MEMORY (or OFF) isn't as clear-cut a decision as it is with an in-memory database because your IDB database can be left corrupted (instead of completely gone) in case of a crash. Maybe what you really want is to journal into memory but buffer all database writes until the end of a SQLite transaction, then write everything in a single IDB transaction (so you can lose the journal but the database is unharmed). There must be people who know a lot more about this than I do.

Locking is a rabbit hole all by itself. The Web Locks API is nearly perfect for VFS...except that it isn't supported on Safari. Service Worker is supported across the evergreen browsers, but using it for locking across windows and workers would take some work. In the demo I used a lock file in IDB as an exclusive lock (i.e. multiple readers not allowed). I think it works and the implementation is small and all in one place (unlike a Service Worker approach), but this is not a high-quality solution (ignoring the lack of shared locks, which could be addressed) - it polls to test availability (inviting starvation) and it can leave a lock file in place on an abnormal termination (orphaned lock). So if you leave a transaction open and close the tab, you may have to use browser Dev Tools to clear IndexedDB for the page origin (just the "locks" database would be sufficient) for it to work again.

Taytay commented 3 years ago

Nice work on these benchmarks! I'll have to play with them - I've been wanting to compare journal modes and the like to have a set of recommended best practices for sql.js users.

One thing I'm immediately curious about is the overhead percentage for various aspects of your test. For example, I would expect typical usage to involve creating/opening a DB, performing a LOT of transactions, and then closing the DB when the session is done. In your benchmarks, I would expect there to be a lot of overhead in The opening and closing of the DB...that's likely intentional, but it's the first thing I want to tweak and test for my own expected usage.

For our (planned) usage of sql.js, we are okay with a local, in memory database losing data, but ideally, before we talk to our server, we will need a consistent flushed-to-disk version of the DB so that a crash after that would restore us to a consistent state. This would probably allow us to use atypical database semantics that give us the speed of PRAGMA=off, but occasionally take a hit when flushing the DB to disk before a server API call. Of course, if flushing everything to disk takes too long, we would need to write data safely and incrementally, and we'd be back to desiring a journal file of some sort that writes only what changed to disk in a consistent manner...so we are back to wanting a real "file system" that something like indexedDB affords. Boy, I wish there was a synchronous browser API to emulate the file system more performantly. :)

rhashimoto commented 3 years ago

@Taytay There isn't very much overhead for opening and closing a database, at least in terms of i/o. That seems unintuitive because we're trained to consider opening and closing files as expensive operations, but a database moves a lot of those costs (e.g. flushing) inside transactions which leaves much less to do outside. Here's a log of what happens on open:

xOpen idb
xDeviceCharacteristics 5414000
xDeviceCharacteristics 5414000
xSectorSize idb
xDeviceCharacteristics 5414000
xSectorSize idb
xRead idb 0 100
xFileControl 5414000 15
xFileControl 5414000 30

Everything there except xOpen and xRead just returns a constant, so this basically reads two IndexedDB objects (xOpen reads a small metadata object). For comparison, here's a SELECT * FROM tbl, where the table is completely empty:

xFileSize idb
xRead idb 0 8192
xDeviceCharacteristics 5414000
xRead idb 24 16
xFileSize idb
xRead idb 8192 8192
xDeviceCharacteristics 5414000

That takes three reads. One of them could have been read from cache but my VFS doesn't have a cache.

Here's a single row insertion REPLACE INTO tbl VALUES ('foo', 6):

xRead idb 24 16
xFileSize idb
xRead idb 16384 8192
xFileControl 5414000 20
xOpen idb-journal
xDeviceCharacteristics 5414000
xWrite idb-journal 0 8192
xWrite idb-journal 8192 4
xWrite idb-journal 8196 8192
xWrite idb-journal 16388 4
xWrite idb-journal 16392 4
xWrite idb-journal 16396 8192
xWrite idb-journal 24588 4
xWrite idb-journal 24592 4
xWrite idb-journal 24596 8192
xWrite idb-journal 32788 4
xDeviceCharacteristics 5414000
xWrite idb 0 8192
xWrite idb 8192 8192
xWrite idb 16384 8192
xFileControl 5414000 21
xSync idb
xDelete idb-journal
xFileControl 5414000 22
xDeviceCharacteristics 5414000

The numbers on the xRead and xWrite lines are file position and number of bytes. You can see that the operations on the database file are block-aligned and block-sized, except on the first block. In contrast, operations on the journal file are serial but are not block-aligned, so note that without a cache a write generally also requires reading first.

In my benchmark, I didn't wrap my insertions (or anything else) in a BEGIN...COMMIT block so inserting each row is a transaction. My guess is that's what dominates the timings. I mainly put the reopening in to make sure that SQLite couldn't use its own internal cache and the data really was making a successful round-trip.

Taytay commented 3 years ago

This is fascinating @rhashimoto. I love the foundation you have laid to allow you/us to so easily experiment with different implementations.

I'm now curious about even more! 1: If we're trying for a fully safe file-system aware implementation, I wonder how performant would WAL mode be? https://sqlite.org/wal.html 2: What are the expected semantics of SQLite for flushing changes to disk via the VFS? I read through some of the docs (https://sqlite.org/pragma.html#pragma_synchronous), but not in much dept. Assuming that we can rely upon calls to xSync as a form of flushing to disk, wouldn't that imply that we could perform the VFS operations in memory and then commit them to IDB only upon calls to xSync? If so, that seems like a massive win. ( Otherwise, the overhead of the file writes to the journal is extremely chatty, as you already pointed out. In your example, there were 10 writes to the journal file for a single small transaction. If each of those round trips to IndexedDB, we're gonna have a bad time. ) This is just based on loose assumptions about how xSync works, although admitedly we're pretty far over my head already. But that assumption is not supported by the list of calls in your output. I only see one call to xSync for the idb file...What journal mode were you using for that test?

This page is a good explanation of the assumptions that SQLite is making regarding the filesystem when using the journal. I think this could help us implement the VFS in a more performant way. (It also indicates why the Journal might be slower than the WAL): https://www.sqlite.org/atomiccommit.html

I think there's gonna be a good combo here that gives us the safety we want, with the file-system semantics that are IndexedDB friendly. 🤞 anyway...

rhashimoto commented 3 years ago

@Taytay Regarding WAL mode, that's a question for someone who knows more about database implementation and its trade-offs than I do. I probably only know enough to be horribly wrong.

Regarding the journal mode for those logs, it was DELETE, i.e. the default. I've been wondering myself why there isn't an xSync for the journal file, so if that's what you're puzzling over I'm puzzled, too. I do think you are correct that one should be able to cache data until an xSync.

I don't want to get too far down into the details of implementing any specific VFS in this thread. There assuredly will be many opportunities to optimize performance and tune for specific usages, like the ones you've mentioned. I'm highly gratified that this has sparked your and others' interest, and that was one reason I started the discussion, to find out if anyone else wanted this. I'm hoping to pivot to the other motivating reason, which is to figure out how to provide the platform for anyone to write their own VFS using only Javascript and without involving Emscripten.

I don't think I'm prepared to integrate Meta VFS into sql.js all by myself. The biggest issue is that using an asynchronous VFS requires an asynchronous API that sql.js currently does not have. We could start with only synchronous VFS support which still can be useful with the existing API (e.g. the Atomics experiment by @phiresky) but the best browser support (at least for now) and easiest development are unlocked with asynchronous. I'm not really interested in defining or implementing this new API; for the demo, I wrote my own simple API directly on the exported SQLite C API. So I'm hoping that perhaps some others can help out.

@lovasoa Would you accept a PR that only had the minimal Meta VFS code, with low-level tests written using the exported C API (i.e. via ccall/cwrap) and not the Database API? If not, is a branch a suitable way to proceed? I'm hoping that there is an integration path that progresses in smaller steps instead of everything at once.

There are some other things that make it frustrating to work in the sql.js tree, so much so that I moved my development to a repo I made from scratch. The restriction to ES5 Javascript is pretty painful, harder to read and maintain, and I don't know how anyone will write an asynchronous API (Promise is ES6, async/await even later). I played with adjusting the Emscripten flags to transpile to ES5 in release builds, which worked, but then updating the ESLint rules broke on the worker code which Emscripten doesn't process and that's when I gave up working directly in the tree. If someone good at tooling could take this on that would make things better all around. #438 and #284 would also make life a little nicer for those of us who like buildless development.

phiresky commented 3 years ago

One more thing I found while experimenting is pragma cache_size = 0: SQLite by default has its own internal page cache of 2MB, so if you want to really see what pages SQLite is reading you need to disable that. And it might make sense to disable it in any case since we might already be already holding stuff in our own RAM cache - or it might make sense to increase to be able to use it better when we have a "slow" storage backend like IndexedDB.

I just published the code and an article about my http-vfs and and my DOM virtual table by the way: https://phiresky.github.io/blog/2021/hosting-sqlite-databases-on-github-pages/ It shows an exact page read log for each query (based on the dbstat virtual table) which could be useful.

I think those many "weird" writes to the journal file are explained by this from the file format spec: https://www.sqlite.org/fileformat.html

It's probably writing the old page content, the page ID, and the checksum in three separate writes. So in your indexed DB thing you could align your rows with those three items.

you really want is to journal into memory but buffer all database writes until the end of a SQLite transaction, then write everything in a single IDB transaction

I mean that kinda sounds like WAL mode but with extra steps. The WAL file is literally just a set of pages that are overlaid over the main db file during READs.

Can't you always have an IDB transaction open that is committed and a new one created when xSync is called?

rhashimoto commented 3 years ago

Can't you always have an IDB transaction open that is committed and a new one created when xSync is called?

I think that IDB's auto-committing transactions make this complicated. It might work as-is if you require the application to be synchronous during an SQLite transaction, i.e. it won't do something like BEGIN TRANSACTION, read the database, asynchronously do something based on the read, and then write to the database. Otherwise I think you would need to keep chaining dummy read requests or cursor steps to prevent the IDB transaction from auto-committing. That feels wasteful, but maybe it wouldn't be so bad in practice.

seidtgeist commented 3 years ago

@phiresky Congrats on your very nice blog post.

@rhashimoto Off-topic, but I found wa-sqlite and its meta async VFS implementation extremely easy to use. Very well done. It was easy to build an HTTP Range-request based VFS (inspired by @phiresky 😉) without any local persistence or sequential pre-fetch.

Maybe litestream’s approach to WAL streaming is informative? https://litestream.io/how-it-works/

Litestream works by effectively taking over the checkpointing process. It starts a long-running read transaction to prevent any other process from checkpointing and restarting the WAL file. Instead, it continually copies over new WAL pages to a staging area called the shadow WAL and manually calls out to SQLite to perform checkpoints as necessary.

I don’t think it’s possible to have a long-running read transaction open due to the single thread nature of sqlite in WASM, but maybe something similar can be achieved on the VFS or library level.

jlongster commented 3 years ago

Hello! I have built a project that solves this problem as well. For the last couple years I've built a completely local app (https://actualbudget.com), and it works well with Electron because I have native bindings to sqlite. It's been bugging me to get it working well on the web though; up until now I've used an awkward system of writing down changes manually and occasionally flushing the entire db. I really just want SQLite to control writes.

I was very inpired by @phiresky's post about using SQLite over https. I realized the problem of IDB being async though. In early June I also discovered the technique of using SharedArrayBuffer and Atomics.wait to convert an async operation into a sync one. That completely unlocks the ability to provide an IDB backend (or any storage backend) for sql.js, which in my opinion is very promising.

From there I went off to build a solution, and it works very well! Excited to show you all my project: https://github.com/jlongster/absurd-sql. absurd-sql is a filesystem backend for sql.js that allows persistent storage, primarily IndexedDB for now (but could use something else later).

I have ported the beta version of my app to use this. This is using my persistent storage layer: https://app-next.actualbudget.com/, go ahead and "try demo" and inspect IDB storage to see it.

You also see my benchmark demo: https://priceless-keller-d097e5.netlify.app

This whole thing is a bit absurd, honestly, because browsers except Chrome use SQLite to implement IDB, so we're storing SQLite in SQLite. It's also absurd how much faster this is than using raw IDB.

I actually emailed @lovasoa in June when I saw this was possible, and he pointed me to this thread. I meant to post earlier, but I kept seeing new possibilities in my research and I didn't want to report back until I had confirmed some theories.

I have learned a lot in my research, and I have a lot of opinions about everything in this thread. I'm writing a long blog post explaining everything in detail, and will publicly launch the project tomorrow. I will link the post tomorrow when it's live.

Note: my work does require a few changes in sql.js, but not much. You can see my commits here: https://github.com/jlongster/sql.js/commits/master. I will open a PR soon

Until them, I will summarize my research here:

The goal of absurd-sql

My goal is to simply have a persistent version of sql.js. That's it. It should ideally be a turnkey solution for sql.js since everyone is already using that. I'm not interested in providing my own sql library and fragmenting the community.

It's not a goal to allow for people to write their on VFS's. This is a big difference with @rhashimoto's wa-sqlite library. These are quite different goals and there is room for both projects. For me, I just want something to plug into sql.js to make it persist writes (and not have to read all data into memory).

The filesystem is the API

Because I don't care about VFS's, whether or not I achieve my goal via a VFS or FS layer is just an implementation detail. After some research, I decided that using the filesystem a the API was lot better than using a VFS.

Why? A few reasons:

I don't have to implement much. The main APIs I have to worry about are open, close, lock, unlock, write, and read. And I can implement them in simple ways. There's a lot of little edge cases I just don't have to care about. For example, since we get to use the default unix VFS we get things like this for free: https://github.com/sqlite/sqlite/blob/dff34ab96f12640212ef249cba899f172f7d69d4/src/os_unix.c#L3398-L3401

That code zero-fills the buffer and returns the right value for a short read. I get all of that for free. Emscripten already has a bunch of work to simulate a unix-like environment, so it makes sense that we can just lean into that.

The most important reason is that the storage layer is accessible outside of SQLite. In my app, I download a user's existing SQLite file and write it to the filesystem. That's all I have to do; with an FS.writeFile all of the database is written in IDB in separate blocks. That's a lot harder with a custom VFS, if it's even possible. I don't know how you'd open an existing DB that hasn't been written yet.
What's neat about using the filesystem is stuff like this:

    sqlFS = new SQLiteFS(SQL.FS, new IndexedDBBackend());
    SQL.FS.mkdir('/blocked');
    SQL.FS.mount(sqlFS, {}, '/blocked');

Now everything under /blocked will be handled with our persistent layer, but everything outside of it is still in memory. You have multiple storage options at once; in my app I have some files stored in other dirs that are persisted differently. I can even use stuff like symlinks to set things up. It's really nice.

However, I will admit that I did need to customize the unix vfs. To do this I added this small C file to sql.js: https://github.com/jlongster/sql.js/blob/master/src/vfs.c. There are two reasons for this:

We want to set blockDeviceCharacteristics. That might not really be necessary though. If it's not, I don't think we need any of that C code
We need to manually wire up the lock methods for now. This is because emscripten doesn't allow the FS to handle the fcntl syscall for locking. This should be fixed in emscripten, and once it is we don't need that code

Avoid Asyncify at all costs - SharedArrayBuffer is the answer

To solve the async problem, the two solutions are Asyncify or SharedArrayBuffer. Performance is already a sensitive issue; there's no way I'm going to use asyncify and add a lot more overhead. It also scares me in general to include an invasive transform in my stack; I'd much rather use SQLite compiled directly.

SQLite being completely synchronous is a feature. It's simple and fast model.

However we do need to perform async work inside of the FS methods. SharedArrayBuffer and Atomics.wait is the answer, and it works extremely well. I spawn a web worker and use a simple reader/writer abstraction to pass data back and forth in a SharedArrayBuffer, and internally it uses Atomics.wait to block if data is not available yet. This allows synchronization between the threads, and it's all sync.

It's incredibly fast and works well.

Atomics.wait is also key for certain performance optimizations - read below.

Safari is the only browser that doesn't enable SAB yet, but it will in the future. Until then, we can support it with a fallback implementation. In that mode, it reads the entire DB into memory on open so that reads work, and for writes it queues them up and writes them async in the background. The only problem with this is multiple connections; if another connection has performed a write while it's trying to write, it must bail and it can never write again. In practice, this means if you have the app opened in two tabs, only one of them can perform writes. This is acceptable to me for now until SAB is enabled.

Reaching amazing perf

IndexedDB is a slow database. The only operation that is somewhat fast is advancing a cursor; everything else (opening a transaction, opening a cursor, etc) is very slow (especially in Chrome). We cannot afford doing any of these things for every single read or write.

I found some critical optimizations, and I'll show the difference between my project and wa-sqlite.

My goal was to have a table like CREATE TABLE kv (key TEXT, value TEXT); and insert 1,000,000 items and read them all. A query like SELECT COUNT(value) FROM kv forces SQLite to read in all the data.

First let's do some writes. absurd-sql (with sql.js) code:

  db.exec('BEGIN TRANSACTION');
  let stmt = db.prepare('INSERT INTO kv (key, value) VALUES (?, ?)');
  for (let i = 0; i < 1000000; i++) {
    stmt.run([uid(i), ((Math.random() * 100) | 0).toString()]);
  }
  db.exec('COMMIT');

wa-sqlite code:

  await sqlite3a.exec(db, 'BEGIN TRANSACTION');

  const str = sqlite3a.str_new(db);
  sqlite3a.str_appendall(str, 'INSERT INTO kv (key, value) VALUES (?, ?)');
  const prepared = await sqlite3a.prepare_v2(db, sqlite3a.str_value(str));

  for (let i = 0; i < 1000000; i++) {
    let result = sqlite3a.bind_collection(prepared.stmt, [
      uid(i),
      ((Math.random() * 100) | 0).toString()
    ]);

    await sqlite3a.step(prepared.stmt);
    await sqlite3a.reset(prepared.stmt);
  }
  await sqlite3a.exec(db, 'COMMIT');

Please let me know if I did something wrong there. Both of these ran with a block size of 8192, cache_size=0, and journal_mode=MEMORY. I ran this in Chrome on my 2015 macbook pro. Results:

wa-sqlite: 106.9s
absurd-sql: 5.57s

I'm honestly a little skeptical here -- I didn't expect writes to be so different because I thought we both queued them up and did a bulk write at the fsync point. But there it is.

Now for reads. absurd-sql code:

  stmt = db.prepare(`SELECT SUM(value) FROM kv`);
  while (stmt.step()) {
    row = stmt.getAsObject();
  }
  stmt.free();

wa-sqlite code:

  await sqlite3a.exec(db, `SELECT SUM(value) FROM kv` data => {});

wa-sqlite: 5.65s
absurd-sql: 1.98s

Again let me know if the wa-sqlite code is wrong. I'm not posting this to try to put down wa-sqlite; I'm writing down my research so that everything can benefit and maybe it can do the same optimizations.

I also benchmarked it against raw IndexedDB, meaning if implemented the same code against IndexedDB directly. With the speed it has, it puts raw IndexedDB to shame. Here are some graphs that illustrate it.

(All of the benchmarks are here, particularly the query files: https://github.com/jlongster/absurd-sql/tree/master/src/examples/bench)

These are the same benchmarks as above, first doing the writes:

And them doing the SUM query:

So what are the optimizations?

I realized that the blocking Atomics.wait API opens up something amazing: long-lived IndexedDB transactions. Because the worker thread can block the whole thread, we can open a transaction and it can block the thread from auto-committing it. This is key.

That means if we do SELECT SUM(value) FROM kv and SQLite does a lot of read requests, we can do them all on one IndexedDB transactions. That is a HUGE perf boost.

And now that we have a single read transaction, if we doing lots of sequential read requests we can open a cursor and advance it! That is another huge perf boost. See the read implementation here: https://github.com/jlongster/absurd-sql/blob/master/src/indexeddb/worker.js#L184

So lots of sequential reads will just open one cursor in a transaction and iterate though.

Similar optimizations occur with writes; when SQLite locks the file for writing, we open a readwrite transaction and do everything in a single transaction.

Proper locking semantics

Another key thing from my research: locking is super important. Since each tab in the browser will be a separate connection, we have to make sure that we aren't going to overwrite data.

We already have IndexedDB transactions semantics to work with. readonly transactions can run in parallel, while readwrite transactions with only run one at a time. We can use this to implement locking.

Here's the method that implements locking, and the comment above it is helpful: https://github.com/jlongster/absurd-sql/blob/master/src/indexeddb/worker.js#L423

When a SHARED lock is requested, we open a readonly transaction. It's fine if many SHARED locks exist at once, and that works with parallel readonly transactions. However when a write lock is requested, we open a readwrite transaction. Once that transaction is running, we know we control the database because of IDB semantics.

The only problem is that another thread might have written data down between the time that we requested a readwrite lock and got it. SQLite already gives us this information! Once we've grabbed a readwrite transaction, we can read some bytes from the file which represent SQLite's "change counter" (see docs). If that counter is the same as what we had when we requested a write, it's safe to write!

This is only possible because we have long-lived IDB transactions because we can open a readwrite transaction once and make sure we have it over the course of the write lock. If we had to reopen it, it would destroy our ability to safely write to the db and it would get easily corrupted.

So Atomics.wait unlocks this as well.

The key thing is that we don't maintain any lock state. It's all represented as IDB transactions. The fantastic thing about that is that a db can never be left permanently locked. I've noticed that browsers will forcefully unlock IDB when you refresh the page, i.e. kill all transactions. We gain the assurance that all SQLite locks will also be cleared.

Always use an in-memory journal and no WAL

A WAL doesn't make sense. In our case, either way data is going to end up in IDB. And while reading through the SQLite code, a WAL seems to require shared memory via mmap. There are assumptions is makes that I don't think we can fulfill. There's just no reason to use it. We just take writes, buffer them up, and in fsync flush them out.

Anything other than an in-memory journal also doesn't make sense. We don't need hot journals. IDB already gives us an atomic guarantee -- either the writes worked or they didn't. We still need the in-memory journal so we can do ROLLBACK though. (A DELETE journal_mode will work, but it's a lot slower since it's writing the journal out the IDB)

Conclusion

Anyway, sorry for this huge dump. It's been a lot of research! I'm happy to discuss this more, I probably missed some stuff. I'm going to release my project tomorrow. I'm really happy that this actually worked out. Again you can see it working here: https://app-next.actualbudget.com/.

lovasoa commented 3 years ago

Thank you very much for this research. I'm very excited about your upcoming PR. It will give a common interface on top of which we can have both an IndexedDB and an HTTP backend as independent libraries, which is great.

emscripten doesn't allow the FS to handle the fcntl syscall for locking. This should be fixed in emscripten, and once it is we don't need that code

Did you contact @kripken or open an issue in emscripten about that ? I'm not a huge fan of adding C code here...

jlongster commented 3 years ago

@rhashimoto I'm been thinking about and I think you can benefit from long-lived transactions too. Even though your functions are async, when you return data to SQLite if it is going to continuously read more data it's going to call back into your VFS synchronously. So I think I can store an opened transaction locally, and on the next read check if it's still available. I think you'd see similar perf if you did that (not 100% of write transactions yet, haven't thought through it)

You'll have to somehow check if the transaction gets closed though.

jlongster commented 3 years ago

Did you contact @kripken or open an issue in emscripten about that ? I'm not a huge fan of adding C code here...

Not yet, and I totally agree. It's very minimal: https://github.com/jlongster/sql.js/blob/master/src/vfs.c

But yes ideally we wouldn't have to manage the locks like that. It's also annoying because we have to manually register the FS lock/unlock methods so they can be called, and there's potential for memory leaks because of how we have to manage everything.

Will followup with this

rhashimoto commented 3 years ago

That's really nice work, @jlongster!

I love how the locking can be handled so elegantly with IndexedDB transactions; I think that's my favorite part.

Your wa-sqlite benchmark code looks reasonable. The IndexedDB VFS in wa-sqlite has not been tuned for performance because I wanted to keep the sample code simple. In particular, I think it gets crushed in the write benchmark because because its cache implementation is too crude. That being said, however, your approach doesn't depend on a cache so it sidesteps all those tuning problems completely.

Anything other than an in-memory journal also doesn't make sense.

This statement seems a little strong. There could be memory-intensive apps for which a large transaction might not fit in memory.

@rhashimoto I'm been thinking about and I think you can benefit from long-lived transactions too.

Yes, your post got me thinking about that, too!

jlongster commented 3 years ago

Thanks @rhashimoto! It's too bad I didn't find your project until late into my research because I would have reached out earlier. Thank you for all your work!

Yeah, I try to do as little as possible, so depending on SQLite's cache, let IDB do the locking etc. It leads to some nice guarantees. I am also happy about how locking worked out; I had no idea how to solve the problem of an orphaned lock before that, and from what I've seen with IDB it was way too easy for that to happen if I tried to manage that state myself.

This statement seems a little strong. There could be memory-intensive apps for which a large transaction might not fit in memory.

Totally fair. I don't fully understand what happens with large transactions tbh, and how cache spilling works. I won't say "doesn't make sense" again but more "this is the recommended". Need to think more about what happens with large transactions. (In my lib, opening a journal will be very slow because it spins up a new worker and connects to a new database. each file has its own db & worker, so it will greatly slow down transactions)

jlongster commented 3 years ago

Just realized I never link to my post, it's here: https://jlongster.com/future-sql-web

I'll add a link to wa-sqlite from my project as an alternative

Sorry I have put a PR to sql.js yet, today has been crazy busy. Will follow-up tomorrow :)

Taytay commented 3 years ago

@jlongster : I have been following your work for some time now. I'm very excited about what you've done with absurd-sql! It fills a previously important missing piece for sql.js and its adoption, and then manages to do so in a surprisingly performant manner. Thanks for taking the time to dive into your assumptions here, and nice work!

jlongster commented 3 years ago

Thank you! Thanks for others here doing this research too!

I am SO sorry I dropped the ball here. I had put off a lot of stuff to get this launched, so I was really busy catching up on things for the last couple weeks. Going to work on a PR for emscripten and then will open a PR here too, and link the emscripten PR.

jlongster commented 3 years ago

Opened up a PR! Sorry for the wait. Will keep up with this from now on. https://github.com/sql-js/sql.js/pull/481

rhashimoto commented 2 years ago

FYI Safari apparently re-enabled support for SharedArrayBuffer in release 15.2 last month. This means that the Atomics trick used by @phiresky and @jlongster to call asynchronous APIs (e.g. IndexedDB) synchronously should be usable across the four major evergreen browsers.

This feature requires setting COEP/COOP headers, which does put some restrictions on what a web app can do. For example, this can interfere with third-party OAuth and payment integrations.

Tails commented 2 years ago

I am looking for a way to use a storage backend such as Torrent or Sia/SkyNet, so both @phiresky XHR HTTP backend for fetching, as @jlongster's persistenc. Is there a way your works could be merged or combined?

Would it be easier to include the HTTP backend to AbsurdSQL, or easier the other way around?

I also found this: https://github.com/riyaz-ali/sqlite3.js by @riyaz-ali.

rhashimoto commented 2 years ago

I am looking for a way to use a storage backend such as Torrent

You could also look over this unmerged SQLite over WebTorrent PR for a different SQLite library.

Tails commented 2 years ago

And there is now also this: https://boredcaveman.xyz/post/0x2_static-torrent-website-p2p-queries.html https://gitlab.com/boredcaveman/p2psearch

wes-goulet commented 2 years ago

Chrome-based browser have implemented the Access Handle for Origin Private File System (and supposedly it's in development for Safari). I'm wondering if that would be a more performant alternative to using IndexedDB... seems like that's the use case the authors had in mind:

Distribute a performant Wasm port of SQLite. This gives developers the ability to use a persistent and fast SQL engine without having to rely on the deprecated WebSQL API.

Would it be possible to combine this meta-VFS with an origin private file system implementation?

rhashimoto commented 2 years ago

Safari was actually first to ship Access Handle in its production channel. Chrome and Edge made it to production last month. It's Firefox where it is still in development.

I played with it as a SQLite back-end a couple months ago in Chrome Canary and my benchmarks had it slower than IndexedDB at that time (though a lot nicer to program with).

zzyhdu commented 2 years ago

I only see one call to xSync for the idb file...What journal mode were you using for that test?

The SQLITE_IOCAP_SEQUENTIAL property means that information is written to disk in the same order as calls to xWrite(). So if you return 'SQLITE_IOCAP_SEQUENTIAL' from xDeviceCharacteristics, sqlite will omitted a xSync for journal.

quolpr commented 2 years ago

Hey guys! I was able to achieve the same performance for wa-sqlite as absurd-sql has. I created new VFS for wa-sqlite and made the same optimization as absurd-sql has — smart cursor logic, delay write to the time of transaction commit.

So, I adopted James benchmark from the comment https://github.com/sql-js/sql.js/issues/447#issuecomment-897326740, and got almost the same amazing performance for both absurd-sql and wa-sqlite versions. You can run it here — link, and the source code is located here — trong-orm/Benchmark.tsx. Here is a shortcut of benchmark:

runInTransaction(db, async () => {
  const time = new Date().getTime();
  setLogs((l) => [...l, "Start inserting..."]);

  for (let i = 0; i < 100; i++) {
    await runQuery(
      db,
      sql`INSERT INTO kv (key, value) VALUES ${sql.join(
        [...Array(10_000)].map(
          () =>
            sql`(${makeId()}, ${(
              (Math.random() * 100) |
              0
            ).toString()})`
        )
      )}`
    );
  }

  runAfterTransactionCommitted(db, async () => {
    setLogs((l) => [
      ...l,
      `Done inserting in ${(new Date().getTime() - time) / 1000}s`,
    ]);

    const summingTime = new Date().getTime();

    setLogs((l) => [...l, `Summing...`]);
    await runQuery(db, sql`SELECT SUM(value) FROM kv`);
    setLogs((l) => [
      ...l,
      `Done summing in ${(new Date().getTime() - summingTime) / 1000}s`,
    ]);
  });
})

Both of these ran with a block size of 8192 and journal_mode=MEMORY (as James did). But I set cached_size to -100 cause when I was setting it to 0 absurd-sql was not handling page_size correctly. Or was just breaking 🤔 Not sure why, maybe some issues with zero block of sqlite reading/writing race condition.

I usually get such values for absurd-sql on chrome with M1 MAX:

Clearing data first...
Clearing done!
Reading pragma...
[
  [
    {
      "cache_size": -100
    }
  ],
  [
    {
      "journal_mode": "memory"
    }
  ],
  [
    {
      "page_size": 8192
    }
  ]
]
Reading pragma done!
Start inserting...
Done inserting in 5.863s
Summing...
Done summing in 0.553s

And such values for wa-sqlite new VFS:

Clearing data first...
Clearing done!
Reading pragma...
[
  [
    {
      "cache_size": -100
    }
  ],
  [
    {
      "journal_mode": "memory"
    }
  ],
  [
    {
      "page_size": 8192
    }
  ]
]
Reading pragma done!
Start inserting...
Done inserting in 6.765s
Summing...
Done summing in 0.564s

Here is the result with cache=0 for wa-sqlite new VFS, just in case:

Clearing data first...
Clearing done!
Reading pragma...
[
  [
    {
      "cache_size": 0
    }
  ],
  [
    {
      "journal_mode": "memory"
    }
  ],
  [
    {
      "page_size": 8192
    }
  ]
]
Reading pragma done!
Start inserting...
Done inserting in 7.84s
Summing...
Done summing in 0.718s

You can see that difference between wa-sqlite new VFS and absurd-sql are not so big.

So, what does it mean? We don't need to use SharedArrayBuffer anymore, that unlocks much more features that we were not able to afford with absurd-sql (like embed YouTube video or make some third party payment integrations). And it seems Asincify doesn't have so much performance impact if IDB used as a backend (not sure why btw, cause with memory test the impact was much more visible).

I hope I didn't make any mistakes in benchmarking or in implementations, so I will be glad if you will point me to something that I missed.

Also, I need to notice — I used patched version of absurd-sql with this PR applied — https://github.com/jlongster/absurd-sql/pull/56 . It actually waits when the COMMIT will write data to disk, so INSERTing could be measured properly. Otherwise, it will impact on next SELECT perfomance (cause SELECT will need to wait when IDB will finish writing of previous INSERT query)

Also, you can play around with more complex demo — https://main--trong-example.netlify.app/

https://user-images.githubusercontent.com/7958527/179220391-c60cc4a2-98d7-4020-85af-f70492d3ee8a.mp4

quolpr commented 2 years ago

Also, the code for VFS is pretty messy — I am going to refactor it soon 🙂 Also, I appreciate all the work that @rhashimoto did — his discussions in repo were very helpful! But the downside is that wa-sqlite is under GPLv3, so it restricts the commercial usage a lot 😐. At some cases, you will need to open-source your closed source code, or ask the author to give you some other license.

I hope somebody (or maybe me 😅) will make MIT version of sqlite + Asyncify (with all respect to rhashimoto! But such license really restricts code usage a lot). Overall, not so much job needed to be done on the first look 🤔

rhashimoto commented 2 years ago

@quolpr I'm impressed with the effort that must have gone into this, especially writing a new IndexedDB VFS. I have these thoughts:

I think this particular benchmark is bound by storage speed. Most database usage is probably bound by storage speed so if you only have one benchmark then that's not a bad one, but I think you might see more performance differences between Asyncify and non-Asyncify code with queries that are computation bound. Even if these measurements are accurate and fair, absurd-sql might still have a substantial performance edge on a different workload.
Performance is not the only calling card of absurd-sql. It is also notable that it retains the widely-adopted SQL.js API, and keeps a WASM size half that of an Asyncify-ed SQLite.
There are some proposals to make COOP/COEP less restrictive while still allowing SharedArrayBuffer. There are also proposals to add coroutines to WASM so Asyncify would not be necessary. My point is...unclear. I guess it's that the pros and cons are likely to evolve, yeah, like that isn't always the case.
I'm gratified that having a framework to experiment with SQLite browser storage techniques is useful, even if everyone hates the license. 😝

quolpr commented 2 years ago

Your timings for absurd-sql and wa-sqlite are identical to the millisecond, which seems unlikely. I suspect a copy-paste error in the post.

You are right! I corrected the posts. Thanks!

I think this particular benchmark is bound by storage speed. Most database usage is probably bound by storage speed so if you only have one benchmark then that's not a bad one, but I think you might see more performance differences between Asyncify and non-Asyncify code with queries that are computation bound. Even if these measurements are accurate and fair, absurd-sql might still have a substantial performance edge on a different workload.

Yeah, agree. But for now, I also made some heavier tests — and perfomance of wa-sqlite VFS is still good. Also, I was not able to use absurd-sql with 64kb block size, while 64k block size with wa-sqlite make some performance boost too.

Performance is not the only calling card of absurd-sql. It is also notable that it retains the widely-adopted SQL.js API, and keeps a WASM size half that of an Asyncify-ed SQLite.

You are right, but regarding size — brotli size difference(I took Oz version of sql.js) is around 60kb.

But of course it will take more time to unbrotli and load the wa-sqltie, but I think numbers will be roughly the same 🤔.

I used this commands to test:

There are some proposals to make COOP/COEP less restrictive while still allowing SharedArrayBuffer

Yeah, I hope it will happen. Cause right now restrictions are too strict.

I'm gratified that having a framework to experiment with SQLite browser storage techniques is useful, even if everyone hates the license. 😝

Yeah! You made an excellent job. It's so easy to write any backend you want just in JS 🙂

rhashimoto commented 2 years ago

60 KB in bandwidth and 15% in insert time seem like pretty reasonable costs for Asyncify, assuming that holds up. I wouldn't go so far as to claim that is the "same performance", though, unless you collect other benchmarks where the edge goes the opposite way or indicate that this one is an uncommon outlier.

sql-js / sql.js