Support changing the encryption key in libsql-server

psarna commented 5 months ago

With encryption enabled, we would like to be able to change the encryption key. This operation is supported in SQLite3MultipleCiphers and called rekey. Important things to take into account:

Rekeying is not supported for a database in WAL mode, as mentioned here in a slightly scary message: https://github.com/utelle/SQLite3MultipleCiphers/blob/main/CHANGELOG.md#changed-34
Rekeying operation in SQLite3MultipleCiphers is based on setting up a new encryption key and running vacuum.
Vacuum is an operation that rewrites the database to a new file, purging unused free pages along the way.

Taking all the above into account, a rekeying operation can be safely implemented as follows, with incurred downtime:

Stop serving user requests to stop reads and writes from happening
Checkpoint the database in TRUNCATE mode - that leaves us with no WAL file, no WAL frames
Switch the journaling mode to the regular journal, by calling PRAGMA journal_mode=delete
Now that we're not in WAL mode, it's safe to call PRAGMA rekey=new_key (or in our case, the C API equivalent sqlite3_rekey). That will perform a VACUUM while also encrypting all data with new_key
Switch the journaling mode back to WAL with PRAGMA journal_mode=wal
Resume serving user requests, all data is now encrypted with new_key

An alternative solution would be to try and bring back support for rekeying in WAL mode, but there are 3 downsides to consider:

Rekeying was disabled after reportedly causing db corruption in WAL mode, probably because VACUUM in WAL mode has a different implementation, and we could easily end up with some pages being encrypted under the old key, and some under the new key. That's a red flag.
We'd have to invest quite some time in modifying SQLite3MultipleCiphers core code, as opposed to reusing interfaces that are already battle-tested.
This whole ordeal is going to include VACUUM-ing the database and rewriting the WAL pages, so it will all happen under a huge write lock. That's not far from downtime from users' perspective anyway. And the space amplification has the same order of magnitude, because the dominant operation is VACUUM.

psarna commented 5 months ago

FYI: I successfully rekeyed a libsql-server database offline with the steps above, to verify that the idea makes sense

athoscouto commented 5 months ago

@psarna the downtime should be fine, IMO. What I'm concerned about is the operation failing while it is being executed, leaving the database in a broken state. Particularly, should we be worried about running out of disk space while we execute those steps?

psarna commented 5 months ago

This operation has the same space amplification as VACUUM, which we already do periodically, so it's no worse. What we can definitely add is a validation, which refuses to rekey if it sees there's not enough disk space

LucioFranco commented 5 months ago

I think this overall sounds good, I don't think we should go the second path at all right now. As long as we ensure that we can restore the DB from its original if it fails (from the point that we started the rekey). The downtime is totally fine as long as we communicate that but we already kinda have downtime with upgrades anyways so no harm there.

ananondev commented 3 months ago

Hi all. Thanks for the work on this library.

Any word as to progress on this issue?

This issue references libsql-server but I ran into a similar concern to OP in a potentially more common use-case: client-side database re-keying. The issue is that users who are storing e.g. some data on their own disk or in browser may accidentally leak their password or just have a security practice to rotate passwords periodically. But without support to rekey the database, this doesn't seem possible. For example, an application developer using libsql might generate some secret key K then store K encrypted with the user's password p (call it H(p)) and then use K as the key to the db, changing p whenever needed, consequently re-encrypting and storing H(p) whenever p changes - but then if K ever leaks, the user is back to square one without the ability to do anything in the way of protecting their data except for just deleting their db.

I attempted to implement support for rekeying in various ways, firstly by way of OP's suggestion:

`PRAGMA rekey`

PRAGMA wal_checkpoint(TRUNCATE); PRAGMA journal_mode=delete; -- PRAGMA key = '?'; -- though this is not needed, because libsql's set_encryption_key calls sqlite3_key(..) PRAGMA rekey = '?'; PRAGMA journal_mode=wal;

And while this does cause the database file content to change, the file cannot be opened again by libsql for some reason. It's not like libsql does anything to the password it hands off to sqlite3_key, and sqlite3_key already does handle converting a plaintext password into an aes key by kdf, so it's not like calling PRAGMA rekey = '?'; would consider the password different than the one that sqlite3_key via PRAGMA key = '?' gets -- unless I'm missing something here about the cipher settings that may be known somehow to sqlite3_key under the hood but not to the PRAGMA interfaces that sqlite exposes for (re)key. In other words, is libsql bypassing SQLCipher's settings somehow, somehow related to the non-existence of various PRAGMA settings?

I then began looking into the libsql code a little more intensively. I noticed, oddly, there is already a libsql function reset_encryption_key which calls sqlite3_rekey. But I see nothing in any tursodatabase repo calls reset_encryption_key, which makes me wonder if the team may have actually tried to support rekeying in the past but ran into some roadblock and iceboxed it?

What confused me is why PRAGMA rekey didn't work when it merely calls sqlite3_rekey.

`PRAGMA key` instead of `encryptionKey`

So I tried omitting the encryptionKey setting (which causes libsql's set_encryption_key to be called), and then manually executing a PRAGMA key = '?'. The thought was that maybe a rekey over the top of this would actually work if we try to avoid whatever libsql is doing and go straight to sqlcipher. And this does seem to work, temporarily, but at some strange number of sql statements later, sqlite suddenly claims that the file is not a database - even without any PRAGMA rekey!

This leads me to think that libsql is doing something internally like attempting to maybe save or re-open the database at some point with some configuration that it thinks it had from the prior db connection's encryption_config - which would be nil of encryptionKey is nil even though a PRAGMA key was called.

This would be consistent with why the other attempt of calling rekey didn't work - the rekey caused the underlying data to change but libsql's rust tries to reopen or save or something upon the db with the old encryption_config.

This led me to consider modifying libsql's Rust to try to update the cached encryption_config via some new method to access reset_encryption_key, but to be honest, I found the libsql code a little difficult to approach, though it might just be my lack of full mastery of Rust. Either way, encryption_config would need to be updated on the db handle, provided that's the real issue.

`ATTACH DATABASE` ... `KEY`

I tried to work around this by even using ATTACH DATABASE since it is supposed to support supplying a ... KEY='?' argument, then copying the original db into the attached DB. And this, again, seems to work, except that KEY is fully ignored by libsql, causing the attached db to end up in plaintext. This is consistent with the lack of KEY in ATTACH tests in libsql even though it's in the sqlcipher attach docs.

Via replication

I have also looked into using libsql's fancy db to db replication, e.g. via syncUrl, but it all seems to assume that syncUrl is a remote URL.

I've also tried explicitly setting the cipher etc

SELECT sqlite3mc_config('default:cipher', 'aes256cbc');, toggling HMAC, et al but it's not clear if it's even doing anything.

In any case, I have noticed various other issues (e.g. [1], [2]) talking about how certain PRAGMA settings are just invisibly not supported, and honestly I think this ought to be disclosed a little more clearly to people. The real differences, and what is actually supported, should be clearly called out on the webpages, rather than saying "it's really sqlite" over and over. Sorry to take that stance but these sorts of discrepancies really affect people and I think it's only fair when asking for contributions from the public to disclose such things. Without rekey support, I can't really use libsql... even though I'd really like to.

LucioFranco commented 3 months ago

@ananondev this is a really nice write up! Thanks for digging in and finding out some rough edges, for sure the rust code we have is not the best but we are working on making it a bit more approachable. We plan on revisiting some of this work soon over the next few months. We will update here once we dig in some more into what you said and the code it self.

ananondev commented 1 month ago

Why was this closed as "not planned"?

penberg commented 1 month ago

@ananondev Looks like Linear was too eager to auto-close. Reopening.

tursodatabase / libsql