Attach remote database design

Design goal: Allow user to execute read-only queries on a remote kuzu database which can be stored on S3 and accessed via https. We are not going to allow users to attach multiple kuzu instances at a time, and users are not allowed to query the original local kuzu database after attaching. The reason why we set this restriction is to keep the feature simple and straightforward.

Prerequisite

The httpfs extension should be installed and loaded beforehand using:

INSTALL httpfs;
LOAD EXTENSION httpfs;

Feature:

1. User can attach/detach a remote database using attach/detach statement as follows:

Attach:

ATTACH 'S3://kuzu_data/db' as uw (type kuzu)

Detach:

DETACH uw

a. After attaching a remote database, users can no longer query/access the original local kuzu database.

b. Users can only query the original local kuzu database after issuing the detach statement.

c. Only one remote database can be attached at a time. meaning:

ATTACH 'S3://kuzu_data/db' AS uw (type kuzu)
ATTACH 'S3://kuzu_data/db1' AS uw1 (type kuzu)

The second attach statement will cause an exception.

d. The Kuzu database will be attached in read-only mode, meaning write queries will be rejected. If the database path is an invalid Kuzu path, an exception will also be raised.

Implementation

Attach: After the user issues the attach statement, we reconstruct the fields in the database class using the provided remote file path.
Detach: After the user issues the detach statement, we reconstruct the fields in the database class using the original file path (the one provided on startup).

2. Querying

We allow users to attach a remote kuzu database and query it the same way as they would a local kuzu database.

Example:

// Assume person is a valid table in uw database
ATTACH 'S3://kuzu_data/db' AS uw (type kuzu)
MATCH (p:person) RETURN p.*

[!NOTE] Write queries are not allowed.

3. Benchmark Result

I am calling the HTTPFileSystem::read() and LocalFileSystem() read APIs to peform a sequential read of ldbc100 comment.csv file (21GB) to evaluate the read peformance under different filesystems.(results measured in seconds) Block size means: the number of bytes to read for each FileSystem::read() call.	Block size	http file system read time (sec)
4KB	145	5
64KB	86	3
256KB	87	3
512KB	105	3
1MB	86	3
4MB	32	3

Downloading/Caching Optimization

Let us also have a version where users can choose to download the entire database from s3 with a single s3 call to a local temporary directory, so we cache the remote database locally. I think we should keep a map of which remote directory has been mapped to which local directory, so that if a user attaches and detaches multiple times during a session, we don't keep downloading these remote databases over and over again.

Manually deleting cached database: Users may want to refresh a cached database from time to time. We can provide a command to "clear the cached remote Kuzu db". That is we can have a function similar to clearning Postgres/DuckDB schema cache CALL clear_cached_remote_kuzudb(S3://kuzu_data/db) RETURN *; This function should be called after the user detaches from S3://kuzu_data/db. Otherwise it should fail. If the user has detached from this database, then this function can remove the local directory to which we cached the database.

When to delete the locally cached database automatically: The rule should be that if any behavior of the system leads to the extension to be uninstalled then we should delete the local temporary directory. For example, if crashing means that the remote Kuzu extension (or https extension) is uninstalled, then we must have deleted all data related to the extension during crashing or during recovery.

Warning/feedback when user attaches to a cached database: When user calls ATTACH 'S3://kuzu_data/db' AS uw, if this database has been cached, it would be good to give some feedback to the user: "Attaching to the cached version of the database S3://kuzu_data/db in /tmp/kuzu/attached-kuzu-dbs/uw/. So the user knows that they have attached to a local copy and can manually remove the cached version if they have to.

We should write tests to test these manual and automatic cache removals and write documentations for these cases.

I am adding this comment here to brainstorm with people who wanted a feature to be able to query databases that is on s3.

In particular I want to understand 2 questions:

Download Locally Option: Are you in a position to download the Kuzu database directory on s3 to the local machine and then just query the database from your local machine as many times as possible? If not, why not? What limits this option, e.g., does your environment have a limitation about using a local disk?

The reason I'm questioning this is that in general using Kuzu to query s3 databases as if those databases are local sets Kuzu up for very poor performance. I'm sure there are many optimizations to do but we don't do them yet. For example there are chances that we'll make many s3 range calls to scan many 4KB pages (though s3 is likely to optimizing these calls). Therefore if you're going to query the s3 database many times, it may just pay off to download the database files to a temporary local machine, query it many times from there and then throw the directory away. In fact, we can consider implementing a Kuzu extension that does this on users' behalf, so from your perspective you still attach to a remote database in a read-only mode, and then query it. At the background while you do the attachment, we can just download the database directory s3 into a temporary directory. So nothing changes on user's behalf but if you'll issue many queries, it may just pay off.

As a follow up question to 1 and my note above, if Download Locally Option is an option for your setting, would you prefer it over directly querying s3?

It would be good to get feedback on these. We have a s3-and-kuzu channel in Discord where people can comment.

kuzudb / kuzu

Attach remote kuzu database design #3465