Closed acquamarin closed 5 months ago
I am adding this comment here to brainstorm with people who wanted a feature to be able to query databases that is on s3.
In particular I want to understand 2 questions:
The reason I'm questioning this is that in general using Kuzu to query s3 databases as if those databases are local sets Kuzu up for very poor performance. I'm sure there are many optimizations to do but we don't do them yet. For example there are chances that we'll make many s3 range calls to scan many 4KB pages (though s3 is likely to optimizing these calls). Therefore if you're going to query the s3 database many times, it may just pay off to download the database files to a temporary local machine, query it many times from there and then throw the directory away. In fact, we can consider implementing a Kuzu extension that does this on users' behalf, so from your perspective you still attach to a remote database in a read-only mode, and then query it. At the background while you do the attachment, we can just download the database directory s3 into a temporary directory. So nothing changes on user's behalf but if you'll issue many queries, it may just pay off.
It would be good to get feedback on these. We have a s3-and-kuzu channel in Discord where people can comment.
Attach remote database design
Design goal: Allow user to execute read-only queries on a remote kuzu database which can be stored on S3 and accessed via
https
. We are not going to allow users to attach multiple kuzu instances at a time, and users are not allowed to query the original local kuzu database after attaching. The reason why we set this restriction is to keep the feature simple and straightforward.Prerequisite
The
httpfs
extension should be installed and loaded beforehand using:Feature:
1. User can attach/detach a remote database using attach/detach statement as follows:
Attach:
ATTACH 'S3://kuzu_data/db' as uw (type kuzu)
Detach:
DETACH uw
a. After attaching a remote database, users can no longer query/access the original local kuzu database.
b. Users can only query the original local kuzu database after issuing the detach statement.
c. Only one remote database can be attached at a time. meaning:
The second attach statement will cause an exception.
d. The Kuzu database will be attached in read-only mode, meaning write queries will be rejected. If the database path is an invalid Kuzu path, an exception will also be raised.
Implementation
2. Querying
We allow users to attach a remote kuzu database and query it the same way as they would a local kuzu database.
Example:
3. Benchmark Result
Downloading/Caching Optimization
Let us also have a version where users can choose to download the entire database from s3 with a single s3 call to a local temporary directory, so we cache the remote database locally. I think we should keep a map of which remote directory has been mapped to which local directory, so that if a user attaches and detaches multiple times during a session, we don't keep downloading these remote databases over and over again.
Manually deleting cached database: Users may want to refresh a cached database from time to time. We can provide a command to "clear the cached remote Kuzu db". That is we can have a function similar to clearning Postgres/DuckDB schema cache
CALL clear_cached_remote_kuzudb(
S3://kuzu_data/db) RETURN *;
This function should be called after the user detaches fromS3://kuzu_data/db
. Otherwise it should fail. If the user has detached from this database, then this function can remove the local directory to which we cached the database.When to delete the locally cached database automatically: The rule should be that if any behavior of the system leads to the extension to be uninstalled then we should delete the local temporary directory. For example, if crashing means that the remote Kuzu extension (or https extension) is uninstalled, then we must have deleted all data related to the extension during crashing or during recovery.
Warning/feedback when user attaches to a cached database: When user calls
ATTACH 'S3://kuzu_data/db' AS uw
, if this database has been cached, it would be good to give some feedback to the user: "Attaching to the cached version of the database S3://kuzu_data/db in /tmp/kuzu/attached-kuzu-dbs/uw/. So the user knows that they have attached to a local copy and can manually remove the cached version if they have to.We should write tests to test these manual and automatic cache removals and write documentations for these cases.