Open bluestreak01 opened 2 years ago
I am curious how exactly this would work. I understand the high-level logic, but I don't see how this relates to QDB or what QDB would do to assist, i.e. how would QDB convert data to cold storage to save cost? Isn't that something that depends on the unique setup of each server and not on QDB?
Cold storage, for all intents and purposes, is a device that does not support any useful form of random access. Our intent here is to implement a virtual area backed by local storage that would intelligently swap data to and from cold storage on access. Ultimately there will be addressable partitions in a table that incur a time penalty on access if the data happens to be remote.
Okay, interesting. Thanks, Vlad. I am starting to use QDB in a similar fashion, which is why I was wondering what a natural QDB solution would look like.
Looking forward to this feature. For use cases where you need to store large amounts of data, this is the most important feature one needs. One additional comment on data swapping: there will have to be some staging area with expiry and size limits for swapped data, otherwise you would blow up the local storage very quickly if you start querying too much data living in cold storage. There should also be sanity checks so that if someone asks for data in cold storage that is too large (won't fit into the staging area), such a query is rejected or handled in a similar way.
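To make the staging-area idea above concrete, here is a minimal sketch in Python. It is not QuestDB code: the `StagingArea` class, the `QueryTooLarge` exception, and the least-recently-used eviction policy are all assumptions made for illustration. It only shows the three behaviours mentioned in the comment: a size limit, expiry, and rejection of requests that could never fit.

```python
import time
from collections import OrderedDict


class QueryTooLarge(Exception):
    """Raised when a partition cannot fit in the staging area at all."""


class StagingArea:
    """Hypothetical local staging area for partitions fetched from cold storage."""

    def __init__(self, max_bytes, ttl_seconds, clock=time.monotonic):
        self.max_bytes = max_bytes
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.used_bytes = 0
        # partition name -> (size in bytes, last access time), least recent first
        self._entries = OrderedDict()

    def _evict(self, name):
        size, _ = self._entries.pop(name)
        self.used_bytes -= size

    def _expire(self):
        # Drop every partition that has not been touched within the TTL.
        now = self.clock()
        stale = [n for n, (_, t) in self._entries.items() if now - t > self.ttl_seconds]
        for name in stale:
            self._evict(name)

    def access(self, name, size_bytes):
        """Return True on a local hit, False if the partition had to be staged."""
        self._expire()
        if name in self._entries:
            size, _ = self._entries.pop(name)
            self._entries[name] = (size, self.clock())  # move to most-recent end
            return True
        if size_bytes > self.max_bytes:
            # Sanity check: this request can never fit, so reject it up front.
            raise QueryTooLarge(f"{name} needs {size_bytes} bytes, limit is {self.max_bytes}")
        # Evict least recently used partitions until the new one fits.
        while self.used_bytes + size_bytes > self.max_bytes:
            self._evict(next(iter(self._entries)))
        self._entries[name] = (size_bytes, self.clock())
        self.used_bytes += size_bytes
        return False
```

A real implementation would also have to download and delete the files themselves and account for several partitions being needed by one query at the same time; the sketch tracks sizes only.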
Ideal would be to store cold data on some external service like S3. Sufficient would be defining a separate file path on the server. Is this in the scope of this feature?
Yes, cold partitions will be stored on S3 or any other BLOB storage supported by OpenDAL.
@puzpuzpuz May I ask where the development of integrations with OpenDAL happens?
It's a part of QuestDB Enterprise.
Good to know. I'm maintaining the OpenDAL Java binding. Recently, we implemented Java IO's abstractions to support streaming read/write - https://github.com/apache/opendal/pull/4626
I hope the OpenDAL integration is a basic, non-competing part of your Enterprise edition so that we can deal with the challenges together upstream.
I remember QuestDB members reported issues on the distribution, and I suppose there are still other features missing for a complete Java SDK. Looking forward to your input :D
Hi @tisonkun,
We're using OpenDAL from Rust, so Java bindings aren't relevant for us.
Summary
Chronologically older partitions should go to an object store to scale and be more cost-efficient. At the same time, data in these partitions should remain accessible, although with a performance penalty. There should be both manual and automated methods to "mark" a partition or a group of partitions as cold.
To make the data accessible by other software, the partitions will be stored in Apache Parquet format.
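As a rough illustration of combining manual and automated marking, the sketch below picks day partitions to mark as cold. The function name `partitions_to_mark_cold`, the `keep_hot_days` parameter, and the age-based rule are assumptions for this example, not part of the proposal.

```python
from datetime import date, timedelta


def partitions_to_mark_cold(partition_days, today, keep_hot_days, manual=frozenset()):
    """Return the day partitions to mark as cold, oldest first.

    Automated rule (assumed): anything older than the hot window.
    Manual rule: any existing partition explicitly listed in `manual`.
    """
    cutoff = today - timedelta(days=keep_hot_days)
    automatic = {d for d in partition_days if d < cutoff}
    chosen_manually = set(manual) & set(partition_days)
    return sorted(automatic | chosen_manually)


days = [date(2024, 1, 1), date(2024, 1, 15), date(2024, 1, 30)]
cold = partitions_to_mark_cold(days, today=date(2024, 1, 31), keep_hot_days=7)
```

With a 7-day hot window on 2024-01-31, the two older partitions are selected automatically, while 2024-01-30 is selected only if it is passed in `manual`.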
This is a multi-step feature that includes, but is not limited to, the following steps:
SELECT * FROM parquet('my_file.parquet');
A virtual table function allowing users to read Parquet files without persisting them on disk.
This implementation also depends on:
Intended outcome
A partition is "marked" as cold. The physical location of the data in this partition changes in a read-consistent fashion. Data in the partition remains available, and row counts remain consistent with the data being there. When accessed, data is delivered to a query with a lag consistent with fetching data from cold storage.
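One way to picture the intended outcome is a table whose partition metadata stays local while only the location changes. The sketch below is an assumption-laden illustration, not the actual design: `PartitionMeta`, `Table`, and the copy-on-write snapshot approach are hypothetical names and choices.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PartitionMeta:
    name: str
    row_count: int
    location: str  # "local" or "cold"


class Table:
    """Hypothetical table that swaps partition metadata copy-on-write."""

    def __init__(self, partitions):
        # Readers hold a reference to this immutable tuple; writers replace it wholesale.
        self._partitions = tuple(partitions)

    def snapshot(self):
        return self._partitions

    def mark_cold(self, name):
        # Build a new tuple so snapshots taken earlier are left untouched.
        self._partitions = tuple(
            replace(p, location="cold") if p.name == name else p
            for p in self._partitions
        )

    def row_count(self):
        # Row counts come from metadata, so they do not depend on where data lives.
        return sum(p.row_count for p in self._partitions)
```

A reader that took a snapshot before `mark_cold` keeps seeing the old location, and `row_count()` returns the same total before and after the partition moves.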