toeverything / OctoBase

🐙 OctoBase is the open-source database behind AFFiNE, local-first, yet collaborative. A light-weight, scalable, data engine written in Rust.
https://octobase.dev
GNU Affero General Public License v3.0
1.33k stars 83 forks source link

impl BlobStorage for BucketStorage #464

Closed thorseraq closed 1 year ago

thorseraq commented 1 year ago
Xuanwo commented 1 year ago

Let me take a look over this issue!

Xuanwo commented 1 year ago

For the very first look over BlobStorage:

https://github.com/toeverything/OctoBase/blob/c94cf1084f702baffefc87b18f532597ae97b8f4/libs/jwst/src/types/blob.rs#L15-L37

I have the following questions:

  1. What's the params will carry in get_blob?
  2. Most storage services requires the total content length before put_blob. Otherwise, we will need to buffer data in memory or use API like MultipartUploads. Is it possible to know the length before calling put_blob?
  3. I'm not sure if it's a good idea put the workspace concept in BlobStorage. Most storage services requires list all files under a folder/workspace and than delete it which could be slow.
  4. So does get_blobs_size, this will need to list all files in workspace to calculate the total size.

For question 3 & 4, maybe we can store those data in an extra db instead?

thorseraq commented 1 year ago
  1. can reference here, it's a image query optimization https://github.com/toeverything/OctoBase/blob/f2b25ceb513eae9db40f644668b4b06e46412ecc/libs/jwst-storage/src/storage/blobs/mod.rs#L377
  2. How about using get_hash util to pack bodystream into vec<u8> and get length inside put_blob https://github.com/toeverything/OctoBase/blob/f2b25ceb513eae9db40f644668b4b06e46412ecc/libs/jwst-storage/src/storage/blobs/local_db.rs#L166
  3. There seems not to be other workaround here if we cannot delete folder... How about implementing it first, as we prefer to use cloudflare R2 as s3 storage, I'll look over if its api support this later.
  4. Just query the db, I think we can leave this method blank~ https://github.com/toeverything/OctoBase/blob/f2b25ceb513eae9db40f644668b4b06e46412ecc/libs/jwst-storage/src/storage/blobs/local_db.rs#L71

Yes! Hey please note that there's a upstream PR: https://github.com/toeverything/OctoBase/pull/466/files#diff-8ac295f21eb23dcfea53d111ba03970a4323656dc9483febd0c5eec4b081b435R7, For current db implementation, can reference:

#[sea_orm(table_name = "s3_blobs")]
pub struct Model {
    #[sea_orm(primary_key, auto_increment = false)]
    pub workspace: String,
    #[sea_orm(primary_key, auto_increment = false)]
    pub hash: String,
    pub length: i64,
    pub timestamp: DateTimeWithTimeZone,
    #[sea_orm(primary_key, auto_increment = false)]
    pub params: String,
}
Xuanwo commented 1 year ago
  1. can reference here, it's a image query optimization

For s3 storage, we can simply ignore this arg?

  1. How about using get_hash util to pack bodystream into vec<u8> and get length inside put_blob

That's Ok, we can make it work first.

  1. use cloudflare R2 as s3 storage, I'll look over if its api support this later.

Both R2 and S3 doesn't support it. OpenDAL implements this by list all files and calling delete multiple objects. We can implement in this way and discuss later.

  1. Just query the db, I think we can leave this method blank~

Good!

thorseraq commented 1 year ago

For s3 storage, we can simply ignore this arg?

workspace/hash is enough for s3 querying, agree to ignore it 😈

Both R2 and S3 doesn't support it. OpenDAL implements this by list all files and calling delete multiple objects. We can implement in this way and discuss later.

OK~ LGTM

Xuanwo commented 1 year ago

I will start this issue after https://github.com/toeverything/OctoBase/pull/466

thorseraq commented 1 year ago

Hey, there are a few changes since our last discussion:

  1. S3Storage was renamed to BucketStorage, this may represent more general key-value like storage medium

  2. A new BucketBlobStorage trait for Bucket like storage, currently this trait is designed for s3 storage and contains only necessary blob operations, how does this seem to you? Please feel free to make changes

pub trait BucketBlobStorage<E = JwstError> {
    async fn get_blob(
        &self,
        workspace: Option<String>,
        id: String,
        params: Option<HashMap<String, String>>,
    ) -> JwstResult<Vec<u8>, E>;
    async fn put_blob(
        &self,
        workspace: Option<String>,
        hash: String,
        blob: Vec<u8>,
    ) -> JwstResult<String, E>;
    async fn delete_blob(&self, workspace: Option<String>, id: String) -> JwstResult<bool, E>;
    async fn delete_workspace(&self, workspace_id: String) -> JwstResult<(), E>;
}