Storage configuration for Azure or S3

mbaykara commented 3 months ago

Is your feature request related to a problem? Please describe.

When deploying Quickwit via the Helm chart with a storage backend such as AWS S3 or Azure, the schema for configuring default_index_root_uri and metastore_uri is unclear or undocumented, which is frustrating, specifically for Azure.

Describe the solution you'd like

The following points should be improved, IMHO:

When configuring the storage backend for Azure, the schemas for default_index_root_uri and metastore_uri should be clearer and better documented. Currently, the assumed format is:

default_index_root_uri: <s3|azure>://<bucket|container>/<path>
metastore_uri: <s3|azure>://<bucket|container>/<path>

This format might not be intuitive for users. An improved version could be:

default_index_root_uri: <s3|azure>://<storage_account>/<bucket|container>
metastore_uri: <s3|azure>://<storage_account>/<bucket|container>

Alternatively, for enhanced clarity and user experience, users could provide configuration as follows:

config:
  # Storage configuration.
  storage:
    azure:
      account: "mystorageaccount"
      access_key: "somesecretkeys"
      container: "quickwit-container"
  metastore_uri: azure://{account}/{container} # Auto-generated based on configuration info from storage
  default_index_root_uri: azure://{account}/{container} # Auto-generated

This approach ensures better clarity and ease of understanding for users configuring the Azure storage backend.
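As a rough illustration of the auto-generation proposed above, the following sketch derives both URIs from the storage block. The function name `derive_azure_uris` is hypothetical, and the `azure://{account}/{container}` layout is taken from this proposal only; it is an assumption, not Quickwit's actual behavior.

```python
def derive_azure_uris(storage: dict) -> dict:
    """Build both URIs from the proposed `storage.azure` block (hypothetical layout)."""
    azure = storage["azure"]
    # Assumed URI shape from the proposal: azure://{account}/{container}
    base = f"azure://{azure['account']}/{azure['container']}"
    return {"metastore_uri": base, "default_index_root_uri": base}


# Values copied from the example configuration in this proposal.
config = {
    "storage": {
        "azure": {
            "account": "mystorageaccount",
            "access_key": "somesecretkeys",
            "container": "quickwit-container",
        }
    }
}

uris = derive_azure_uris(config["storage"])
print(uris["metastore_uri"])
print(uris["default_index_root_uri"])
```

Note that the access key is deliberately left out of the generated URIs, so credentials stay in the storage block only.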

Creating indexes manually

Currently, users have to create indexes manually via curl on the storage, specifying the index name and schema. This is obviously an anti-pattern for automation. The initial index should be created during deployment, with an eventual configuration parameter.

guilload commented 3 months ago

I don't think your proposal is feasible because storage configurations are not just for defining metastore and index root URIs, so adding a bucket or container key there does not always make sense. In addition, we can't generate the metastore URI as you suggested because not all users want to use a file-backed metastore.

However, I agree with you. Our documentation is not crystal clear and lacks specific examples for Azure, which would make the experience less confusing. We will improve that.

mbaykara commented 3 months ago

I understand the issue with defining URIs, but could you please explain why defining a bucket or container doesn't seem sensible? Could you provide a use case where this is commonly found? It seems to me that configuring S3-like storage without specifying a bucket or container makes less sense than configuring it with one. Your insights would be greatly appreciated.

guilload commented 3 months ago

It does not seem sensible to define a bucket or container in Quickwit storage configurations at the moment because they are currently designed to allow configuring access to any bucket or container for a given region or endpoint.

With the current design, you can define a single storage config with some credentials and store indexes in multiple buckets using different index URIs:

# node config
storage:
    s3:
      access_key_id: ***
      secret_access_key: ***

# indexing config foo
index_uri: s3://<bucket-foo>/indexes

# indexing config bar
index_uri: s3://<bucket-bar>/indexes # maybe you're using a different storage class in this bucket

I think we want to support per-bucket storage configurations in the future using another level of nesting:

storage:
    s3: # default
      access_key_id: ***
      secret_access_key: ***

    s3.bucket-foo:
        access_key_id: ***
        secret_access_key: ***

    s3.bucket-bar:
        access_key_id: ***
        secret_access_key: ***
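To make the lookup order concrete, here is a minimal sketch of how such a nested configuration might be resolved: take the bucket from the index URI, look for a `s3.<bucket>` entry, and fall back to the default `s3` entry. The function `resolve_storage_config`, the dictionary layout, and the placeholder credentials are all hypothetical; none of this is implemented in Quickwit.

```python
from urllib.parse import urlsplit

# Hypothetical in-memory form of the nested storage config sketched above.
STORAGE = {
    "s3": {"access_key_id": "default-key", "secret_access_key": "default-secret"},
    "s3.bucket-foo": {"access_key_id": "foo-key", "secret_access_key": "foo-secret"},
}


def resolve_storage_config(index_uri: str, storage: dict) -> dict:
    """Return the per-bucket config if one exists, else the default for the scheme."""
    parts = urlsplit(index_uri)
    scheme, bucket = parts.scheme, parts.netloc
    # Per-bucket entry first, then the scheme-wide default.
    return storage.get(f"{scheme}.{bucket}", storage[scheme])


foo = resolve_storage_config("s3://bucket-foo/indexes", STORAGE)
baz = resolve_storage_config("s3://bucket-baz/indexes", STORAGE)
print(foo["access_key_id"], baz["access_key_id"])
```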

fulmicoton commented 3 months ago

s3.bucket-foo

Using bucket here has its issues too: you could have several accounts with the same bucket name.

storage:
    s3.account_id: # default
      access_key_id: ***
      secret_access_key: ***

And the uri

      s3://account_id@bucket_id/path

Some customers have asked us to make it possible for tenants to bring their own S3 bucket. If we do some work on this, it would probably mean putting this information in the metastore.
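For illustration, the URI form suggested above could be split into its parts as follows. This is only a sketch of the idea under discussion: `parse_s3_uri` is a hypothetical helper, and the `account_id@bucket_id` syntax is a suggestion from this thread, not something Quickwit is known to accept.

```python
from urllib.parse import urlsplit


def parse_s3_uri(uri: str):
    """Split s3://[account_id@]bucket/path into (account_id, bucket, path)."""
    parts = urlsplit(uri)
    if parts.scheme != "s3":
        raise ValueError(f"not an s3 URI: {uri}")
    # Everything before the last '@' is the account; the rest is the bucket.
    account_id, _, bucket = parts.netloc.rpartition("@")
    return (account_id or None, bucket, parts.path.lstrip("/"))


print(parse_s3_uri("s3://account_id@bucket_id/path"))
print(parse_s3_uri("s3://bucket_id/path"))
```

With the account made explicit, two tenants could use the same bucket name without their storage configurations colliding.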

fulmicoton commented 3 months ago

Currently, users have to create indexes manually via curl on the storage, specifying the index name and schema. This is obviously an anti-pattern for automation. The initial index should be created during deployment, with an eventual configuration parameter.

@mbaykara I did not understand what you meant. What do you mean by needing to "curl on the storage"?

mbaykara commented 3 months ago

configuration parameter.

@fulmicoton, no. What I mean is that when you deploy with Helm, you can define indexes as follows:

  indexes:
    version: 0.7
    index_id: fluentbit-k8s-logs
    doc_mapping:
      mode: dynamic
      field_mappings:
        - name: timestamp
        ...
        ... #omitted

The index with this index_id currently has to be created via cURL. If it is not created beforehand, the indexer throws errors such as "index_id with fluentbit-k8s-logs not found". Then you have to create it using:

curl -XPOST http://quickwit-indexer:7280/api/v1/indexes -H "Content-Type: application/yaml" --data-binary @fluentbit-k8s-logs.yml

Since the indexer service consumes the configuration above, it should check whether an index with the given index_id exists before indexing data. If the index does not exist, it should be created. Alternatively, the indexer could always try to create the index, ignore the failure if it already exists, and continue with the remaining steps.
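The "create if missing" behavior requested above can be sketched as a small idempotent helper. The base URL and content type come from the curl command in this comment; everything else is an assumption made for illustration: the helper names are hypothetical, and the sketch assumes that a GET on `<base URL>/<index_id>` answers 404 when the index does not exist.

```python
import urllib.error
import urllib.request

# Endpoint taken from the curl command above.
BASE_URL = "http://quickwit-indexer:7280/api/v1/indexes"


def index_exists(index_id: str) -> bool:
    """Assumption: GET <BASE_URL>/<index_id> returns 404 for a missing index."""
    try:
        with urllib.request.urlopen(f"{BASE_URL}/{index_id}"):
            return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


def create_index(index_config_yaml: bytes) -> None:
    """POST the index config, mirroring the curl command above."""
    request = urllib.request.Request(
        BASE_URL,
        data=index_config_yaml,
        headers={"Content-Type": "application/yaml"},
        method="POST",
    )
    with urllib.request.urlopen(request):
        pass


def ensure_index(index_id, index_config_yaml, exists=index_exists, create=create_index) -> bool:
    """Create the index only when it is missing; return True if it was created."""
    if exists(index_id):
        return False
    create(index_config_yaml)
    return True
```

A deployment hook could call `ensure_index("fluentbit-k8s-logs", config_bytes)` before the indexer starts, where `config_bytes` holds the contents of fluentbit-k8s-logs.yml. The `exists` and `create` parameters are injectable so the logic can be exercised without a running cluster.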

quickwit-oss / quickwit

Storage configuration for Azure or S3 #4797
