stackabletech / druid-operator

An Operator for Apache Druid for Stackable Data Platform
Other
10 stars 0 forks source link

Improved extension handling #490

Closed fhennig closed 6 months ago

fhennig commented 11 months ago

As a user I want to know when which extensions get loaded by default and also how I can configure additional extensions to load. Loading more extensions should be easy.

Improve loading extensions

Currently you have to use a configOverride to load a new extension. This is error prone, because you have to not only specify your additional extension, but also all the other extensions that are configured by the operator. This could either be a simple list of strings to add, or a map with the option to have configuration settings for the extensions in there.

CRD Change

We propose the following change to the DruidCluster CRD:

---
kind: DruidCluster
spec:
  clusterConfig:
    additionalExtensions:
    - druid-avro-extensions
    - druid-bloom-filter

Only a list of extensions to load can be specified to keep the CRD as simple and clean as possible.

Many extensions that require additional configuration are already being loaded conditionally based on other parts of the CRD, e.g. authentication, deep storage or metadata storage. If in some cases this is not enough (e.g. because we haven't added an option for deep storage yet) the extension can still be loaded by adding it to the additionalExtensions list with the extension-specific configuration being added using configOverrides.

Document which extensions are loaded by default

We load some extensions conditionally, such as the google or HDFS extensions or the pac4j extension for OIDC support. We should document when which extension gets loaded. Also the list of the default extensions that we load. Maybe also add a link to the Druid documentation where possible extensions are listed.

### Tasks
- [ ] <Implementation PR>
- [ ] https://github.com/stackabletech/druid-operator/issues/487
sbernauer commented 7 months ago

I would like to point out again that IMHO we should allow extension-specific properties. Just collecting my messages condensed below:

Looking at e.g. github.com/orgs/stackabletech/discussions/15 I think we might want to add dedicated extensions in the future - in this case for google cloud extensions with bucket name and credentials secret.

Have not spend many thoughts on this, my only requirement is that we can add properties to extensions. Does not need to be right now, just give us the chance to add that non-breaking. For avro specifically I don't see a reason to not always enable it. CRD could look something like

kind: DruidCluster
spec:
  clusterConfig:
    additionalExtensions:
      druid-google-extensions: # optional
        enabled: true # defaults to false
        credentialsSecret: my-google-credentials # mandatory, gets put into GOOGLE_APPLICATION_CREDENTIALS
        bucket: my-bucket # mandatory
        prefix: / # mandatory, defaults to /
        maxListingLength: 1024 # optional
      druid-avro-extensions: # optional
        enabled: true # defaults to false
      # All extensions that we don't enumerate explicitly can be loaded here 
      customExtensions:
        - druid-avro-extensions
        - druid-bloom-filter
        - gce-extensions

As a first step we could simply add all extensions that don't need any configuration options (such as avro)

labrenbe commented 6 months ago

Moving to voting.

razvan commented 6 months ago

I didn't comment while it was on RfC, so feel free to ignore this.

I would much prefer an explicit solution that doesn't differentiate between additionalExtensions and customExtensions but provides an orthogonal interface (i.e. a list) as most users don't care about the Rust implementation, and this complicates the CRD unnecessarily for them.

sbernauer commented 6 months ago

+1 on that the implementation should not leak into the API :+1:

The reason we can not simply have a list of strings is that some extensions require additional configuration (such as GCP credentials).

The reason we need customExtensions in addition to additionalExtensions is that we can/don't want to enumerate (and officially support) all 50000 extensions Druid has. So the idea is to have a generic fallback which does not take any configuration - so it can just be a list of strings. As an alternative we could model it the same way to be consistent, but that looks a bit bloated

    additionalExtensions:
      # All extensions that we don't enumerate explicitly can be loaded here 
      customExtensions:
        random-extension-1: # optional
          enabled: true # defaults to false
        random-extension-2: # optional
          enabled: true # defaults to false
sbernauer commented 6 months ago

Anyway @labrenbe it looks like the CRD change in the Issues description was not updated. I guess this is not intentional?

Techassi commented 6 months ago

Maybe a better alternative would be to use Either in combination with the following untagged (de)serialization: https://docs.rs/either/latest/either/serde_untagged/index.html. With that approach, we could have one list only, and support either a plain list of strings or a list of objects (mixing both might be very tricky or even impossible - using Either<Vec<String>, HashMap<String, Option<ExtensionConfig>> might work (I just made this up on the spot)).

~We would only need to implement support for JsonSchema in the upstream schemars crate to enable proper CRD generation for the Either type.~ Nvm, that is already supported.

razvan commented 6 months ago

I would like a hybrid approach to extension configuration following a bit Trino's catalog model.

The Druid CRD should only allow a list of extensions that require no configuration.

Extensions with configuration should be their own CRDs that are referenced from the Druid cluster with a extensionLabelSelector or something like that.

This would have multiple advantages IMO:

labrenbe commented 6 months ago

Anyway @labrenbe it looks like the CRD change in the Issues description was not updated. I guess this is not intentional?

Sorry for the confusion, I left the original CRD change proposal in the Issue description and added the updated proposal below that.

sbernauer commented 6 months ago

Sorry for the confusion, I left the original CRD change proposal in the Issue description and added the updated proposal below that.

Do you plan on updating it, or am I allowed to modify it?

labrenbe commented 6 months ago

Modified the issue to reflect the updated proposal

fhennig commented 6 months ago

I am not a big fan of this design either, but I don't have a concrete reason to veto it. I just think it's overly convoluted, I believe the main reason people wanna enable extensions is to add support for different data formats, and for that no config is needed, and we could just have a list of strings and be done with it. Extensions are not at all as important as catalogs in trino, and I would like to keep the config simple, even if it maybe doesn't support everything (yet). But like I said, I'm not vetoing the current design, just wanted to share my thoughts too.

labrenbe commented 6 months ago

I've given it some thought and agree that it would be best to keep this a list of strings with no configuration and will update the issue with the justification.

sbernauer commented 6 months ago

It's fine for me to just implement a Vec<String> for know, but I would really want to move it below a key, e.g spec.clusterConfig.additionalExtensions.customExtensions, so that we can non-breaking later add extensions that require configurations (such as the google one). The config can not simply be put in configOverrides, as e.g. Secrets need to be propagates to env vars and referenced or secrets mounted somewhere or a certificate needs to be added to the truststore etc.

fhennig commented 6 months ago

I think for the google extension (and similar features like this) I'd rather just have a clusterConfig.gcp config key. I think that an extension is neede for that is then more an implementation detail. same as how we include the hdfs extension when HDFS deep storage is configured.

labrenbe commented 6 months ago

Moving to voting.

labrenbe commented 6 months ago

The decision has been accepted.

lfrancke commented 6 months ago

Do we have documentation for this? If so: Could you add a link to the generated nightly docs please?

fhennig commented 6 months ago

nightly docs here: https://docs.stackable.tech/home/nightly/druid/usage-guide/extensions

DiscordJim commented 6 months ago

For any developers attempting to do this in the current version (as of writing, 24.3) it is described in https://github.com/orgs/stackabletech/discussions/15.

In essence, you have to specify the list of default extensions (which can be seen in the router GUI on the main page) and then additionally whatever additional extensions you would like to use.