Closed fhennig closed 6 months ago
I would like to point out again that IMHO we should allow extension-specific properties. Just collecting my messages condensed below:
Looking at e.g. github.com/orgs/stackabletech/discussions/15 I think we might want to add dedicated extensions in the future - in this case for google cloud extensions with bucket name and credentials secret.
Have not spend many thoughts on this, my only requirement is that we can add properties to extensions. Does not need to be right now, just give us the chance to add that non-breaking. For avro specifically I don't see a reason to not always enable it. CRD could look something like
kind: DruidCluster
spec:
clusterConfig:
additionalExtensions:
druid-google-extensions: # optional
enabled: true # defaults to false
credentialsSecret: my-google-credentials # mandatory, gets put into GOOGLE_APPLICATION_CREDENTIALS
bucket: my-bucket # mandatory
prefix: / # mandatory, defaults to /
maxListingLength: 1024 # optional
druid-avro-extensions: # optional
enabled: true # defaults to false
# All extensions that we don't enumerate explicitly can be loaded here
customExtensions:
- druid-avro-extensions
- druid-bloom-filter
- gce-extensions
As a first step we could simply add all extensions that don't need any configuration options (such as avro)
Moving to voting.
I didn't comment while it was on RfC, so feel free to ignore this.
I would much prefer an explicit solution that doesn't differentiate between additionalExtensions
and customExtensions
but provides an orthogonal interface (i.e. a list) as most users don't care about the Rust implementation, and this complicates the CRD unnecessarily for them.
+1 on that the implementation should not leak into the API :+1:
The reason we can not simply have a list of strings is that some extensions require additional configuration (such as GCP credentials).
The reason we need customExtensions
in addition to additionalExtensions
is that we can/don't want to enumerate (and officially support) all 50000 extensions Druid has. So the idea is to have a generic fallback which does not take any configuration - so it can just be a list of strings. As an alternative we could model it the same way to be consistent, but that looks a bit bloated
additionalExtensions:
# All extensions that we don't enumerate explicitly can be loaded here
customExtensions:
random-extension-1: # optional
enabled: true # defaults to false
random-extension-2: # optional
enabled: true # defaults to false
Anyway @labrenbe it looks like the CRD change in the Issues description was not updated. I guess this is not intentional?
Maybe a better alternative would be to use Either
in combination with the following untagged (de)serialization: https://docs.rs/either/latest/either/serde_untagged/index.html. With that approach, we could have one list only, and support either a plain list of strings or a list of objects (mixing both might be very tricky or even impossible - using Either<Vec<String>, HashMap<String, Option<ExtensionConfig>>
might work (I just made this up on the spot)).
~We would only need to implement support for JsonSchema
in the upstream schemars
crate to enable proper CRD generation for the Either
type.~ Nvm, that is already supported.
I would like a hybrid approach to extension configuration following a bit Trino's catalog model.
The Druid CRD should only allow a list of extensions that require no configuration.
Extensions with configuration should be their own CRDs that are referenced from the Druid cluster with a extensionLabelSelector
or something like that.
This would have multiple advantages IMO:
Anyway @labrenbe it looks like the CRD change in the Issues description was not updated. I guess this is not intentional?
Sorry for the confusion, I left the original CRD change proposal in the Issue description and added the updated proposal below that.
Sorry for the confusion, I left the original CRD change proposal in the Issue description and added the updated proposal below that.
Do you plan on updating it, or am I allowed to modify it?
Modified the issue to reflect the updated proposal
I am not a big fan of this design either, but I don't have a concrete reason to veto it. I just think it's overly convoluted, I believe the main reason people wanna enable extensions is to add support for different data formats, and for that no config is needed, and we could just have a list of strings and be done with it. Extensions are not at all as important as catalogs in trino, and I would like to keep the config simple, even if it maybe doesn't support everything (yet). But like I said, I'm not vetoing the current design, just wanted to share my thoughts too.
I've given it some thought and agree that it would be best to keep this a list of strings with no configuration and will update the issue with the justification.
It's fine for me to just implement a Vec<String>
for know, but I would really want to move it below a key, e.g spec.clusterConfig.additionalExtensions.customExtensions
, so that we can non-breaking later add extensions that require configurations (such as the google one). The config can not simply be put in configOverrides, as e.g. Secrets need to be propagates to env vars and referenced or secrets mounted somewhere or a certificate needs to be added to the truststore etc.
I think for the google extension (and similar features like this) I'd rather just have a clusterConfig.gcp
config key. I think that an extension is neede for that is then more an implementation detail. same as how we include the hdfs extension when HDFS deep storage is configured.
Moving to voting.
The decision has been accepted.
Do we have documentation for this? If so: Could you add a link to the generated nightly docs please?
nightly docs here: https://docs.stackable.tech/home/nightly/druid/usage-guide/extensions
For any developers attempting to do this in the current version (as of writing, 24.3) it is described in https://github.com/orgs/stackabletech/discussions/15.
In essence, you have to specify the list of default extensions (which can be seen in the router GUI on the main page) and then additionally whatever additional extensions you would like to use.
As a user I want to know when which extensions get loaded by default and also how I can configure additional extensions to load. Loading more extensions should be easy.
Improve loading extensions
Currently you have to use a configOverride to load a new extension. This is error prone, because you have to not only specify your additional extension, but also all the other extensions that are configured by the operator. This could either be a simple list of strings to add, or a map with the option to have configuration settings for the extensions in there.
CRD Change
We propose the following change to the DruidCluster CRD:
Only a list of extensions to load can be specified to keep the CRD as simple and clean as possible.
Many extensions that require additional configuration are already being loaded conditionally based on other parts of the CRD, e.g. authentication, deep storage or metadata storage. If in some cases this is not enough (e.g. because we haven't added an option for deep storage yet) the extension can still be loaded by adding it to the additionalExtensions list with the extension-specific configuration being added using configOverrides.
Document which extensions are loaded by default
We load some extensions conditionally, such as the google or HDFS extensions or the pac4j extension for OIDC support. We should document when which extension gets loaded. Also the list of the default extensions that we load. Maybe also add a link to the Druid documentation where possible extensions are listed.