splunk / splunk-ansible

Ansible playbooks for configuring and managing Splunk Enterprise and Universal Forwarder deployments

Address some limitations of the app installation solution #545

Open romain-bellanger opened 4 years ago

romain-bellanger commented 4 years ago

Hello,

While maintaining some clusters in Kubernetes using the alpha version of the splunk-operator, one of the main issues we are facing relates to the deployment of Splunk applications. I will open a dedicated ticket in the operator's repository for problems specific to the operator, and focus here on the app installation solution provided by splunk-ansible.

Context

In our traditional environment, we have a lot of applications already maintained in git repositories, automatically validated, packaged into tarballs, and placed into an artifact repository through a CI pipeline. This process is used for the internal configuration of our clusters (pipelines, indexes, inputs, authentication, access control) as well as for user applications (e.g. dashboards and alerts).

Several of these applications were built with the help of Splunk PS, who generally recommended putting a small set of config files into apps with meaningful names to ease the life of Splunk support. So our base config, common to all clusters, is already composed of a dozen small apps / tarballs.

We are also trying to avoid duplicating configuration. So the same "base" apps are loaded to many clusters, and only the necessary settings are managed through cluster-specific apps. Precedence and "composition" are used to achieve this. For instance, our authentication / access control can be composed of 4 apps:

This type of solution is also used to define the volumes and default index paths and settings in a reusable app, separately from the specific index definitions for a given cluster.
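To make this layering concrete, here is a purely hypothetical sketch (the app names, hostnames and settings are invented, not our actual configuration): a shared base app carries the full LDAP strategy, and a small cluster-specific app layered on top only overrides what differs.

```ini
# ama_base_authentication/default/authentication.conf  (shared by all clusters)
[authentication]
authType = LDAP
authSettings = corporate_ldap

[corporate_ldap]
host = ldap.example.com
port = 636
bindDN = cn=splunk,ou=services,dc=example,dc=com

# ama_cluster_xyz_authentication/default/authentication.conf  (cluster-specific)
# Only what differs: the LDAP proxy/cache URL and the bind password,
# encrypted with this cluster's own splunk.secret.
[corporate_ldap]
host = ldap-proxy.cluster-xyz.example.com
bindDNpassword = $7$...
```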

Our understanding of the solution

Is this understanding correct?

Experienced limitations and issues

Is there already any solution available for these issues which I missed?

Proposals

Any other solution to address these issues would be very welcome; this is only to share some ideas...

Deploying apps to both the cluster-master and indexers, or both the deployer and search-heads, does not seem to be possible with the current configuration structure. A new property apps_install_paths could be defined in the ansible configuration with the following structure:

apps_install_paths:
  /opt/splunk/etc/apps:
    - https://repo-url/prefix_app1.tar.gz
  /opt/splunk/etc/master-apps:
    - https://repo-url/prefix_app2.tar.gz
apps_control_regex: '^prefix_.*'

Instead of using the apps/local API, the tarballs could be directly extracted to the target path. Apps are not always expected to be self-contained, and it seems to me that the install API validates or activates them individually. From my perspective, the cluster-bundle validation (currently skipped) would be a better solution, as it validates the apps altogether.
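A rough sketch (not existing splunk-ansible code) of how the proposed apps_install_paths mapping could be consumed with plain extraction instead of the apps/local API; cluster-bundle validation could then run afterwards on the cluster master:

```yaml
# Sketch only: extract each tarball directly into its configured target path.
# apps_install_paths is the property proposed above, not an existing variable.
- name: Extract app tarballs into their target paths
  ansible.builtin.unarchive:
    src: "{{ item.1 }}"
    dest: "{{ item.0.key }}"
    remote_src: yes
  loop: "{{ apps_install_paths | dict2items | subelements('value') }}"

# The apps pushed to etc/master-apps could then be validated together,
# e.g. with "splunk validate cluster-bundle" on the cluster master.
```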

Splunk PS recommended that we prefix all our apps with "ama_" to identify the apps which we install. In our traditional environment, our ansible playbooks only enforce the content of the app directories which have this prefix, meaning they remove an app if it is no longer part of the configuration, or clean up files which are not contained in the tarballs. This solution could be considered here, using the apps_control_regex property. The regex could also select the full name of some Splunk apps if needed.
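A possible sketch of that enforcement, assuming the list of expected app names can be derived from the configured tarballs (the expected_app_names variable is invented here for illustration):

```yaml
# Sketch only: remove managed app directories (matching apps_control_regex)
# that are no longer part of the configuration.
- name: Find app directories managed by this playbook
  ansible.builtin.find:
    paths: /opt/splunk/etc/apps
    file_type: directory
    use_regex: yes
    patterns: "{{ apps_control_regex }}"
  register: managed_apps

- name: Remove managed apps that are no longer configured
  ansible.builtin.file:
    path: "{{ item.path }}"
    state: absent
  loop: "{{ managed_apps.files }}"
  # expected_app_names: hypothetical list of app directory names derived
  # from the configured tarballs.
  when: (item.path | basename) not in expected_app_names
```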

I understand that it could be expensive and confusing to maintain the two solutions...

nwang92 commented 4 years ago
Is this understanding correct?

Sounds like you've got it figured out to me! Few notes:

For your intended use cases:

But overall, I like your proposals and I'll relay them around. I'm not opposed to the idea of some feature-flag that does wipe an app if it's no longer in the default.yml/environment variable, but obviously that comes with a few caveats we'd want to document :)

romain-bellanger commented 4 years ago

Hi @nwang92, many thanks for this feedback.

> inventory "randomizes" the order of this list - if this is an issue, we can easily address it. Theoretically app install order shouldn't matter, as long as the final state has everything (I don't believe you can control order of apps pushed during deployer/cluster-master bundles?)

The randomization of the order of the apps does not matter for the bundle deployment, but it does seem to matter for the calls to the apps/local REST API. It is not the behavior of the indexer cluster which changes, it's the behavior of the playbook execution, during the initial one-by-one installation of apps on the cluster master. I patched the docker images to change the set to a list, and I could make the playbook succeed by placing the apps in a specific order, while it was failing otherwise.

This was the case for the LDAP setup of a cloud region (described in my initial comment), for which we use the same config as the clusters running in our DCs, but use app precedence to override the URL with a proxy/cache (the main LDAP server is behind a firewall) or to provide encrypted credentials. The apps with precedence had to be installed first because of the API, while this doesn't matter from the bundle perspective. Disabling was then done in the same order as the install, and also took a lot of time on the LDAP-related app, I guess because the proxy and credentials were disabled first, but I didn't attempt doing it in the opposite order.

My concern here is not really that apps are extracted in a random order; that should be perfectly fine, and I don't really want to worry about the order in which the apps are listed. The problem is that the order shouldn't impact the behavior of the playbook or of Splunk.
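For reference, once the set was turned back into an ordered list, the ordering that made the playbook succeed looked roughly like this (assuming the apps are listed under splunk.apps_location in default.yml; the app names are invented): the small override apps had to come before the shared base app.

```yaml
# Sketch of the ordering that worked with the apps/local API (names invented):
# the override apps (LDAP proxy URL, encrypted bind credentials) are installed
# before the shared base app, which on its own points at an unreachable server.
splunk:
  apps_location:
    - https://repo-url/ama_cluster_xyz_ldap_proxy.tar.gz
    - https://repo-url/ama_cluster_xyz_ldap_credentials.tar.gz
    - https://repo-url/ama_base_authentication.tar.gz
```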

> apps located in etc/apps are then disabled - this only applies to cluster masters and deployers. I believe this is intended to be aligned with Splunk best-practices, as these instances are more of "administrative" roles and shouldn't be doing any of the heavy-lifting that search heads/peers are for. We can certainly think of ways to open this up though.

For sure I don't want the apps for indexers to be enabled on the cluster-master! :-) I just want the possibility to deploy different apps, specific to the cluster-master or deployer, which would not be disabled.

> I can see your use case for LDAP though - I was ultimately thinking of exposing a separate parameter to auto-configure various auth settings

This would cover LDAP, but exposing settings one by one might not be as efficient as simply opening up app configuration. LDAP has a lot of settings, and it is tied to access control, so role definitions must also be covered. Some people might use SAML instead. I also mentioned health thresholds, timeouts, or bucket fixup settings which can be useful, and maybe new features will be added in the future... A solution to load apps onto the cluster-master and deployer would cover all of these at once, including any future settings, and would probably not require much more effort to develop than exposing LDAP settings from the ansible config.

> The apps are installed and then disabled one by one - correct, again only for deployer + cluster master. I wasn't aware that this disable task can take so long in those cases. I suppose the only thing to blame might be "bad" configs, although that might suggest that the apps are highly coupled and can't function independently? Switching this out to ordered procedures is an easy ask though.

Some of our apps are indeed highly coupled and can't function independently. As explained, we reuse an app containing the LDAP settings and access control common to multiple clusters, and then, in small cluster-specific apps, we only override the URL to target a proxy, add bind credentials encrypted with the specific encryption key of the cluster, or add new roles. We also use this for index paths: the volumes, paths and default settings are defined in one app for indexers and one for search heads (a path is mandatory even if nothing is indexed there), and the indexes are defined in a separate app loaded on both (for search autocompletion on search heads) to avoid maintaining the list of indexes twice. So the app containing the list of indexes doesn't function without the volume and path definitions. Preserving the order of apps could be a solution... but ideally, the order should not matter.
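As a hypothetical illustration of that coupling (app and volume names invented), the shared index list only works once the volume/path app is present:

```ini
# ama_base_index_volumes/default/indexes.conf  (indexer-side defaults)
[volume:hot]
path = /mnt/fast/splunk

[volume:cold]
path = /mnt/slow/splunk

[default]
homePath = volume:hot/$_index_name/db
coldPath = volume:cold/$_index_name/colddb
thawedPath = $SPLUNK_DB/$_index_name/thaweddb

# ama_indexes/default/indexes.conf  (loaded on indexers and search heads)
# Only the index names; relies entirely on the defaults defined above.
[web_access]
[application_logs]
```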

> Part of the reason we chose to leverage the Splunk REST API for app installation is the hope to improve these features going forward

I understand, and this is fine from my perspective. The problems are only the individual validation of apps (which could be addressed with an API parameter to turn validation off, exposed from the ansible config), and the installation to etc/apps first, which makes the process quite complex and could have undesired effects on the cluster master, where the apps intended for the indexers are temporarily installed and enabled (this could be addressed with an API parameter specifying the installation path).