splunk / splunk-ansible

Ansible playbooks for configuring and managing Splunk Enterprise and Universal Forwarder deployments

Address some limitations of the app installation solution #545

Open romain-bellanger opened 4 years ago

romain-bellanger commented 4 years ago

Hello,

While maintaining some clusters in Kubernetes using the alpha version of the splunk-operator, one of the main issues we are facing relates to the deployment of Splunk applications. I will open a dedicated ticket in the operator's repository for problems specific to the operator, and focus here on the app installation solution provided by splunk-ansible.

Context

In our traditional environment, we have a lot of applications already maintained in git repositories, automatically validated, packaged into tarballs, and placed into an artifact repository through a CI pipeline. This process is used for the internal configuration of our clusters (pipelines, indexes, inputs, authentication, access control) as well as for user applications (e.g. dashboards and alerts).

Several of these applications were built with the help of Splunk PS, who generally recommended putting a small set of config files into apps with meaningful names to ease the life of Splunk support. So our base config, common to all clusters, is already composed of a dozen small apps / tarballs.

We are also trying to avoid duplicating configuration. So the same "base" apps are loaded to many clusters, and only the necessary settings are managed through cluster-specific apps. Precedence and "composition" are used to achieve this. For instance, our authentication / access control can be composed of 4 apps:

This type of solution is also used to define the volumes and default index paths and settings in a reusable app, separately from the specific index definitions for a given cluster.
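To make this layering concrete, here is a purely hypothetical sketch (the app names, hostnames and settings are invented, not our actual configuration): a shared base app carries the full LDAP strategy, and a small cluster-specific app layered on top only overrides what differs.

```ini
# ama_base_authentication/default/authentication.conf  (shared by all clusters)
[authentication]
authType = LDAP
authSettings = corporate_ldap

[corporate_ldap]
host = ldap.example.com
port = 636
bindDN = cn=splunk,ou=services,dc=example,dc=com

# ama_cluster_xyz_authentication/default/authentication.conf  (cluster-specific)
# Only what differs: the LDAP proxy/cache URL and the bind password,
# encrypted with this cluster's own splunk.secret.
[corporate_ldap]
host = ldap-proxy.cluster-xyz.example.com
bindDNpassword = $7$...
```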

Our understanding of the solution

Is this understanding correct?

Experienced limitations and issues

Is there already any solution available for these issues which I missed?

Proposals

Any other solution to address these issues would be very welcome; this is only to share some ideas...

Deploying apps to both the cluster-master and indexers, or both the deployer and search-heads, does not seem to be possible with the current configuration structure. A new property apps_install_paths could be defined in the ansible configuration with the following structure:

apps_install_paths:
  /opt/splunk/etc/apps:
    - https://repo-url/prefix_app1.tar.gz
  /opt/splunk/etc/master-apps:
    - https://repo-url/prefix_app2.tar.gz
apps_control_regex: '^prefix_.*'

Instead of using the apps/local API, the tarballs could be directly extracted to the target path. Apps are not always expected to be self-contained, and it seems to me that the install API validates or activates them individually. From my perspective, the cluster-bundle validation (currently skipped) would be a better solution, as it validates the apps altogether.
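A rough sketch (not existing splunk-ansible code) of how the proposed apps_install_paths mapping could be consumed with plain extraction instead of the apps/local API; cluster-bundle validation could then run afterwards on the cluster master:

```yaml
# Sketch only: extract each tarball directly into its configured target path.
# apps_install_paths is the property proposed above, not an existing variable.
- name: Extract app tarballs into their target paths
  ansible.builtin.unarchive:
    src: "{{ item.1 }}"
    dest: "{{ item.0.key }}"
    remote_src: yes
  loop: "{{ apps_install_paths | dict2items | subelements('value') }}"

# The apps pushed to etc/master-apps could then be validated together,
# e.g. with "splunk validate cluster-bundle" on the cluster master.
```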

Splunk PS recommended that we prefix all our apps with "ama_" to identify the apps which we install. In our traditional environment, our ansible playbooks only enforce the content of the app directories which have this prefix, meaning they remove an app if it is no longer part of the configuration, or clean up files which are not contained in the tarballs. This solution could be considered here, using the apps_control_regex property. The regex could also select the full name of some Splunk apps if needed.
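A possible sketch of that enforcement, assuming the list of expected app names can be derived from the configured tarballs (the expected_app_names variable is invented here for illustration):

```yaml
# Sketch only: remove managed app directories (matching apps_control_regex)
# that are no longer part of the configuration.
- name: Find app directories managed by this playbook
  ansible.builtin.find:
    paths: /opt/splunk/etc/apps
    file_type: directory
    use_regex: yes
    patterns: "{{ apps_control_regex }}"
  register: managed_apps

- name: Remove managed apps that are no longer configured
  ansible.builtin.file:
    path: "{{ item.path }}"
    state: absent
  loop: "{{ managed_apps.files }}"
  # expected_app_names: hypothetical list of app directory names derived
  # from the configured tarballs.
  when: (item.path | basename) not in expected_app_names
```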

I understand that it could be expensive and confusing to maintain the two solutions...

nwang92 commented 4 years ago
Is this understanding correct?

Sounds like you've got it figured out to me! Few notes:

For your intended use cases:

But overall, I like your proposals and I'll relay them around. I'm not opposed to the idea of some feature-flag that does wipe an app if it's no longer in the default.yml/environment variable, but obviously that comes with a few caveats we'd want to document :)

romain-bellanger commented 4 years ago

Hi @nwang92, many thanks for this feedback.

> inventory "randomizes" the order of this list - if this is an issue, we can easily address it. Theoretically app install order shouldn't matter, as long as the final state has everything (I don't believe you can control order of apps pushed during deployer/cluster-master bundles?)

The randomization of the order of the apps does not matter for the bundle deployment, but it does seem to matter for the calls to the apps/local REST API. It is not the behavior of the indexer cluster which changes, it's the behavior of the playbook execution, during the initial one-by-one installation of apps on the cluster master. I patched the docker images to change the set to a list, and I could make the playbook succeed by placing the apps in a specific order, while it was failing otherwise.

This was the case for the LDAP setup of a cloud region (described in my initial comment), for which we use the same config as the clusters running in our DCs, but use app precedence to override the URL with a proxy/cache (the main LDAP server is behind a firewall) or to provide encrypted credentials. The apps with precedence had to be installed first because of the API, while this doesn't matter from the bundle perspective. Disabling was then done in the same order as the install, and also took a lot of time on the LDAP-related app, I guess because the proxy and credentials were disabled first, but I didn't attempt doing it in the opposite order.

My concern here is not really that apps are extracted in a random order; that should be perfectly fine, and I don't really want to worry about the order in which the apps are listed. The problem is that the order shouldn't impact the behavior of the playbook or of Splunk.
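For reference, once the set was turned back into an ordered list, the ordering that made the playbook succeed looked roughly like this (assuming the apps are listed under splunk.apps_location in default.yml; the app names are invented): the small override apps had to come before the shared base app.

```yaml
# Sketch of the ordering that worked with the apps/local API (names invented):
# the override apps (LDAP proxy URL, encrypted bind credentials) are installed
# before the shared base app, which on its own points at an unreachable server.
splunk:
  apps_location:
    - https://repo-url/ama_cluster_xyz_ldap_proxy.tar.gz
    - https://repo-url/ama_cluster_xyz_ldap_credentials.tar.gz
    - https://repo-url/ama_base_authentication.tar.gz
```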

> apps located in etc/apps are then disabled - this only applies to cluster masters and deployers. I believe this is intended to be aligned with Splunk best-practices, as these instances are more of "administrative" roles and shouldn't be doing any of the heavy-lifting that search heads/peers are for. We can certainly think of ways to open this up though.

For sure I don't want the apps for indexers to be enabled on the cluster-master! :-) I just want the possibility to deploy different apps, specific to the cluster-master or deployer, which would not be disabled.

> I can see your use case for LDAP though - I was ultimately thinking of exposing a separate parameter to auto-configure various auth settings

This would cover LDAP, but exposing settings one by one might not be as efficient as simply opening up app configuration. LDAP has a lot of settings, and it is tied to access control, so role definitions must also be covered. Some people might use SAML instead. I also mentioned health thresholds, timeouts, or bucket fixup settings which can be useful, and maybe new features will be added in the future... A solution to load apps onto the cluster-master and deployer would cover all of these at once, including any future settings, and would probably not require much more effort to develop than exposing LDAP settings from the ansible config.

> The apps are installed and then disabled one by one - correct, again only for deployer + cluster master. I wasn't aware that this disable task can take so long in those cases. I suppose the only thing to blame might be "bad" configs, although that might suggest that the apps are highly coupled and can't function independently? Switching this out to ordered procedures is an easy ask though.

Some of our apps are indeed highly coupled and can't function independently. As explained, we reuse an app containing the LDAP settings and access control common to multiple clusters, and then, in small cluster-specific apps, we only override the URL to target a proxy, add bind credentials encrypted with the specific encryption key of the cluster, or add new roles. We also use this for index paths: the volumes, paths and default settings are defined in one app for indexers and one for search heads (a path is mandatory even if nothing is indexed there), and the indexes are defined in a separate app loaded on both (for search autocompletion on search heads) to avoid maintaining the list of indexes twice. So the app containing the list of indexes doesn't function without the volume and path definitions. Preserving the order of apps could be a solution... but ideally, the order should not matter.
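As a hypothetical illustration of that coupling (app and volume names invented), the shared index list only works once the volume/path app is present:

```ini
# ama_base_index_volumes/default/indexes.conf  (indexer-side defaults)
[volume:hot]
path = /mnt/fast/splunk

[volume:cold]
path = /mnt/slow/splunk

[default]
homePath = volume:hot/$_index_name/db
coldPath = volume:cold/$_index_name/colddb
thawedPath = $SPLUNK_DB/$_index_name/thaweddb

# ama_indexes/default/indexes.conf  (loaded on indexers and search heads)
# Only the index names; relies entirely on the defaults defined above.
[web_access]
[application_logs]
```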

> Part of the reason we chose to leverage the Splunk REST API for app installation is the hope to improve these features going forward

I understand, and this is fine from my perspective. The problems are only the individual validation of apps (which could be addressed with an API parameter to turn validation off, exposed from the ansible config), and the installation to etc/apps first, which makes the process quite complex and could have undesired effects on the cluster master, where the apps intended for the indexers are temporarily installed and enabled (this could be addressed with an API parameter specifying the installation path).