splunk_cluster_master : "Apply cluster bundle" task fails when the cluster is already applying another bundle, causing the Kubernetes pod to go into a crash loop.

cderocco5 commented 9 months ago

When a Splunk cluster manager/master Kubernetes pod is restarted, the restart fails if a bundle is already being applied to the indexers.

Error message:

TASK [splunk_cluster_master : Apply cluster bundle] ****************************
fatal: [localhost]: FAILED! => {
    "changed": false,
    "cmd": [
        "/opt/splunk/bin/splunk",
        "apply",
        "cluster-bundle",
        "-auth",
        "admin:CQL5A/oeVbda/I711kFP2PhFXvIv3k2w",
        "--skip-validation",
        "--answer-yes"
    ],
    "delta": "0:00:01.015543",
    "end": "2023-12-06 15:28:24.842204",
    "failed_when_result": true,
    "rc": 0,
    "start": "2023-12-06 15:28:23.826661"
}

STDOUT:

Encountered some errors while applying the bundle.

STDERR:

WARNING: Server Certificate Hostname Validation is disabled. Please see server.conf/[sslConfig]/cliVerifyServerName for details.
Cannot apply (or) validate configuration settings. Rolling restart of the peers is in progress.

Steps to recreate:

1) Apply a cluster bundle from the cluster manager/master: splunk apply cluster-bundle
2) Delete the Kubernetes Splunk cluster manager/master pod.
3) The pod will try to restart and fail with the above error message.

Expected Behavior:

The Ansible task should detect that a bundle is already being applied and skip the "Apply cluster bundle" task, or the "Apply cluster bundle" task should ignore this error so that it does not crash the pod on restart. Alternatively, provide a default.yml key that disables the "Apply cluster bundle" task. This would prevent unexpected indexer rolling restarts when a pod or node dies.
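To make the "ignore the error" option concrete, here is a minimal sketch of the decision logic in Python. It is not the role's actual failed_when condition, which is not shown in this report. The function name should_fail and the tolerate_in_progress parameter are hypothetical; the two marker strings are copied from the STDOUT and STDERR above.

```python
# Sketch only: models how a failure check could treat an in-progress
# rolling restart as benign. Not the splunk-ansible implementation.

# Marker strings copied verbatim from the error output in this report.
IN_PROGRESS_MARKER = "Rolling restart of the peers is in progress"
ERROR_MARKER = "Encountered some errors while applying the bundle"


def should_fail(rc: int, stdout: str, stderr: str,
                tolerate_in_progress: bool = True) -> bool:
    """Return True if the bundle-apply step should be treated as failed.

    tolerate_in_progress is a hypothetical switch: when set, output that
    reports an ongoing rolling restart is not considered a failure.
    """
    if tolerate_in_progress and IN_PROGRESS_MARKER in stderr:
        # Another bundle push is already running; let the pod start.
        return False
    # Otherwise fail on a non-zero exit code or the generic error line.
    return rc != 0 or ERROR_MARKER in stdout


if __name__ == "__main__":
    stdout = "Encountered some errors while applying the bundle.\n"
    stderr = ("Cannot apply (or) validate configuration settings. "
              "Rolling restart of the peers is in progress.\n")
    print(should_fail(0, stdout, stderr))
    print(should_fail(0, stdout, stderr, tolerate_in_progress=False))
```

With the output from this report, the check passes when the switch is on and fails when it is off, which corresponds to the current behavior where rc is 0 but failed_when_result is true.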

martinr103 commented 9 months ago

I believe that this is very much related to another issue that was already closed 3 years ago, without actually being resolved.

You might want to review the comments I wrote on the closed issue two weeks ago: https://github.com/splunk/splunk-ansible/issues/35#issuecomment-1824222893

adityapinglesf commented 9 months ago

Thanks @martinr103 and @cderocco5 for reporting the concern again. I am also in touch with the support team that @cderocco5 has likely interacted with. I am going over the issue and possible resolutions and will get back to you soon.

cderocco5 commented 9 months ago

Thanks @adityapinglesf. When you have 455 indexers and 6 PB of data, rolling restarts take over 30 hours. We need a way to stop the cluster manager Docker image from triggering a one-indexer-at-a-time rolling restart whenever the cluster manager pod restarts.

cderocco5 commented 7 months ago

Any progress on this issue?
