splunk / splunk-operator

Splunk Operator for Kubernetes

Splunk Operator: indexers don't start if search-heads still starting #1390

Open yaroslav-nakonechnikov opened 3 weeks ago

yaroslav-nakonechnikov commented 3 weeks ago

Please select the type of request

Bug

Tell us more

Describe the request

[yn@ip-100-65-8-59 /]$ kubectl get pods -n splunk-operator
NAME                                                  READY   STATUS    RESTARTS   AGE
splunk-43105-cluster-manager-0                        1/1     Running   0          19m
splunk-43105-license-manager-0                        1/1     Running   0          30m
splunk-c-43105-standalone-0                           1/1     Running   0          30m
splunk-e-43105-deployer-0                             0/1     Running   0          8m15s
splunk-e-43105-search-head-0                          0/1     Running   0          8m15s
splunk-e-43105-search-head-1                          0/1     Running   0          8m15s
splunk-e-43105-search-head-2                          0/1     Running   0          8m15s
splunk-operator-controller-manager-58b545f67c-8rrhx   2/2     Running   0          31m

and then:

NAME                                                  READY   STATUS    RESTARTS   AGE
splunk-43105-cluster-manager-0                        1/1     Running   0          21m
splunk-43105-license-manager-0                        1/1     Running   0          32m
splunk-c-43105-standalone-0                           1/1     Running   0          32m
splunk-e-43105-deployer-0                             0/1     Running   0          11m
splunk-e-43105-search-head-0                          1/1     Running   0          11m
splunk-e-43105-search-head-1                          1/1     Running   0          11m
splunk-e-43105-search-head-2                          1/1     Running   0          11m
splunk-operator-controller-manager-58b545f67c-8rrhx   2/2     Running   0          34m
splunk-site3-43105-indexer-0                          0/1     Running   0          2m17s
splunk-site3-43105-indexer-1                          0/1     Running   0          2m17s
splunk-site3-43105-indexer-2                          0/1     Running   0          2m17s

This is unbelievable, and extremely strange, that in 2.6.1 there is still a dependency check between Splunk search heads and indexers!

Expected behavior: Indexers should start without depending on search heads!
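
For reference, here is roughly how the operator's own view of this ordering can be watched from the custom resources. The CR names below are guessed from the pod names above, and the kinds and phase field are how I understand the enterprise.splunk.com CRDs, so treat the exact names as assumptions:

# Inferred CR names: "43105" (ClusterManager), "e-43105" (SearchHeadCluster),
# "site3-43105" (IndexerCluster); adjust to the names in your manifests.
kubectl get clustermanager,searchheadcluster,indexercluster -n splunk-operator

# While the SearchHeadCluster phase is still Pending/Updating, the IndexerCluster
# sits behind it instead of being reconciled to Ready.
kubectl get searchheadcluster e-43105 -n splunk-operator -o jsonpath='{.status.phase}{"\n"}'
kubectl get indexercluster site3-43105 -n splunk-operator -o jsonpath='{.status.phase}{"\n"}'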

yaroslav-nakonechnikov commented 3 weeks ago

This was already reported in #1260, and there were two calls afterwards where I described why the dependency logic is broken for Kubernetes deployments.

Now we can test 2.6.1, and we still see that part of the platform can't start just because of this problematic logic. Old case: 3448046
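
To confirm it really is the operator holding the IndexerCluster back behind the SHC (and not something else), the controller logs can be checked; the deployment and container names below are the defaults from the operator install, so treat them as assumptions:

# Recent reconcile activity from the controller manager.
kubectl logs -n splunk-operator deploy/splunk-operator-controller-manager -c manager --tail=200

# Focus on the IndexerCluster reconciles to see what they are waiting for.
kubectl logs -n splunk-operator deploy/splunk-operator-controller-manager -c manager | grep -i indexercluster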

vivekr-splunk commented 3 weeks ago

@yaroslav-nakonechnikov we will get back to you with regard to this issue.

yaroslav-nakonechnikov commented 1 week ago

So, sadly, this is extremely painful, as there may be issues like this:

FAILED - RETRYING: [localhost]: Initialize SHC cluster config (2 retries left).
FAILED - RETRYING: [localhost]: Initialize SHC cluster config (1 retries left).

TASK [splunk_search_head : Initialize SHC cluster config] **********************
fatal: [localhost]: FAILED! =>

{ "attempts": 60, "changed": false, "cmd": [ "/opt/splunk/bin/splunk", "init", "shcluster-config", "-auth", "admin:j3Q9SWJlLBOlc3RWejMnUb6e", "-mgmt_uri", "https://splunk-e-43345-search-head-1.splunk-e-43345-search-head-headless.splunk-operator.svc.cluster.local:8089", "-replication_port", "9887", "-replication_factor", "3", "-conf_deploy_fetch_url", "https://splunk-e-43345-deployer-service:8089", "-secret", "RNr25biFMA4Z3SUbXB3VGwW6", "-shcluster_label", "she_cluster" ], "delta": "0:00:00.806237", "end": "2024-10-31 08:05:54.588881", "rc": 24, "start": "2024-10-31 08:05:53.782644" }
STDERR:

WARNING: Server Certificate Hostname Validation is disabled. Please see server.conf/[sslConfig]/cliVerifyServerName for details.
Login failed

MSG:

non-zero return code

PLAY RECAP *********************************************************************
localhost : ok=132 changed=11 unreachable=0 failed=1 skipped=68 rescued=0 ignored=0

The problem, as I understand it, is in this task: https://github.com/splunk/splunk-ansible/blob/53a9a70897896e279b43478583b13256e75894a2/roles/splunk_search_head/tasks/search_head_clustering.yml#L6

And the search heads are stuck in an infinite retry loop, which means none of the indexers get started.

This happened on splunk-operator 2.6.1 and Splunk 9.1.6.
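
A way to narrow down the "Login failed" above is to re-run the same init command by hand inside a search-head pod and to compare the admin password the container uses with the namespace-level secret. The secret name and the /mnt/splunk-secrets mount path below are the operator defaults as I understand them, so adjust if your deployment differs; the placeholder values come from the Ansible output above:

# Re-run the failing command interactively (substitute the real values from the output above).
kubectl exec -it -n splunk-operator splunk-e-43345-search-head-0 -- \
  /opt/splunk/bin/splunk init shcluster-config \
    -auth 'admin:<admin-password>' \
    -mgmt_uri https://splunk-e-43345-search-head-1.splunk-e-43345-search-head-headless.splunk-operator.svc.cluster.local:8089 \
    -replication_port 9887 -replication_factor 3 \
    -conf_deploy_fetch_url https://splunk-e-43345-deployer-service:8089 \
    -secret '<shcluster-secret>' -shcluster_label she_cluster

# Compare the password the pod was started with against the namespace-wide secret.
kubectl exec -n splunk-operator splunk-e-43345-search-head-0 -- cat /mnt/splunk-secrets/password
kubectl get secret splunk-splunk-operator-secret -n splunk-operator \
  -o jsonpath='{.data.password}' | base64 -d; echo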

yaroslav-nakonechnikov commented 1 week ago

Extremely strange that the standalone instance started without issues:

NAME                                                  READY   STATUS    RESTARTS      AGE
splunk-43345-cluster-manager-0                        1/1     Running   1 (70m ago)   79m
splunk-43345-license-manager-0                        1/1     Running   0             79m
splunk-c-43345-standalone-0                           1/1     Running   0             79m
splunk-e-43345-deployer-0                             0/1     Running   0             66m
splunk-e-43345-search-head-0                          0/1     Running   3 (14m ago)   65m
splunk-e-43345-search-head-1                          0/1     Running   3 (14m ago)   65m
splunk-e-43345-search-head-2                          0/1     Running   3 (14m ago)   65m
splunk-operator-controller-manager-5c684d667d-smgdq   2/2     Running   0             80m
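
For reference, to see why the search-head pods keep restarting here while the standalone comes up fine, the pod events and the logs from the previous container run can be checked:

# Events (probe failures, OOM kills, etc.) for one of the search heads.
kubectl describe pod splunk-e-43345-search-head-0 -n splunk-operator | tail -n 30

# Logs from the container run before the last restart.
kubectl logs splunk-e-43345-search-head-0 -n splunk-operator --previous | tail -n 50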

yaroslav-nakonechnikov commented 1 week ago

And with this test I can confirm that 9.1.6 is not working at all.