splunk / splunk-operator

Splunk Operator for Kubernetes

Splunk Operator: add some troubleshooting advice. #1293

Closed yaroslav-nakonechnikov closed 4 months ago

yaroslav-nakonechnikov commented 9 months ago

Please select the type of request

Enhancement

Tell us more

Describe the request: I need a better understanding of the phase statuses for the CRDs:

$ for crd in $(kubectl get crd | grep splunk | awk -F ' ' '{print $1}'); do echo $crd; kubectl get $crd -n splunk-operator; done
clustermanagers.enterprise.splunk.com
NAME    PHASE   MANAGER   DESIRED   READY   AGE
39188   Ready                               6d1h
clustermasters.enterprise.splunk.com
No resources found in splunk-operator namespace.
indexerclusters.enterprise.splunk.com
NAME          PHASE   MASTER   MANAGER   DESIRED   READY   AGE
site1-39188   Error            Ready     3         3       6d1h
site2-39188   Error            Ready     3         3       6d1h
site3-39188   Error            Ready     3         3       6d1h
site4-39188   Error            Ready     4         3       6d1h
site5-39188   Error            Ready     3         3       6d1h
site6-39188   Error            Ready     3         3       6d1h
licensemanagers.enterprise.splunk.com
NAME    PHASE   AGE
39188   Ready   6d1h
licensemasters.enterprise.splunk.com
No resources found in splunk-operator namespace.
monitoringconsoles.enterprise.splunk.com
NAME    PHASE     DESIRED   READY   AGE
39188   Pending                     6d1h
searchheadclusters.enterprise.splunk.com
NAME      PHASE   DEPLOYER   DESIRED   READY   AGE
e-39188   Error   Error      3         3       6d1h
standalones.enterprise.splunk.com
NAME      PHASE   DESIRED   READY   AGE
c-39188   Ready   1         1       6d1h
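
To narrow a listing like the one above down to the resources that need attention, a small filter can help. This is a minimal sketch, not part of the operator: `non_ready_rows` is a hypothetical helper name, and it assumes PHASE is the second column of the table, as it is in every table shown here.

```shell
# Print NAME and PHASE for every row of a `kubectl get <crd>` table
# whose PHASE is not "Ready". The table is read from stdin.
# Assumption: PHASE is the second whitespace-separated column.
non_ready_rows() {
  awk 'NR == 1 { next }                 # skip the header row
       $2 != "Ready" { print $1, $2 }'
}

# Example (requires cluster access):
#   kubectl get indexerclusters.enterprise.splunk.com -n splunk-operator | non_ready_rows
```

Resource types with no instances produce no rows, because kubectl writes "No resources found" to stderr rather than stdout.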

Ready is clear: everything works as expected. Pending is understandable, as the pod is not currently running. Error is not clear at all. For example:

$ kubectl get pods -n splunk-operator | grep site
splunk-site1-39188-indexer-0                          1/1     Running   0               67m
splunk-site1-39188-indexer-1                          1/1     Running   0               67m
splunk-site1-39188-indexer-2                          1/1     Running   0               67m
splunk-site2-39188-indexer-0                          1/1     Running   0               67m
splunk-site2-39188-indexer-1                          1/1     Running   0               67m
splunk-site2-39188-indexer-2                          1/1     Running   0               67m
splunk-site3-39188-indexer-0                          1/1     Running   0               67m
splunk-site3-39188-indexer-1                          1/1     Running   0               67m
splunk-site3-39188-indexer-2                          1/1     Running   0               67m
splunk-site5-39188-indexer-0                          1/1     Running   0               67m
splunk-site5-39188-indexer-1                          1/1     Running   0               67m
splunk-site5-39188-indexer-2                          1/1     Running   0               67m
splunk-site6-39188-indexer-0                          1/1     Running   0               67m
splunk-site6-39188-indexer-1                          1/1     Running   0               67m
splunk-site6-39188-indexer-2                          1/1     Running   0               67m

So I'd expect Ready for 5 of the 6 records, but it shows Error for every indexer resource.

$ kubectl describe indexerclusters.enterprise.splunk.com site4-39188 -n splunk-operator
....
Status:
  Idxc Password Changed Secrets:
  Cluster Manager Phase:  Ready
  indexer_secret_changed_flag:
  indexing_ready_flag:                       true
  initialized_flag:                          true
  maintenance_mode:                          false
  namespace_scoped_secret_resource_version:  23540
  Peers:
    active_bundle_id:  28AE1272BABED6776EC277C6C661CCCC
    bucket_count:      64
    GUID:              681F5B56-E71A-43F6-ABC1-A9C11B7D1A9E
    is_searchable:     true
    Name:              splunk-site4-39188-indexer-0
    Status:            Up
    active_bundle_id:  28AE1272BABED6776EC277C6C661CCCC
    bucket_count:      56
    GUID:              1C571956-06BF-43E1-B706-7626B15B0C32
    is_searchable:     true
    Name:              splunk-site4-39188-indexer-1
    Status:            Up
    active_bundle_id:  28AE1272BABED6776EC277C6C661CCCC
    bucket_count:      64
    GUID:              4564371D-1FC6-4A9C-A5D2-18A26DB70783
    is_searchable:     true
    Name:              splunk-site4-39188-indexer-2
    Status:            Up
  Phase:               Error
  Ready Replicas:      3
  Replicas:            4
  Selector:            app.kubernetes.io/instance=splunk-site4-39188-indexer
  service_ready_flag:  true
Events:                <none>

This doesn't give any error code, just Phase: Error.
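
The only numeric hint in that status is the gap between Replicas and Ready Replicas. The sketch below pulls those three fields out of `kubectl describe` output so they can be compared at a glance. `phase_summary` is a hypothetical helper name, and the field labels it matches are simply the ones printed above; it does not explain why the phase is Error.

```shell
# Summarise Phase and replica counts from `kubectl describe` output on stdin.
# Matches the labels "Phase:", "Ready Replicas:" and "Replicas:" as they
# appear in the status section above. "Cluster Manager Phase:" is ignored
# because its first word is "Cluster", not "Phase:".
phase_summary() {
  awk '$1 == "Phase:"                     { phase = $2 }
       $1 == "Ready" && $2 == "Replicas:" { ready = $3 }
       $1 == "Replicas:"                  { replicas = $2 }
       END { printf "phase=%s ready=%s replicas=%s\n", phase, ready, replicas }'
}

# Example (requires cluster access):
#   kubectl describe indexerclusters.enterprise.splunk.com site4-39188 -n splunk-operator | phase_summary
```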

Expected behavior: There is an article with a troubleshooting section that gives advice on how to check statuses.

yaroslav-nakonechnikov commented 9 months ago

So, I tried downgrading from 2.5.1 (and 2.5.0) to 2.4.0, and it worked, which was not expected at all.

vivekr-splunk commented 8 months ago

Hi @yaroslav-nakonechnikov, when you say it worked after downgrading to 2.4.0, can you elaborate on what worked? Are you saying the phase changed to "Ready"?

yaroslav-nakonechnikov commented 8 months ago

@vivekr-splunk yes, the phase status became Ready, and site4 was created again, as expected.

yaroslav-nakonechnikov commented 8 months ago

This is becoming critical.

The CRs are created:

[yn@ip-10-216-35-48 /]$ kubectl get indexerclusters -n splunk-operator
NAME          PHASE   MASTER   MANAGER   DESIRED   READY   AGE
site1-32002   Error            Ready     3         0       77m
site2-32002   Error            Ready     3         0       77m
site3-32002   Error            Ready     3         0       77m
site4-32002   Error            Ready     3         0       77m
site5-32002   Error            Ready     3         0       77m
site6-32002   Error            Ready     3         5       5d4h

In the manager I see the following logs:

2024-03-19T13:49:32.244551457Z  INFO    start   {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "indexercluster": "splunk-operator/site5-32002", "CR version": "4044269"}
2024-03-19T13:49:32.244647588Z  INFO    ApplyConfigMap  No changes for ConfigMap        {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "name": "splunk-site5-32002-indexer-defaults", "namespace": "splunk-operator"}
2024-03-19T13:49:32.413422389Z  INFO    ApplyService    No update to existing Service   {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "name": "splunk-site5-32002-indexer-headless", "namespace": "splunk-operator"}
2024-03-19T13:49:32.413454413Z  INFO    ApplyService    No update to existing Service   {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "name": "splunk-site5-32002-indexer-service", "namespace": "splunk-operator"}
2024-03-19T13:49:32.4136215Z    INFO    ApplyConfigMap  No changes for ConfigMap        {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "name": "splunk-splunk-operator-probe-configmap", "namespace": "splunk-operator"}
2024-03-19T13:49:32.413793087Z  INFO    getLivenessProbe        LivenessProbe   {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "name": "site5-32002", "namespace": "splunk-operator", "Configured": "&Probe{ProbeHandler:ProbeHandler{Exec:&ExecAction{Command:[/mnt/probes/livenessProbe.sh],},HTTPGet:nil,TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:240,TimeoutSeconds:30,PeriodSeconds:30,SuccessThreshold:0,FailureThreshold:60,TerminationGracePeriodSeconds:nil,}"}
2024-03-19T13:49:32.413815103Z  INFO    getReadinessProbe       ReadinessProbe  {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "name": "site5-32002", "namespace": "splunk-operator", "Configured": "&Probe{ProbeHandler:ProbeHandler{Exec:&ExecAction{Command:[/mnt/probes/readinessProbe.sh],},HTTPGet:nil,TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:240,TimeoutSeconds:30,PeriodSeconds:30,SuccessThreshold:0,FailureThreshold:60,TerminationGracePeriodSeconds:nil,}"}
2024-03-19T13:49:32.41382608Z   INFO    getStartupProbe StartupProbe    {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "name": "site5-32002", "namespace": "splunk-operator", "Configured": "&Probe{ProbeHandler:ProbeHandler{Exec:&ExecAction{Command:[/mnt/probes/startupProbe.sh],},HTTPGet:nil,TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:40,TimeoutSeconds:30,PeriodSeconds:60,SuccessThreshold:0,FailureThreshold:60,TerminationGracePeriodSeconds:nil,}"}
2024-03-19T13:49:32.41399514Z   INFO    isClusterManagerReadyForUpgrade kind is set to  {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "name": "site5-32002", "namespace": "splunk-operator", "kind": "IndexerCluster"}
2024-03-19T13:49:32.414144671Z  INFO    updateCRStatus  Trying to update        {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "original cr version": "4044269", "count": 0}
2024-03-19T13:49:32.428603698Z  INFO    updateCRStatus  Status update successful        {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "original cr version": "4044269", "current CR version": "4044269", "updated CR version":"4044269"}
2024-03-19T13:49:32.428634239Z  INFO    updateCRStatus  Cache is reflecting the latest CR       {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "original cr version": "4044269", "updated CR version": "4044269"}
2024-03-19T13:49:32.428639088Z  INFO    Requeued        {"controller": "indexercluster", "controllerGroup": "enterprise.splunk.com", "controllerKind": "IndexerCluster", "IndexerCluster": {"name":"site5-32002","namespace":"splunk-operator"}, "namespace": "splunk-operator", "name": "site5-32002", "reconcileID": "34cb7d35-7ea9-4608-9181-0380bf16d8aa", "indexercluster": "splunk-operator/site5-32002", "period(seconds)": 5}

But the StatefulSet is not created:

[yn@ip-10-216-35-48 /]$ kubectl get statefulset -n splunk-operator
NAME                              READY   AGE
splunk-32002-cluster-manager      1/1     41m
splunk-32002-license-manager      1/1     62m
splunk-32002-monitoring-console   0/1     24m
[yn@ip-10-216-35-48 /]$

Why? What is wrong?
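
The comparison done by eye above can be scripted. This is a minimal sketch under one assumption: that each IndexerCluster owns a StatefulSet named `splunk-<cr-name>-indexer`, which is inferred from the pod names and selector shown earlier in this thread rather than from the operator's documentation. `missing_statefulsets` is a hypothetical helper name.

```shell
# List IndexerCluster CRs that have no matching StatefulSet.
#   $1: file containing `kubectl get indexerclusters` output
#   $2: file containing `kubectl get statefulset` output
# Assumption: the StatefulSet is named splunk-<cr-name>-indexer.
missing_statefulsets() {
  awk 'FNR == 1 { next }                              # skip each header row
       FILENAME == ARGV[1] { have[$1] = 1; next }     # first file: StatefulSets
       {                                              # second file: CRs
         key = "splunk-" $1 "-indexer"
         if (!(key in have)) print $1
       }' "$2" "$1"
}

# Example (requires cluster access):
#   kubectl get indexerclusters -n splunk-operator > /tmp/idxc.txt
#   kubectl get statefulset -n splunk-operator > /tmp/sts.txt
#   missing_statefulsets /tmp/idxc.txt /tmp/sts.txt
```

The files are distinguished by name rather than by the common `NR == FNR` idiom, so an empty StatefulSet listing still reports every CR as missing.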

vivekr-splunk commented 7 months ago

@yaroslav-nakonechnikov, we acknowledge that error codes and messages are not clearly documented. We're currently planning to revamp the error handling process and include error messages in the status to provide a clearer explanation of why the reconciliation fails. We'll keep you informed once this is completed.

akondur commented 4 months ago

Hey @yaroslav-nakonechnikov, this MR adds a message field to the CR status section with details of the error. Can you please try it and let us know if this solution works?

yaroslav-nakonechnikov commented 4 months ago

I believe this can be closed for now, as the documentation has been updated. If new questions come up, a new ticket will be raised.

yaroslav-nakonechnikov commented 3 weeks ago

Yes, now it is much better!

$ kubectl get indexerclusters.enterprise.splunk.com -n splunk-operator
NAME          PHASE   MASTER   MANAGER   DESIRED   READY   AGE   MESSAGE
site1-43345   Error            Ready     3         0       16m   StatefulSet.apps "splunk-e-43345-search-head" not found
site2-43345   Error            Ready     3         0       16m   StatefulSet.apps "splunk-e-43345-search-head" not found
site3-43345   Error            Error     3         0       16m   StatefulSet.apps "splunk-43345-cluster-manager" not found
site4-43345   Error            Ready     3         0       16m   StatefulSet.apps "splunk-e-43345-search-head" not found
site5-43345   Error            Error     3         0       16m   StatefulSet.apps "splunk-43345-cluster-manager" not found
site6-43345   Error            Ready     3         0       16m   StatefulSet.apps "splunk-e-43345-search-head" not found

$ kubectl get indexerclusters.enterprise.splunk.com -n splunk-operator
NAME          PHASE   MASTER   MANAGER   DESIRED   READY   AGE   MESSAGE
site1-43345   Error            Ready     3         0       16m
site2-43345   Error            Ready     3         0       16m
site3-43345   Error            Error     3         0       16m   StatefulSet.apps "splunk-43345-cluster-manager" not found
site4-43345   Error            Ready     3         0       16m
site5-43345   Error            Error     3         0       16m   StatefulSet.apps "splunk-43345-cluster-manager" not found
site6-43345   Error            Ready     3         0       16m

$ kubectl get indexerclusters.enterprise.splunk.com -n splunk-operator
NAME          PHASE   MASTER   MANAGER   DESIRED   READY   AGE   MESSAGE
site1-43345   Error            Ready     3         0       16m   could not get cluster info from cluster manager
site2-43345   Error            Ready     3         0       16m   could not get cluster info from cluster manager
site3-43345   Error            Error     3         0       17m   StatefulSet.apps "splunk-43345-cluster-manager" not found
site4-43345   Error            Ready     3         0       16m   could not get cluster info from cluster manager
site5-43345   Error            Error     3         0       17m   StatefulSet.apps "splunk-43345-cluster-manager" not found
site6-43345   Error            Ready     3         0       16m   could not get cluster info from cluster manager
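
With many sites, it can be handy to group the rows by message. A minimal sketch: `message_counts` is a hypothetical helper name, and it assumes kubectl pads the table so that each message starts at the same character position as the MESSAGE header, which holds for the output above.

```shell
# Count rows of a `kubectl get` table per MESSAGE value, most frequent first.
# The table is read from stdin. The MESSAGE text may contain spaces, so it is
# taken by character offset from the header rather than by field number.
message_counts() {
  awk 'NR == 1 { col = index($0, "MESSAGE"); next }
       col > 0 {
         msg = substr($0, col)
         sub(/[ \t]+$/, "", msg)              # trim trailing padding
         if (msg == "") msg = "(no message)"
         count[msg]++
       }
       END { for (m in count) print count[m] "\t" m }' | sort -rn
}

# Example (requires cluster access):
#   kubectl get indexerclusters.enterprise.splunk.com -n splunk-operator | message_counts
```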