splunk / splunk-operator

Splunk Operator for Kubernetes

Splunk Operator: Multisite cluster with PV on the indexers #1152

Open niraj-desai-sf opened 1 year ago

niraj-desai-sf commented 1 year ago

Please select the type of request

Bug

Tell us more

Describe the request

Expected behavior

Splunk setup on K8S

Reproduction/Testing steps

K8s environment

akondur commented 1 year ago

Hi @niraj-desai-sf, thanks for opening the issue. To debug further, could you please:

  1. Check if there are any stale PVCs and delete them before deploying the indexer cluster?
  2. Provide the config for the ClusterManager and all of the indexer clusters. Are you facing the issue on all of the indexer clusters?
  3. Explain the podAntiAffinity configuration, i.e., are you trying to schedule the IDXC on a node other than the one the CM is running on? As long as nodeAffinity has topology.kubernetes.io/zone set to the right zone, that should be enough to schedule the pods correctly (see the sketch below).
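
For reference, point 3 boils down to a nodeAffinity block along these lines in the IndexerCluster spec (a minimal sketch; the zone value is illustrative, not taken from this deployment):

    # Minimal sketch: pin the indexer pods of one site to a single zone.
    # The zone value below is illustrative.
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: topology.kubernetes.io/zone
                  operator: In
                  values:
                    - zone-a
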
niraj-desai-sf commented 1 year ago

@akondur I want to avoid 1, because we have data on the indexers right now and want to see if there is something else we can do to resolve it.

  2. CM config:

    apiVersion: enterprise.splunk.com/v4
    kind: ClusterManager
    metadata:
      name: {{ .Values.cmClusterName }}
      namespace: {{ .Values.namespace }}
    spec:
      defaults: |-
        splunk:
          site: site1
          multisite_master: localhost
          all_sites: site1,site2
          multisite_replication_factor_origin: 1
          multisite_replication_factor_total: 1
          multisite_search_factor_origin: 1
          multisite_search_factor_total: 1
          idxc:
            search_factor: 1
            replication_factor: 1
      extraEnv:
        - name: SPLUNK_HTTP_ENABLESSL
          value: "true"
        - name: PLATFORM_ENV_FI
          value: {{ .Values.platform_labels.p_environment }}
      serviceAccount: {{ .Values.serviceAccount }}
      monitoringConsoleRef:
        name: {{ .Values.mcClusterName }}
      licenseManagerRef:
        name: {{ .Values.lmClusterName }}
      varVolumeStorageConfig:
        storageCapacity: {{ .Values.cmVarVolumeSize }}
        storageClassName: {{ .Values.cmVarStorageClass }}
      etcVolumeStorageConfig:
        storageCapacity: {{ .Values.cmEtcVolumeSize }}
        storageClassName: {{ .Values.cmEtcStorageClass }}

Indexer config, Site 1:

    apiVersion: enterprise.splunk.com/v3
    kind: IndexerCluster
    metadata:
      name: {{ .Values.indexerClusterName }}
      namespace: {{ .Values.namespace }}
    spec:
      replicas: 5
      defaults: |-
        splunk:
          multisite_master: splunk-{{ .Values.cmClusterName }}-cluster-manager-service
          site: site1
      extraEnv:
        - name: SPLUNK_HTTP_ENABLESSL
          value: "true"
        - name: PLATFORM_ENV_FI
          value: {{ .Values.platform_labels.p_environment }}
      serviceAccount: {{ .Values.serviceAccount }}
      clusterManagerRef:
        name: {{ .Values.cmClusterName }}
      monitoringConsoleRef:
        name: {{ .Values.mcClusterName }}
      varVolumeStorageConfig:
        storageCapacity: {{ .Values.indexerVarVolumeSize }}
        storageClassName: {{ .Values.indexerVarStorageClass }}
      etcVolumeStorageConfig:
        storageCapacity: {{ .Values.indexerEtcVolumeSize }}
        storageClassName: {{ .Values.indexerEtcStorageClass }}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - zone-a
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: "app.kubernets.io/instance"
                      operator: In
                      values:
                        - "splunk-{{ .Values.cmClusterName }}-cluster-manager"
                topologyKey: "kubernetes.io/hostname"
              weight: 100

Site 2

    apiVersion: enterprise.splunk.com/v3
    kind: IndexerCluster
    metadata:
      name: {{ .Values.indexerClusterNameSite2 }}
      namespace: {{ .Values.namespace }}
    spec:
      replicas: 5
      defaults: |-
        splunk:
          multisite_master: splunk-{{ .Values.cmClusterName }}-cluster-manager-service
          site: site2
      extraEnv:
        - name: SPLUNK_HTTP_ENABLESSL
          value: "true"
        - name: PLATFORM_ENV_FI
          value: {{ .Values.platform_labels.p_environment }}
      serviceAccount: {{ .Values.serviceAccount }}
      clusterManagerRef:
        name: {{ .Values.cmClusterName }}
      monitoringConsoleRef:
        name: {{ .Values.mcClusterName }}
      varVolumeStorageConfig:
        storageCapacity: {{ .Values.indexerVarVolumeSize }}
        storageClassName: {{ .Values.indexerVarStorageClass }}
      etcVolumeStorageConfig:
        storageCapacity: {{ .Values.indexerEtcVolumeSize }}
        storageClassName: {{ .Values.indexerEtcStorageClass }}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - zone-b
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: "app.kubernets.io/instance"
                      operator: In
                      values:
                        - "splunk-{{ .Values.cmClusterName }}-cluster-manager"
                topologyKey: "kubernetes.io/hostname"
              weight: 100

Yes, the issue is on both indexer clusters.

  3. The podAntiAffinity is meant to keep the indexer pods from being scheduled on the node running the CM, but that doesn't seem to work either.
niraj-desai-sf commented 1 year ago

Also, I am not running the Splunk Operator on a PV; I hope that doesn't have any impact on this.

akondur commented 1 year ago

Hi @niraj-desai-sf,

The Splunk Operator PV has no impact here. The configuration for the indexer clusters looks fine. A few comments here:

  1. The podAntiAffinity is not required, since nodeAffinity with nodeSelectorTerms is sufficient.
  2. Please make sure that the topology.kubernetes.io/zone label actually carries the values zone-a and zone-b on the nodes, using the following command: kubectl describe nodes | grep -i topology.kubernetes.io/zone
  3. If the CM can run on any node, it shouldn't need any nodeAffinity or podAntiAffinity.

The problem may be that the newly deployed indexer pods are trying to attach to the older PVCs, whose volumes sit on a node different from the one the pod is scheduled to run on. In other words, you may be running into a volume node affinity conflict.
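
For context, a zonal PersistentVolume is pinned to its zone through a nodeAffinity of its own, so a pod scheduled into a different zone cannot attach it. A minimal illustrative sketch of such a PV (the name, driver, volume ID, and zone below are placeholders, not values from this deployment):

    # Illustrative only: placeholder names and values.
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: example-indexer-var-pv
    spec:
      capacity:
        storage: 100Gi
      accessModes:
        - ReadWriteOnce
      csi:
        driver: ebs.csi.aws.com         # placeholder CSI driver
        volumeHandle: vol-0123456789ab  # placeholder volume ID
      nodeAffinity:
        required:
          nodeSelectorTerms:
            - matchExpressions:
                - key: topology.kubernetes.io/zone
                  operator: In
                  values:
                    - zone-a

You can check the equivalent section on the PVs backing the existing PVCs with kubectl get pv <pv_name> -o yaml and compare the zone there against the zone the pod is being scheduled into.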

To debug further, please provide:

  1. Output of kubectl describe pod <name_of_indexer_pod>. Please provide this for one indexer pod from both clusters.
  2. Output of kubectl describe pvc <stale_pvc_of_indexer_pod>. Please provide this for one stale PVC per indexer pod from both clusters.

If this is the case and all of the stale PVCs are on the same node, then we will not be able to reuse the stale PVCs for a multisite deployment (for obvious reasons). You could deploy the new indexer clusters with different names so that new PVCs are created, and then copy the data over manually from the older stale PVCs, as sketched below.
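
If it does come to copying the data across manually, one possible approach is a short-lived pod that mounts both the old and the new PVC and copies the contents over. This is a rough sketch only: the PVC names are placeholders, it assumes both PVCs can be attached in the same zone, and the indexer pods that own the PVCs should be scaled down first (ReadWriteOnce volumes):

    # Rough sketch with placeholder PVC names; run it while the owning
    # indexer pods are stopped, and repeat for each etc/var PVC pair.
    apiVersion: v1
    kind: Pod
    metadata:
      name: indexer-pvc-copy
      namespace: <namespace>
    spec:
      restartPolicy: Never
      containers:
        - name: copy
          image: busybox:1.36
          # Copy everything from the stale volume to the new one, preserving attributes.
          command: ["sh", "-c", "cp -a /old/. /new/ && echo copy complete"]
          volumeMounts:
            - name: old-var
              mountPath: /old
            - name: new-var
              mountPath: /new
      volumes:
        - name: old-var
          persistentVolumeClaim:
            claimName: <stale_var_pvc_of_old_indexer_pod>
        - name: new-var
          persistentVolumeClaim:
            claimName: <var_pvc_of_new_indexer_pod>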

akondur commented 1 year ago

Hey @niraj-desai-sf , please let us know if the issue has been resolved with the above recommendations.

niraj-desai-sf commented 1 year ago

@akondur sorry, I couldn't follow up on this earlier. We are trying to clear up the PVCs to see if that was the problem. I will update once we have tested the use case.