splunk / splunk-operator

Splunk Operator for Kubernetes

Splunk Operator: something breaking local config files on pod restart #1212

Closed yaroslav-nakonechnikov closed 11 months ago

yaroslav-nakonechnikov commented 1 year ago

Please select the type of request

Bug

Tell us more

Describe the request: From time to time we see strange behavior where config files that were pushed through default.yml are broken after a pod restart.

[splunk@splunk-prod-cluster-manager-0 splunk]$ cat /opt/splunk/etc/system/local/authentication.conf

[authentication]
authSettings = saml
authType = SAML
authSettings
authType

[saml]
entityId = splunkACSEntityId
fqdn = https://cm.fqdn.cloud
idpSSOUrl = https://idp.fqdn.com/idp/SSO.saml2
inboundDigestMethod = SHA1;SHA256;SHA384;SHA512
inboundSignatureAlgorithm = RSA-SHA1;RSA-SHA256;RSA-SHA384;RSA-SHA512
issuerId = idp:fqdn.com:saml2
lockRoleToFullDN = True
redirectAfterLogoutToUrl = https://www.splunk.com
redirectPort = 443
replicateCertificates = True
signAuthnRequest = True
signatureAlgorithm = RSA-SHA1
signedAssertion = True
sloBinding = HTTP-POST
ssoBinding = HTTP-POST
clientCert = /mnt/certs/saml_sig.pem
idpCertPath = /mnt/certs/
entityId
fqdn
idpSSOUrl
inboundDigestMethod
inboundSignatureAlgorithm
issuerId
lockRoleToFullDN
redirectAfterLogoutToUrl
redirectPort
replicateCertificates
signAuthnRequest
signatureAlgorithm
signedAssertion
sloBinding
ssoBinding
clientCert
idpCertPath

[roleMap_SAML]
admin = ldap-group-a
cloudgateway = ldap-group-b
dashboard = ldap-group-c
ess_admin = ldap-group-d
ess_analyst = ldap-group-e;ldap-group-f;ldap-group-g
...
splunk_soc_l1_l2 = ldap-group-y
splunk_soc_l3 = ldap-group-x
admin
cloudgateway
dashboard
ess_admin
ess_analyst
...
splunk_soc_l1_l2
splunk_soc_l3

So the list of keys was duplicated, with the duplicates carrying no values.
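For illustration, the corruption pattern can be reproduced in a scratch file and counted with a one-line grep. This is just a sketch: the stanza content is copied from the authentication.conf dump above, and the regex is a heuristic for "lone identifier on a line, i.e. a key with no value".

```shell
# Write the duplicated-key pattern from the dump above into a scratch file,
# then count the bare keys (lines that are a lone identifier, no "=").
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
[authentication]
authSettings = saml
authType = SAML
authSettings
authType
EOF
count=$(grep -cE '^[A-Za-z_][A-Za-z0-9_.]*$' "$tmp")
echo "bare keys: $count"   # -> bare keys: 2
rm -f "$tmp"
```

A healthy file should report 0; any nonzero count means keys were appended without values.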

Here is a configmap:

[yn@ip-10-224-31-36 /]$ kubectl get configmap splunk-prod-indexer-defaults -o yaml
apiVersion: v1
data:
  default.yml: |-
    splunk:
      site: site1
      multisite_master: localhost
      all_sites: site1,site2,site3,site4,site5,site6
      multisite_replication_factor_origin: 1
      multisite_replication_factor_total: 3
      multisite_search_factor_origin: 1
      multisite_search_factor_total: 3
      idxc:
        # search_factor: 3
        # replication_factor: 3
        app_paths_install:
          default:
            - https://path.to.app/config-explorer_1715.tgz
        apps_location:
          - https://path.to.app/config-explorer_1715.tgz
      app_paths:
        idxc: "/opt/splunk/etc/manager-apps"
      app_paths_install:
        default:
          - https://path.to.app/config-explorer_1715.tgz
        idxc:
          - https://path.to.app/cmp_indexer_indexes.tgz
          - https://path.to.app/cmp_resmonitor.tgz
          - https://path.to.app/cmp_soar_indexes.tgz
      conf:
        - key: server
          value:
            directory: /opt/splunk/etc/system/local
            content:
              imds:
                imds_version: v2
        - key: deploymentclient
          value:
            directory: /opt/splunk/etc/system/local
            content:
              deployment-client :
                disabled : false
              target-broker:deploymentServer :
                targetUri : ds.shared.cmp-a.internal.cmpgroup.cloud:8089
        - key: web
          value:
            directory: /opt/splunk/etc/system/local
            content:
              settings:
                enableSplunkWebSSL: true
        - key: authentication
          value:
            directory: /opt/splunk/etc/system/local
            content:
              authentication:
                authSettings : saml
                authType : SAML
              saml:
                entityId : splunkACSEntityId
                fqdn : https://cm.fqdn.cloud
                idpSSOUrl : https://idp.fqdn.com/idp/SSO.saml2
                inboundDigestMethod : SHA1;SHA256;SHA384;SHA512
                inboundSignatureAlgorithm : RSA-SHA1;RSA-SHA256;RSA-SHA384;RSA-SHA512
                issuerId : idp:fqdn.com:saml2
                lockRoleToFullDN : true
                redirectAfterLogoutToUrl : https://www.splunk.com
                redirectPort : 443
                replicateCertificates : true
                signAuthnRequest : true
                signatureAlgorithm : RSA-SHA1
                signedAssertion : true
                sloBinding : HTTP-POST
                ssoBinding : HTTP-POST
                clientCert : /mnt/certs/saml_sig.pem
                idpCertPath: /mnt/certs/
              roleMap_SAML:
                admin : ldap-group-a
                cloudgateway : ldap-group-b
                dashboard : ldap-group-c
                ess_admin : ldap-group-d
                ess_analyst : ldap-group-e;ldap-group-f;ldap-group-g
                ...
                splunk_soc_l1_l2 : ldap-group-y
                splunk_soc_l3 : ldap-group-x
        - key: authorize
          value:
            directory: /opt/splunk/etc/system/local
            content:
              role_admin:
                run_script_adhocremotesearchraw : enabled
                run_script_adhocremotesearch : enabled
                run_script_environmentpoller : enabled
                run_script_sleepy : enabled
kind: ConfigMap
metadata:
  creationTimestamp: "2023-02-24T16:53:17Z"
  name: splunk-prod-indexer-defaults
  namespace: splunk-operator
  ownerReferences:
  - apiVersion: enterprise.splunk.com/v4
    controller: true
    kind: ClusterManager
    name: prod
    uid: 84aa7496-eb5a-4ffb-9549-c42f7780450e
  resourceVersion: "95698835"
  uid: 47b70fd9-0398-4aa0-ace5-20a5ac9d4842

Expected behavior: default.yml renders the same way on each run, without issues.

Splunk setup on K8S: EKS 1.27, Splunk Operator 2.3.0, Splunk 9.1.0.2

Reproduction/Testing steps: after an unexpected pod restart, the new pod starts with a broken config.

yaroslav-nakonechnikov commented 1 year ago

The same thing happened in etc/system/local/server.conf:

[splunk@splunk-prod-cluster-manager-0 splunk]$ cat etc/system/local/server.conf | grep "\[imds\]" -A 3
[imds]
imds_version = v2
imds_version

and in etc/system/local/web.conf:

[splunk@splunk-prod-cluster-manager-0 splunk]$ cat etc/system/local/web.conf | grep "\[settings\]" -A 3
[settings]
mgmtHostPort = 0.0.0.0:8089
enableSplunkWebSSL = True
enableSplunkWebSSL

So every file that was defined in the conf section is broken.
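As a quick triage step (a sketch, not part of the operator), one can scan a directory of .conf files for value-less keys. The directory and file contents below are synthetic stand-ins for /opt/splunk/etc/system/local, mirroring the web.conf and server.conf dumps above:

```shell
# Build a scratch "system/local" with one intact file and one corrupted file,
# then flag the files that contain bare (value-less) keys.
dir=$(mktemp -d)
printf '[settings]\nenableSplunkWebSSL = true\n' > "$dir/web.conf"
printf '[imds]\nimds_version = v2\nimds_version\n' > "$dir/server.conf"
broken=""
for f in "$dir"/*.conf; do
  # A lone identifier on a line is a key that lost its value.
  grep -qE '^[A-Za-z_][A-Za-z0-9_.]*$' "$f" && broken="$broken $(basename "$f")"
done
echo "broken:$broken"   # -> broken: server.conf
rm -rf "$dir"
```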

yaroslav-nakonechnikov commented 1 year ago

kubectl delete pod initiates recreation of the pod, and then everything seems fine. But we want to find the root cause, as this can happen anywhere!

yaroslav-nakonechnikov commented 1 year ago

An unmasked diag has been uploaded in case #3285863.

yaroslav-nakonechnikov commented 1 year ago

I found how I can replicate the issue: kill/stop the splunk process in the pod; after some time the liveness probe triggers a restart of the pod, and after that you'll see the broken config.
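The reproduction described above can be scripted roughly as follows. This is a sketch against a live cluster: the pod name and namespace are taken from earlier in this thread, and the commands assume exec access to the pod.

```shell
# Stop splunkd inside the pod so the liveness probe starts failing.
kubectl exec -n splunk-operator splunk-prod-cluster-manager-0 -- \
  /opt/splunk/bin/splunk stop
# Watch for kubelet to restart the container after repeated probe failures.
kubectl get pod -n splunk-operator splunk-prod-cluster-manager-0 -w
# After the restart, look for bare (value-less) keys in the regenerated file.
kubectl exec -n splunk-operator splunk-prod-cluster-manager-0 -- \
  grep -E '^[A-Za-z_][A-Za-z0-9_.]*$' /opt/splunk/etc/system/local/authentication.conf
```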

yaroslav-nakonechnikov commented 1 year ago

Reported upstream: https://github.com/splunk/splunk-ansible/issues/751

vivekr-splunk commented 1 year ago

@yaroslav-nakonechnikov we are looking into this issue now and will update you with our findings.

yaroslav-nakonechnikov commented 1 year ago

The issue still exists in 9.1.1.

vivekr-splunk commented 1 year ago

@yaroslav-nakonechnikov, we are working with the splunk-ansible team to fix this issue. We will update you once that is done.

yaroslav-nakonechnikov commented 1 year ago

Was it fixed?

vivekr-splunk commented 1 year ago

Hi @yaroslav-nakonechnikov, this fix didn't go into 9.1.1; it's planned for 9.1.2. We will update you once the release is complete.

yaroslav-nakonechnikov commented 11 months ago

@vivekr-splunk 9.1.2 was released, but still no news here. Is there any ETA?

vivekr-splunk commented 11 months ago

Hello @yaroslav-nakonechnikov, this is fixed in the 9.1.2 build.

yaroslav-nakonechnikov commented 11 months ago

I managed to test it, and yes, it looks like this is fixed. But see https://github.com/splunk/splunk-operator/issues/1260