splunk / splunk-operator

Splunk Operator for Kubernetes

Splunk Operator: indexers don't start with 9.1.2 #1260

Closed yaroslav-nakonechnikov closed 9 months ago

yaroslav-nakonechnikov commented 11 months ago

Please select the type of request

Bug

Tell us more

Describe the request: All nodes start as expected, but the indexers fail to come up.

Expected behavior: Everything works as it did before.

Splunk setup on K8s: EKS

Reproduction/Testing steps

K8s environment: 1.28

Additional context(optional)

yaroslav-nakonechnikov commented 11 months ago

After several tests, I can say what breaks.

In the cluster manager definition we had this:

 smartstore:
    defaults:
      maxGlobalDataSizeMB: 0
      maxGlobalRawDataSizeMB: 0
      volumeName: smartstore
    indexes:
    - hotlistBloomFilterRecencyHours: 1
      hotlistRecencySecs: 3600
      name: tf-test
      remotePath: tf-test/
      volumeName: smartstore
    volumes:
    - endpoint: https://s3-eu-central-1.amazonaws.com
      name: smartstore
      path: bucket-for-smart-store
      provider: aws
      region: eu-central-1
      storageType: s3

When we removed that block and recreated the CM and indexers, everything started to work.

The behavior is the same with splunk-operator versions 2.4.0 and latest.

yaroslav-nakonechnikov commented 11 months ago

Final tests show that the problem is in the defaults section.
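
Based on the report above, a minimal sketch of the same smartstore block with the defaults subsection dropped (the shape that reportedly worked once the CM and indexers were recreated; all field names taken from the config quoted earlier in this thread):

```yaml
smartstore:
  indexes:
  - hotlistBloomFilterRecencyHours: 1
    hotlistRecencySecs: 3600
    name: tf-test
    remotePath: tf-test/
    volumeName: smartstore
  volumes:
  - endpoint: https://s3-eu-central-1.amazonaws.com
    name: smartstore
    path: bucket-for-smart-store
    provider: aws
    region: eu-central-1
    storageType: s3
```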

yaroslav-nakonechnikov commented 11 months ago

Further investigation shows that splunk-operator creates these default settings:

[splunk@splunk-site1-indexer-0 splunk]$ bin/splunk btool indexes  list --debug | grep "\[default\]"
/opt/splunk/etc/peer-apps/splunk-operator/local/indexes.conf                 [default]
[splunk@splunk-site1-indexer-0 splunk]$ cat /opt/splunk/etc/peer-apps/splunk-operator/local/indexes.conf
[default]
repFactor = auto
maxDataSize = auto
homePath = $SPLUNK_DB/$_index_name/db
coldPath = $SPLUNK_DB/$_index_name/colddb
thawedPath = $SPLUNK_DB/$_index_name/thaweddb

[volume:smartstore]
storageType = remote
path = s3://bucket-for-smart-store
remote.s3.endpoint = https://s3-eu-central-1.amazonaws.com
remote.s3.auth_region = eu-central-1

and the indexer doesn't start with the definition from the CRD.

We also had some default settings defined in our custom-created app, and those break indexer startup too. So something changed that shouldn't have been touched.
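
Since the crash seems tied to settings landing in the [default] stanza, one hypothetical workaround sketch (not confirmed in this thread) is to scope the same settings to an explicit per-index stanza instead of [default], using standard indexes.conf syntax:

```ini
# Sketch only: move settings out of [default] into the concrete index stanza.
# Index name, volume name, and paths taken from the config quoted above.
[tf-test]
repFactor = auto
maxDataSize = auto
homePath = $SPLUNK_DB/tf-test/db
coldPath = $SPLUNK_DB/tf-test/colddb
thawedPath = $SPLUNK_DB/tf-test/thaweddb
remotePath = volume:smartstore/tf-test
```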

vivekr-splunk commented 11 months ago

Hello @yaroslav-nakonechnikov, are you using IRSA with PrivateLink?

yaroslav-nakonechnikov commented 11 months ago

@vivekr-splunk, no, we don't use PrivateLink.

The main point is that the same config was working fine with 9.1.1.

vivekr-splunk commented 11 months ago

Hello @yaroslav-nakonechnikov, this has been fixed in the upcoming 9.1.3 and 9.0.7 releases, and also in 9.2.1.

yaroslav-nakonechnikov commented 10 months ago

Still the same issue with 9.1.3:

FAILED - RETRYING: Restart the splunkd service - Via CLI (5 retries left).
FAILED - RETRYING: Restart the splunkd service - Via CLI (4 retries left).
FAILED - RETRYING: Restart the splunkd service - Via CLI (3 retries left).
FAILED - RETRYING: Restart the splunkd service - Via CLI (2 retries left).
FAILED - RETRYING: Restart the splunkd service - Via CLI (1 retries left).

RUNNING HANDLER [splunk_common : Restart the splunkd service - Via CLI] ********
fatal: [localhost]: FAILED! => {
    "attempts": 60,
    "changed": true,
    "cmd": [
        "/opt/splunk/bin/splunk",
        "restart",
        "--answer-yes",
        "--accept-license"
    ],
    "delta": "0:00:11.173687",
    "end": "2024-01-25 15:06:11.736729",
    "rc": 10,
    "start": "2024-01-25 15:06:00.563042"
}

STDOUT:

splunkd is not running.

Splunk> 4TW

Checking prerequisites...
        Checking mgmt port [8089]: open
        Checking kvstore port [8191]: open
        Checking configuration... Done.

STDERR:

ERROR: pid 5825 terminated with signal 11 (core dumped)
Validating databases (splunkd validatedb) failed with code '-1'.  If you cannot resolve the issue(s) above after consulting documentation, please file a case online at http://www.splunk.com/page/submit_issue

MSG:

non-zero return code
Thursday 25 January 2024  15:06:11 +0000 (0:22:44.302)       0:23:46.336 ******
Thursday 25 January 2024  15:06:11 +0000 (0:00:00.000)       0:23:46.336 ******
Thursday 25 January 2024  15:06:11 +0000 (0:00:00.000)       0:23:46.337 ******

PLAY RECAP *********************************************************************
localhost                  : ok=106  changed=20   unreachable=0    failed=1    skipped=67   rescued=0    ignored=0

Thursday 25 January 2024  15:06:11 +0000 (0:00:00.003)       0:23:46.341 ******
===============================================================================
splunk_common : Restart the splunkd service - Via CLI ---------------- 1364.30s
splunk_common : Restart the splunkd service - Via CLI ------------------ 18.39s
splunk_common : Set options in saml ------------------------------------- 6.26s
splunk_common : Set options in roleMap_SAML ----------------------------- 6.04s
splunk_common : Get Splunk status --------------------------------------- 1.43s
splunk_common : Set node as license slave ------------------------------- 1.17s
splunk_indexer : Update HEC token configuration ------------------------- 1.17s
Gathering Facts --------------------------------------------------------- 1.14s
splunk_indexer : Set current node as indexer cluster peer --------------- 1.12s
splunk_common : Update /opt/splunk/etc ---------------------------------- 0.97s
splunk_indexer : Setup Peers with Associated Site ----------------------- 0.97s
splunk_common : Set options in authentication --------------------------- 0.88s
splunk_common : Test basic https endpoint ------------------------------- 0.79s
splunk_indexer : Setup global HEC --------------------------------------- 0.70s
splunk_indexer : Check for required restarts ---------------------------- 0.68s
Check for required restarts --------------------------------------------- 0.67s
splunk_indexer : Get existing HEC token --------------------------------- 0.67s
splunk_indexer : Check Splunk instance is running ----------------------- 0.67s
splunk_indexer : Check Splunk instance is running ----------------------- 0.66s
splunk_common : Check Splunk instance is running ------------------------ 0.66s
yaroslav-nakonechnikov commented 9 months ago

That one looks fixed in 9.2.*, but I'm still testing.

fabiusgoh commented 9 months ago

Still hitting the same error on 9.2.0 and Splunk Operator 2.5.0.

yaroslav-nakonechnikov commented 9 months ago

@fabiusgoh, have you raised a ticket with Splunk support? May I ask for its number?

fabiusgoh commented 9 months ago

I have not raised a support ticket yet; I am in the midst of testing it on 9.1.3, as that is the officially supported version for the operator.

yaroslav-nakonechnikov commented 9 months ago

I can confirm that 9.2.0 and 9.2.0.1 start with our config, which wasn't working with 9.1.2 and 9.1.3.
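
Given that 9.2.0 is confirmed working above while 9.1.2–9.1.4 crash, one interim option is to pin a known-good Splunk image in the custom resource spec. A sketch, assuming the v4 API of a recent operator release; adjust kind, name, and version to your deployment:

```yaml
apiVersion: enterprise.splunk.com/v4
kind: ClusterManager
metadata:
  name: cm          # hypothetical name
spec:
  image: splunk/splunk:9.2.0   # version reported working in this thread
```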

vivekr-splunk commented 7 months ago

@yaroslav-nakonechnikov, As we discussed in our meeting, we now understand the issue. This problem arose due to the upgrade path we followed in the 2.5.0 release. Previously, we expected the search head clusters to be running before starting the indexers (if both indexers and SHC are pointing to the same CM). However, since the SHC had trouble starting, the indexers were never created. As agreed, we will modify the logic to start the indexers parallel to the search head. We'll keep you updated on our progress with these changes.

yaroslav-nakonechnikov commented 7 months ago

@vivekr-splunk, yep, I agree, it was an informative meeting. But this ticket is different: it is about Splunk logic itself (or splunk-ansible), which was fixed in the Splunk container starting from 9.2.0.

We were discussing: https://github.com/splunk/splunk-operator/issues/1293

yaroslav-nakonechnikov commented 7 months ago

Also, today I rechecked 9.1.4: it is not working either. So 9.1.1 is the last version that both works and is supported.

All others are broken or not supported.