Closed yaroslav-nakonechnikov closed 8 months ago
So, after several tests, I can say what breaks.
In the cluster manager definition we had this:
smartstore:
  defaults:
    maxGlobalDataSizeMB: 0
    maxGlobalRawDataSizeMB: 0
    volumeName: smartstore
  indexes:
    - hotlistBloomFilterRecencyHours: 1
      hotlistRecencySecs: 3600
      name: tf-test
      remotePath: tf-test/
      volumeName: smartstore
  volumes:
    - endpoint: https://s3-eu-central-1.amazonaws.com
      name: smartstore
      path: bucket-for-smart-store
      provider: aws
      region: eu-central-1
      storageType: s3
When we removed that block and recreated the CM and indexers, everything started to work.
The behavior is the same with splunk-operator version 2.4.0 and with the latest version.
Final tests show that the problem is in the defaults section.
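A minimal sketch of that workaround, assuming the cluster manager spec has already been loaded into a plain Python dict. The function name and the `cm_spec` variable are illustrative only and are not part of splunk-operator:

```python
import copy

def without_smartstore_defaults(spec):
    """Return a deep copy of a spec dict with smartstore.defaults removed."""
    trimmed = copy.deepcopy(spec)
    # pop() with a default keeps this safe when the section is already absent.
    trimmed.get("smartstore", {}).pop("defaults", None)
    return trimmed

# Values copied from the definition above.
cm_spec = {
    "smartstore": {
        "defaults": {
            "maxGlobalDataSizeMB": 0,
            "maxGlobalRawDataSizeMB": 0,
            "volumeName": "smartstore",
        },
        "indexes": [{"name": "tf-test", "remotePath": "tf-test/", "volumeName": "smartstore"}],
        "volumes": [{"name": "smartstore", "path": "bucket-for-smart-store", "storageType": "s3"}],
    }
}

trimmed_spec = without_smartstore_defaults(cm_spec)
```

The indexes and volumes sections are left untouched; only the defaults section is dropped, and the original dict is not modified.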
Further investigation shows that splunk-operator creates these default settings:
[splunk@splunk-site1-indexer-0 splunk]$ bin/splunk btool indexes list --debug | grep "\[default\]"
/opt/splunk/etc/peer-apps/splunk-operator/local/indexes.conf [default]
[splunk@splunk-site1-indexer-0 splunk]$ cat /opt/splunk/etc/peer-apps/splunk-operator/local/indexes.conf
[default]
repFactor = auto
maxDataSize = auto
homePath = $SPLUNK_DB/$_index_name/db
coldPath = $SPLUNK_DB/$_index_name/colddb
thawedPath = $SPLUNK_DB/$_index_name/thaweddb
[volume:smartstore]
storageType = remote
path = s3://bucket-for-smart-store
remote.s3.endpoint = https://s3-eu-central-1.amazonaws.com
remote.s3.auth_region = eu-central-1
and this doesn't work with the definition from the CRD.
Also, we had some default settings defined in our own custom app, and that also breaks indexer startup. So something changed that shouldn't have been touched.
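To see which app each setting of a stanza comes from (for example, the operator app versus a custom app), the `btool ... list --debug` output can be grouped by file. This is a rough sketch, assuming each debug line is a conf file path followed by whitespace and then a stanza header or a `key = value` setting, and that paths contain no spaces. The function name and the `my-custom-app` path in the sample are hypothetical:

```python
def settings_by_file(btool_debug_output, stanza):
    """Return {setting_name: conf_file} for one stanza of btool --debug output."""
    result = {}
    current = None
    for line in btool_debug_output.splitlines():
        # Assumed line shape: "<conf file path> <stanza header or setting>".
        path, _, rest = line.strip().partition(" ")
        rest = rest.strip()
        if not rest:
            continue
        if rest.startswith("[") and rest.endswith("]"):
            current = rest[1:-1]
        elif current == stanza and "=" in rest:
            key = rest.split("=", 1)[0].strip()
            result[key] = path
    return result

# Illustrative input: the first line matches the output shown above,
# the my-custom-app line is invented for the example.
sample = """\
/opt/splunk/etc/peer-apps/splunk-operator/local/indexes.conf [default]
/opt/splunk/etc/peer-apps/splunk-operator/local/indexes.conf repFactor = auto
/opt/splunk/etc/peer-apps/my-custom-app/local/indexes.conf maxDataSize = auto
/opt/splunk/etc/peer-apps/splunk-operator/local/indexes.conf [volume:smartstore]
/opt/splunk/etc/peer-apps/splunk-operator/local/indexes.conf storageType = remote
"""

default_sources = settings_by_file(sample, "default")
```

If more than one file shows up in the result for `default`, that is a hint that several apps are defining defaults at the same time.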
Hello @yaroslav-nakonechnikov, are you using IRSA with PrivateLink?
@vivekr-splunk, no, we don't use PrivateLink.
The main point is that the same config was working fine with 9.1.1.
Hello @yaroslav-nakonechnikov, this has been fixed in the upcoming releases 9.1.3 and 9.0.7, and also in 9.2.1.
Still the same issue with 9.1.3:
FAILED - RETRYING: Restart the splunkd service - Via CLI (5 retries left).
FAILED - RETRYING: Restart the splunkd service - Via CLI (4 retries left).
FAILED - RETRYING: Restart the splunkd service - Via CLI (3 retries left).
FAILED - RETRYING: Restart the splunkd service - Via CLI (2 retries left).
FAILED - RETRYING: Restart the splunkd service - Via CLI (1 retries left).
RUNNING HANDLER [splunk_common : Restart the splunkd service - Via CLI] ********
fatal: [localhost]: FAILED! => {
    "attempts": 60,
    "changed": true,
    "cmd": [
        "/opt/splunk/bin/splunk",
        "restart",
        "--answer-yes",
        "--accept-license"
    ],
    "delta": "0:00:11.173687",
    "end": "2024-01-25 15:06:11.736729",
    "rc": 10,
    "start": "2024-01-25 15:06:00.563042"
}
STDOUT:
splunkd is not running.
Splunk> 4TW
Checking prerequisites...
Checking mgmt port [8089]: open
Checking kvstore port [8191]: open
Checking configuration... Done.
STDERR:
ERROR: pid 5825 terminated with signal 11 (core dumped)
Validating databases (splunkd validatedb) failed with code '-1'. If you cannot resolve the issue(s) above after consulting documentation, please file a case online at http://www.splunk.com/page/submit_issue
MSG:
non-zero return code
Thursday 25 January 2024 15:06:11 +0000 (0:22:44.302) 0:23:46.336 ******
Thursday 25 January 2024 15:06:11 +0000 (0:00:00.000) 0:23:46.336 ******
Thursday 25 January 2024 15:06:11 +0000 (0:00:00.000) 0:23:46.337 ******
PLAY RECAP *********************************************************************
localhost : ok=106 changed=20 unreachable=0 failed=1 skipped=67 rescued=0 ignored=0
Thursday 25 January 2024 15:06:11 +0000 (0:00:00.003) 0:23:46.341 ******
===============================================================================
splunk_common : Restart the splunkd service - Via CLI ---------------- 1364.30s
splunk_common : Restart the splunkd service - Via CLI ------------------ 18.39s
splunk_common : Set options in saml ------------------------------------- 6.26s
splunk_common : Set options in roleMap_SAML ----------------------------- 6.04s
splunk_common : Get Splunk status --------------------------------------- 1.43s
splunk_common : Set node as license slave ------------------------------- 1.17s
splunk_indexer : Update HEC token configuration ------------------------- 1.17s
Gathering Facts --------------------------------------------------------- 1.14s
splunk_indexer : Set current node as indexer cluster peer --------------- 1.12s
splunk_common : Update /opt/splunk/etc ---------------------------------- 0.97s
splunk_indexer : Setup Peers with Associated Site ----------------------- 0.97s
splunk_common : Set options in authentication --------------------------- 0.88s
splunk_common : Test basic https endpoint ------------------------------- 0.79s
splunk_indexer : Setup global HEC --------------------------------------- 0.70s
splunk_indexer : Check for required restarts ---------------------------- 0.68s
Check for required restarts --------------------------------------------- 0.67s
splunk_indexer : Get existing HEC token --------------------------------- 0.67s
splunk_indexer : Check Splunk instance is running ----------------------- 0.67s
splunk_indexer : Check Splunk instance is running ----------------------- 0.66s
splunk_common : Check Splunk instance is running ------------------------ 0.66s
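For reading the STDERR part of the log above: signal 11 is SIGSEGV on Linux, so the process started during database validation appears to have crashed with a segmentation fault rather than reporting a configuration error. A small sketch for pulling the pid and signal name out of such a line; the function name is made up for illustration:

```python
import re
import signal

def describe_crash(stderr_text):
    """Return (pid, signal_name) for the first 'terminated with signal N' match, or None."""
    match = re.search(r"pid (\d+) terminated with signal (\d+)", stderr_text)
    if match is None:
        return None
    pid, signum = int(match.group(1)), int(match.group(2))
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Unknown signal number on this platform: fall back to the raw number.
        name = f"signal {signum}"
    return pid, name

# Line copied from the STDERR section of the log above.
stderr_text = "ERROR: pid 5825 terminated with signal 11 (core dumped)"
crash = describe_crash(stderr_text)
```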
That one looks like it is fixed in 9.2.*, but I'm still testing.
Still hitting the same error on 9.2.0 and Splunk Operator 2.5.0.
@fabiusgoh, have you raised a ticket with Splunk support? May I ask you for its number?
I have not raised a support ticket yet. I am in the midst of testing it on 9.1.3, as it is the officially supported version for the operator.
I can confirm that 9.2 and 9.2.0.1 start with our config, which wasn't working with 9.1.2 and 9.1.3.
@yaroslav-nakonechnikov, as we discussed in our meeting, we now understand the issue. This problem arose due to the upgrade path we followed in the 2.5.0 release. Previously, we expected the search head clusters to be running before starting the indexers (if both the indexers and the SHC point to the same CM). However, since the SHC had trouble starting, the indexers were never created. As agreed, we will modify the logic to start the indexers in parallel with the search head. We'll keep you updated on our progress with these changes.
@vivekr-splunk yep, I agree, it was an informative meeting. But this ticket is different, as it is about the Splunk logic itself (or splunk-ansible), which was fixed in the Splunk container starting from 9.2.0.
We were discussing: https://github.com/splunk/splunk-operator/issues/1293
Also, today I rechecked 9.1.4: it is not working either. So 9.1.1 is the last working version and the last supported version.
All others are broken or not supported.
Please select the type of request
Bug
Tell us more
Describe the request: All nodes start as expected; only the indexers can't.
Expected behavior: Everything works as it did before.
Splunk setup on K8S: EKS
Reproduction/Testing steps
K8s environment: 1.28
Additional context(optional)