The `default.yml` you provide to the container is read and merged over one that's baked into splunk-ansible here: https://github.com/splunk/splunk-ansible/blob/develop/inventory/splunk_defaults_linux.yml. So if you neglect to pass in `splunk.hec` or any of the `splunk.hec.*` keys, this ensures that every stanza is fully defined by falling back to safe defaults.
Right, that makes sense, and I guess I was misreading the intention of `Setup global HEC`. When I pass this:
```yaml
splunk:
  hec:
    enable: False
    port: 8088
    ssl: True
    token:
```
I guess I was expecting `Setup global HEC` to not run at all, versus passing the correct value to `disabled:`. But for multiple runs of this task on already configured instances I can understand why this is being done.
The problem I am seeing is that `Setup global HEC` is failing during the cluster build. I found that if I forced `Setup global HEC` to be skipped during the build, everything would complete.
This is probably more of an issue with the lab environment I'm running than the logical flow of the code, although having a way to limit the number of calls to Splunk during the build would be nice. Maybe a retry in `Setup global HEC` would be better than trying once and failing?
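For reference, a retry can be bolted onto a uri-based task with Ansible's `until`/`retries`/`delay`. The task body below is only a sketch, not the actual contents of `set_as_hec_receiver.yml`; variable names like `splunk.admin_user` and `splunk.password` are assumptions:

```yaml
- name: Setup global HEC
  uri:
    url: "https://127.0.0.1:{{ splunk.svc_port }}/services/data/inputs/http/http"
    method: POST
    url_username: "{{ splunk.admin_user }}"   # assumed variable names
    url_password: "{{ splunk.password }}"
    validate_certs: false
    force_basic_auth: true
    body_format: form-urlencoded
    body:
      disabled: "{{ '0' if splunk.hec.enable else '1' }}"
  register: hec_result
  # Retry instead of failing on the first connection-refused / non-200 response
  until: hec_result.status == 200
  retries: 5
  delay: 10
```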
I can certainly add a retry in that section, although I'm curious to know why it's failing. Are you getting a connection refused, or is there an actual HTTP status code being generated? If you run using `ansible-playbook -vv` it'll print out the verbose debugging information.
It was a TCP disconnect, no HTTP return code. The problem is that it happens intermittently, and as the number of apps being pushed from the cluster master (and possibly restarts required) increases, the likelihood of this occurring goes up.
Another way I have found to work around this is to delay app installation until the infrastructure base is fully built. Once built, I rerun the roles, but with the app variables included. That seems to help. A retry would definitely help to prevent build failure in some cases.
I'll reorder the deployment build to include apps again and grab any details I can from the failure.
Here's the error that I'm seeing. During this build, 2 of 4 indexers failed this task:
```
2020-03-30 23:28:58,098 p=10034 u=deployer n=ansible | Monday 30 March 2020 23:28:58 +0000 (0:00:04.280) 0:05:08.886 **********
2020-03-30 23:28:58,464 p=10034 u=deployer n=ansible | RUNNING HANDLER [splunk_indexer : Setup global HEC] ****************************
2020-03-30 23:28:58,464 p=10034 u=deployer n=ansible | task path: /home/deployer/splunk-engagement-ansible/ansible/splunk-ansible/roles/splunk_common/tasks/set_as_hec_receiver.yml:4
2020-03-30 23:28:58,465 p=10034 u=deployer n=ansible | fatal: [splkidx3.psl.local]: FAILED! => {
    "changed": false,
    "content": "",
    "elapsed": 0,
    "redirected": false,
    "status": -1,
    "url": "https://127.0.0.1:8089/services/data/inputs/http/http"
}

MSG:

Status code was -1 and not [200]: Request failed: <urlopen error [Errno 111] Connection refused>
```
As I mentioned previously, my lab environment is running on lower-spec hardware (t3.medium), and I see this more often the more apps I push to the cluster master. A retry, like on other tasks, could be very useful here when testing in non-production environments.
Hmmm, actually now that I'm looking a little closer, there's another case of this happening during `check_for_required_restarts.yml`. Both of these tasks are run just after `restart_splunk.yml`:
```
2020-03-30 23:28:58,096 p=10034 u=deployer n=ansible | task path: /home/deployer/splunk-engagement-ansible/ansible/splunk-ansible/roles/splunk_common/handlers/restart_splunk.yml:26
2020-03-30 23:28:58,097 p=10034 u=deployer n=ansible | ok: [splkidx3.psl.local] => {
    "changed": false,
    "elapsed": 4,
    "match_groupdict": {},
    "match_groups": [],
    "path": null,
    "port": 8089,
    "search_regex": null,
    "state": "started"
}
2020-03-30 23:28:58,098 p=10034 u=deployer n=ansible | Monday 30 March 2020 23:28:58 +0000 (0:00:04.280) 0:05:08.886 **********
2020-03-30 23:28:58,464 p=10034 u=deployer n=ansible | RUNNING HANDLER [splunk_indexer : Setup global HEC] ****************************
2020-03-30 23:28:58,464 p=10034 u=deployer n=ansible | task path: /home/deployer/splunk-engagement-ansible/ansible/splunk-ansible/roles/splunk_common/tasks/set_as_hec_receiver.yml:4
2020-03-30 23:28:58,465 p=10034 u=deployer n=ansible | fatal: [splkidx3.psl.local]: FAILED! => {
    "changed": false,
    "content": "",
    "elapsed": 0,
    "redirected": false,
    "status": -1,
    "url": "https://127.0.0.1:8089/services/data/inputs/http/http"
}

MSG:

Status code was -1 and not [200]: Request failed: <urlopen error [Errno 111] Connection refused>

2020-03-30 23:29:22,384 p=10034 u=deployer n=ansible | RUNNING HANDLER [splunk_common : Wait for splunkd management port] *************
2020-03-30 23:29:22,384 p=10034 u=deployer n=ansible | task path: /home/deployer/splunk-engagement-ansible/ansible/splunk-ansible/roles/splunk_common/handlers/restart_splunk.yml:26
2020-03-30 23:29:22,385 p=10034 u=deployer n=ansible | ok: [splkidx1.psl.local] => {
    "changed": false,
    "elapsed": 2,
    "match_groupdict": {},
    "match_groups": [],
    "path": null,
    "port": 8089,
    "search_regex": null,
    "state": "started"
}
2020-03-30 23:29:22,386 p=10034 u=deployer n=ansible | Monday 30 March 2020 23:29:22 +0000 (0:00:02.257) 0:05:33.174 **********
2020-03-30 23:29:22,889 p=10034 u=deployer n=ansible | RUNNING HANDLER [splunk_indexer : Check for required restarts] *****************
2020-03-30 23:29:22,889 p=10034 u=deployer n=ansible | task path: /home/deployer/splunk-engagement-ansible/ansible/splunk-ansible/roles/splunk_common/tasks/check_for_required_restarts.yml:2
2020-03-30 23:29:22,890 p=10034 u=deployer n=ansible | fatal: [splkidx1.psl.local]: FAILED! => {
    "changed": false,
    "content": "",
    "elapsed": 0,
    "redirected": false,
    "status": -1,
    "url": "https://127.0.0.1:8089/services/messages/restart_required?output_mode=json"
}

MSG:

Status code was -1 and not [200, 404]: Request failed: <urlopen error [Errno 111] Connection refused>
```
Now that I'm looking at it, maybe the issue is that this check doesn't actually indicate that Splunk is ready for HTTP requests:

```yaml
- name: "Wait for splunkd management port"
  wait_for:
    port: "{{ splunk.svc_port }}"
```

Could that be the issue? In most cases the port comes up and Splunk is ready to go, but under some conditions the port is up while splunkd is not yet accepting requests?
Hmm yeah, it seems like there was a problem during the startup/restart of Splunk, which is at the core of the issue. My guess is a few things could be happening:
I'll need to dig into the source code of what `wait_for` does when checking `port`; it might even be worth augmenting this section with some actual HTTP checks on port 8089, or maybe checking other ports opened by Splunk.
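For illustration, such an HTTP readiness check might look something like the sketch below; the endpoint choice and variable names (`splunk.admin_user`, `splunk.password`) are assumptions, not the existing splunk-ansible handler:

```yaml
- name: Wait for splunkd management port
  wait_for:
    port: "{{ splunk.svc_port }}"

# On top of the raw TCP check, poll the management API until it returns a
# real HTTP 200, so downstream handlers don't fire before splunkd is ready.
- name: Wait for splunkd management API to respond
  uri:
    url: "https://127.0.0.1:{{ splunk.svc_port }}/services/server/info?output_mode=json"
    method: GET
    url_username: "{{ splunk.admin_user }}"   # assumed variable names
    url_password: "{{ splunk.password }}"
    validate_certs: false
    force_basic_auth: true
    status_code: 200
  register: mgmt_check
  until: mgmt_check.status == 200
  retries: 12
  delay: 5
```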
In the meantime, it still seems to me that Splunk isn't starting properly (or at least, reliably) for you. Are you using systemd/init.d to start Splunk? If so, is there any information in `systemctl status <splunk-service-name>`?
Alternatively, on your flaky host you can try to manually stop/start/restart Splunk (`/opt/splunk/bin/splunk restart`) a few times to see if the startup output gives you any clues as to why this could be happening. You might even have logs in `/opt/splunk/var/log/splunk/`, such as splunkd.log or splunkd_stderr.log, which could contain error messages explaining what happened.
As an aside, I don't think retries around the HEC configuration would help much if you're getting connection refused errors. But what we could (and should) do is make the startup checks more robust and fail earlier here if Splunk did not start properly (although it's hard to imagine why that would be if all exit codes are 0).
Looks like I didn't dig into this one enough. The connection failure is definitely service startup related:
```
03-31-2020 13:05:04.312 +0000 ERROR CMSlave - Master has multisite disabled but peer has a site configuration. ht=60.000 rf=3 sf=2 ct=60.000 st=60.000 rt=60.000 rct=60.000 rst=60.000 rrt=60.000 rmst=180.000 rmrt=180.000 icps=-1 sfrt=600.000 pe=1 im=0 is=1 mob=5 mor=5 mosr=5 pb=5 rep_port=port=9887 isSsl=0 ipv6=0 cipherSuite= ecdhCurveNames= sslVersions=SSL3,TLS1.0,TLS1.1,TLS1.2 compressed=0 allowSslRenegotiation=1 dhFile= reqCliCert=0 serverCert= rootCA= commonNames= alternateNames= pptr=10 fznb=10 Empty/Default cluster pass4symmkey=false allow Empty/Default cluster pass4symmkey=true rrt=restart dft=180 abt=600 sbs=1
03-31-2020 13:05:04.312 +0000 ERROR loader - clustering initialization failed; won't start splunkd
```
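For context, that error means the indexer peer came up with a `site` configured while the cluster master did not (yet) have multisite enabled. A consistent multisite setup in server.conf looks roughly like this; the values below are purely illustrative, not taken from this deployment:

```ini
# Cluster master server.conf (illustrative)
[general]
site = site1

[clustering]
mode = master
multisite = true
available_sites = site1,site2
site_replication_factor = origin:2,total:3
site_search_factor = origin:1,total:2

# Indexer peer server.conf (illustrative)
[general]
site = site1

[clustering]
mode = slave
master_uri = https://<cluster-master>:8089
```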
I think rather than keep this issue open, I'll have another look on my side. Appreciate you taking the time to look this over. There appears to be a timing issue with the cluster master and multisite setup, and the issue becomes more apparent under different load. For sure, the HEC task is not the problem.
It's possible I'm misunderstanding the intention of the `set_as_hec_receiver.yml` task, but I think there should be a check to see if `splunk.hec` is set. For instance, in the splunk_indexer role this task is run without any check:
Which might be OK, but `set_as_hec_receiver.yml` runs this every time:
It would be good if HEC setup was skipped when HEC is disabled or undefined. I think adding this:
```yaml
when: ('hec' in splunk and splunk.hec.enable and 'token' in splunk.hec) or ('hec_token' in splunk)
```
to:

- `Setup global HEC`
- `Remove existing HEC token`

would ensure that these tasks are skipped when HEC isn't defined. A sketch of what that could look like follows below.
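A rough sketch of attaching that guard to the `Setup global HEC` handler; the task body is illustrative only (variable names such as `splunk.admin_user` are assumptions, not the actual implementation in `set_as_hec_receiver.yml`):

```yaml
- name: Setup global HEC
  uri:
    url: "https://127.0.0.1:{{ splunk.svc_port }}/services/data/inputs/http/http"
    method: POST
    url_username: "{{ splunk.admin_user }}"   # assumed variable names
    url_password: "{{ splunk.password }}"
    validate_certs: false
    force_basic_auth: true
    body_format: form-urlencoded
    body:
      disabled: "0"  # illustrative; the real task would derive this from splunk.hec.enable
  # Skip HEC configuration entirely when no HEC settings were supplied;
  # the same condition would be added to "Remove existing HEC token".
  when: ('hec' in splunk and splunk.hec.enable and 'token' in splunk.hec) or ('hec_token' in splunk)
```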