splunk / splunk-ansible

Ansible playbooks for configuring and managing Splunk Enterprise and Universal Forwarder deployments

set_as_hec_receiver.yml runs without a check for hec #423

Closed: credibleforce closed this issue 4 years ago

credibleforce commented 4 years ago

It's possible I'm misunderstanding the intention of the set_as_hec_receiver.yml task. But I think there should be a check to see if splunk.hec is set.

For instance in the splunk_indexer role this task is run without any check:

---
- include_tasks: ../../../roles/splunk_common/tasks/set_as_hec_receiver.yml

- include_tasks: indexer_clustering.yml
  when: splunk_indexer_cluster | bool

Which might be OK, but the task set_as_hec_receiver.yml runs this every time:

- name: Setup global HEC
  uri:
    url: "{{ cert_prefix }}://127.0.0.1:{{ splunk.svc_port }}/services/data/inputs/http/http"
    method: POST
    user: "{{ splunk.admin_user }}"
    password: "{{ splunk.password }}"
    validate_certs: false
    body:
      disabled: "{% if ('hec' in splunk and 'enable' in splunk.hec and splunk.hec.enable | bool) or ('hec_disabled' in splunk and not splunk.hec_disabled | bool) %}0{% else %}1{% endif %}"
      enableSSL: "{% if ('hec' in splunk and 'ssl' in splunk.hec and splunk.hec.ssl | bool) or ('hec_enableSSL' in splunk and splunk.hec_enableSSL | bool) %}1{% else %}0{% endif %}"
      port: "{% if 'hec' in splunk and 'port' in splunk.hec and splunk.hec.port %}{{ splunk.hec.port }}{% elif 'hec_port' in splunk and splunk.hec_port %}{{ splunk.hec_port }}{% else %}8088{% endif %}"
    body_format: "form-urlencoded"
    status_code: 200
    timeout: 10
  no_log: "{{ hide_password }}"
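The three templated body fields above each collapse a primary splunk.hec.* setting with a legacy flat fallback (hec_disabled, hec_enableSSL, hec_port). Ignoring those legacy fallbacks, the same logic could be written more readably with default() filters. This is an illustrative simplification, not the shipped code:

```yaml
# Illustrative sketch: honours only splunk.hec.*, dropping the legacy
# hec_disabled / hec_enableSSL / hec_port fallbacks the real task supports.
body:
  disabled: "{{ '0' if splunk.hec.enable | default(false) | bool else '1' }}"
  enableSSL: "{{ '1' if splunk.hec.ssl | default(false) | bool else '0' }}"
  port: "{{ splunk.hec.port | default(8088) }}"
```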

It would be good if HEC setup were skipped when HEC is disabled or undefined. I think adding this condition:

when: ('hec' in splunk and splunk.hec.enable and 'token' in splunk.hec) or ('hec_token' in splunk)

to the include would ensure that this task is skipped when hec isn't defined.
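For illustration, that guard could be attached to the include in the splunk_indexer role. This is a sketch of the proposal, not merged code:

```yaml
# Hypothetical guard: skip global HEC setup entirely when HEC is not
# configured, instead of POSTing a disabled stanza on every run.
- include_tasks: ../../../roles/splunk_common/tasks/set_as_hec_receiver.yml
  when: ('hec' in splunk and splunk.hec.enable | bool and 'token' in splunk.hec)
        or ('hec_token' in splunk)
```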

nwang92 commented 4 years ago

The default.yml you provide to the container is read and merged over one that's baked into splunk-ansible here: https://github.com/splunk/splunk-ansible/blob/develop/inventory/splunk_defaults_linux.yml. So if you neglect to pass in splunk.hec or splunk.hec.*, this ensures that every stanza is fully defined by falling back to some safe defaults.
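To illustrate the merge (the key names here are a sketch of what the defaults file provides, not a verbatim copy of it), a user default.yml that only sets splunk.hec.enable still ends up with a fully populated hec stanza:

```yaml
# User-supplied default.yml (partial):
splunk:
  hec:
    enable: False

# After merging over the baked-in splunk_defaults_linux.yml (illustrative
# keys), the effective config has every field the task's templates reference:
# splunk:
#   hec:
#     enable: False    # user value wins
#     ssl: True        # from defaults
#     port: 8088       # from defaults
#     token: <generated by defaults>
```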

credibleforce commented 4 years ago

Right, that makes sense and I guess I was misreading the intention of Setup global HEC. When I pass this:

splunk:
  hec:
    enable: False
    port: 8088
    ssl: True
    token:

I guess I was expecting Setup global HEC to not run at all, versus passing the correct value to disabled:. But for multiple runs of this task on already configured instances I can understand why this is being done.

The problem I am seeing is that Setup global HEC is failing during the cluster build. I found that if I forced Setup global HEC to be skipped during the build, everything would complete.

This is probably more of an issue with the lab environment I'm running than with the logical flow of the code, although having a way to limit the number of calls to Splunk during the build would be nice. Maybe a retry in Setup global HEC would be better than trying once and failing?
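A retried version of the task could look something like the sketch below (the retries/delay values are arbitrary, and the body is elided for brevity; it would stay as in the original task):

```yaml
# Sketch: retry Setup global HEC instead of failing on the first
# connection error. retries/delay values are assumptions.
- name: Setup global HEC
  uri:
    url: "{{ cert_prefix }}://127.0.0.1:{{ splunk.svc_port }}/services/data/inputs/http/http"
    method: POST
    user: "{{ splunk.admin_user }}"
    password: "{{ splunk.password }}"
    validate_certs: false
    # body: (unchanged from the original task)
    body_format: "form-urlencoded"
    status_code: 200
    timeout: 10
  register: hec_setup_result
  until: hec_setup_result.status == 200
  retries: 5
  delay: 10
  no_log: "{{ hide_password }}"
```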

nwang92 commented 4 years ago

I can certainly add a retry in that section, although I'm curious to know why it's failing. Are you getting a connection refused, or is there an actual HTTP status code being generated? If you run using ansible-playbook -vv it'll print out the verbose debugging information.

credibleforce commented 4 years ago

It was a TCP disconnect, no HTTP return code. The problem is that it happens intermittently, and as the number of apps being pushed from the cluster master (and possibly the restarts required) increases, the likelihood of this occurring goes up.

Another way I have found to work around this is to delay app installation until the infrastructure base is fully built. Once built, I rerun the roles with the app variables included. That seems to help. A retry would definitely help prevent build failures in some cases.

I'll reorder the deployment build to include apps again and grab any details I can from the failure.

credibleforce commented 4 years ago

Here's the error that I'm seeing, during this build 2 of 4 indexers failed this task:

2020-03-30 23:28:58,098 p=10034 u=deployer n=ansible | Monday 30 March 2020  23:28:58 +0000 (0:00:04.280)       0:05:08.886 **********
2020-03-30 23:28:58,464 p=10034 u=deployer n=ansible | RUNNING HANDLER [splunk_indexer : Setup global HEC] ****************************
2020-03-30 23:28:58,464 p=10034 u=deployer n=ansible | task path: /home/deployer/splunk-engagement-ansible/ansible/splunk-ansible/roles/splunk_common/tasks/set_as_hec_receiver.yml:4
2020-03-30 23:28:58,465 p=10034 u=deployer n=ansible | fatal: [splkidx3.psl.local]: FAILED! => {
    "changed": false,
    "content": "",
    "elapsed": 0,
    "redirected": false,
    "status": -1,
    "url": "https://127.0.0.1:8089/services/data/inputs/http/http"
}

MSG:

Status code was -1 and not [200]: Request failed: <urlopen error [Errno 111] Connection refused>

As I mentioned previously, my lab environment runs on lower-spec hardware (t3.medium), and I see this more often the more apps I push to the cluster master. A retry, like on other tasks, could be very useful here when testing in non-production environments.

credibleforce commented 4 years ago

Hmmm, actually, now that I'm looking a little closer, there's another case of this happening during check_for_required_restarts.yml. Both of these tasks run just after restart_splunk.yml:

2020-03-30 23:28:58,096 p=10034 u=deployer n=ansible | task path: /home/deployer/splunk-engagement-ansible/ansible/splunk-ansible/roles/splunk_common/handlers/restart_splunk.yml:26
2020-03-30 23:28:58,097 p=10034 u=deployer n=ansible | ok: [splkidx3.psl.local] => {
    "changed": false,
    "elapsed": 4,
    "match_groupdict": {},
    "match_groups": [],
    "path": null,
    "port": 8089,
    "search_regex": null,
    "state": "started"
}
2020-03-30 23:28:58,098 p=10034 u=deployer n=ansible | Monday 30 March 2020  23:28:58 +0000 (0:00:04.280)       0:05:08.886 **********
2020-03-30 23:28:58,464 p=10034 u=deployer n=ansible | RUNNING HANDLER [splunk_indexer : Setup global HEC] ****************************
2020-03-30 23:28:58,464 p=10034 u=deployer n=ansible | task path: /home/deployer/splunk-engagement-ansible/ansible/splunk-ansible/roles/splunk_common/tasks/set_as_hec_receiver.yml:4
2020-03-30 23:28:58,465 p=10034 u=deployer n=ansible | fatal: [splkidx3.psl.local]: FAILED! => {
    "changed": false,
    "content": "",
    "elapsed": 0,
    "redirected": false,
    "status": -1,
    "url": "https://127.0.0.1:8089/services/data/inputs/http/http"
}

MSG:

Status code was -1 and not [200]: Request failed: <urlopen error [Errno 111] Connection refused>
2020-03-30 23:29:22,384 p=10034 u=deployer n=ansible | RUNNING HANDLER [splunk_common : Wait for splunkd management port] *************
2020-03-30 23:29:22,384 p=10034 u=deployer n=ansible | task path: /home/deployer/splunk-engagement-ansible/ansible/splunk-ansible/roles/splunk_common/handlers/restart_splunk.yml:26
2020-03-30 23:29:22,385 p=10034 u=deployer n=ansible | ok: [splkidx1.psl.local] => {
    "changed": false,
    "elapsed": 2,
    "match_groupdict": {},
    "match_groups": [],
    "path": null,
    "port": 8089,
    "search_regex": null,
    "state": "started"
}
2020-03-30 23:29:22,386 p=10034 u=deployer n=ansible | Monday 30 March 2020  23:29:22 +0000 (0:00:02.257)       0:05:33.174 **********
2020-03-30 23:29:22,889 p=10034 u=deployer n=ansible | RUNNING HANDLER [splunk_indexer : Check for required restarts] *****************
2020-03-30 23:29:22,889 p=10034 u=deployer n=ansible | task path: /home/deployer/splunk-engagement-ansible/ansible/splunk-ansible/roles/splunk_common/tasks/check_for_required_restarts.yml:2
2020-03-30 23:29:22,890 p=10034 u=deployer n=ansible | fatal: [splkidx1.psl.local]: FAILED! => {
    "changed": false,
    "content": "",
    "elapsed": 0,
    "redirected": false,
    "status": -1,
    "url": "https://127.0.0.1:8089/services/messages/restart_required?output_mode=json"
}

MSG:

Status code was -1 and not [200, 404]: Request failed: <urlopen error [Errno 111] Connection refused>

Now that I'm looking at it, maybe the issue is that this check doesn't actually indicate Splunk is ready for HTTP requests:

- name: "Wait for splunkd management port"
  wait_for:
    port: "{{ splunk.svc_port }}"

Could that be the issue? In most cases the port comes up and Splunk is ready to go, but under some conditions the port is open while splunkd isn't yet accepting requests?

nwang92 commented 4 years ago

Hmm yeah, it seems like there was a problem during the startup/restart of Splunk, which is at the core of the problem. My guess is that a few things could be happening.

I'll need to dig into the source code of what wait_for does when checking the port; it might even be worth augmenting this section with some actual HTTP checks on port 8089, or maybe checking other ports opened by Splunk.
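Such an augmented check might follow the port test with a real HTTP probe against the management API. This is only a sketch of the idea; the endpoint choice (/services/server/info is a standard Splunk REST endpoint) and the retry values are assumptions:

```yaml
# Sketch: port check, then a real REST probe with retries, so the
# playbook only proceeds once splunkd is actually answering requests.
- name: "Wait for splunkd management port"
  wait_for:
    port: "{{ splunk.svc_port }}"

- name: "Wait for splunkd to answer REST requests"
  uri:
    url: "{{ cert_prefix }}://127.0.0.1:{{ splunk.svc_port }}/services/server/info?output_mode=json"
    user: "{{ splunk.admin_user }}"
    password: "{{ splunk.password }}"
    validate_certs: false
    status_code: 200
  register: splunkd_ready
  until: splunkd_ready.status == 200
  retries: 12
  delay: 5
  no_log: "{{ hide_password }}"
```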

In the meantime, it still seems to me that Splunk isn't starting properly (or at least, reliably) for you. Are you using systemd/init.d to start Splunk? If so, is there any information in systemctl status <splunk-service-name>?

Alternatively, on your flaky host you can manually stop/start/restart Splunk (/opt/splunk/bin/splunk restart) a few times to see if the startup output gives you any clues as to why this could be happening. You may also have logs in /opt/splunk/var/log/splunk/, like splunkd.log or splunkd_stderr.log, with error messages explaining what happened.

nwang92 commented 4 years ago

As an aside, I don't think retries around the HEC configuration would help much if you're getting connection-refused errors. But what we could (and should) do is make the startup checks more robust and fail earlier if Splunk did not start properly (although it's hard to imagine why, if all exit codes are 0).

credibleforce commented 4 years ago

Looks like I didn't dig into this one enough. The connection failure is definitely service startup related:

03-31-2020 13:05:04.312 +0000 ERROR CMSlave - Master has multisite disabled but peer has a site configuration. ht=60.000 rf=3 sf=2 ct=60.000 st=60.000 rt=60.000 rct=60.000 rst=60.000 rrt=60.000 rmst=180.000 rmrt=180.000 icps=-1 sfrt=600.000 pe=1 im=0 is=1 mob=5 mor=5 mosr=5 pb=5 rep_port=port=9887 isSsl=0 ipv6=0 cipherSuite= ecdhCurveNames= sslVersions=SSL3,TLS1.0,TLS1.1,TLS1.2 compressed=0 allowSslRenegotiation=1 dhFile= reqCliCert=0 serverCert= rootCA= commonNames= alternateNames= pptr=10 fznb=10 Empty/Default cluster pass4symmkey=false allow Empty/Default cluster pass4symmkey=true rrt=restart dft=180 abt=600 sbs=1
03-31-2020 13:05:04.312 +0000 ERROR loader - clustering initialization failed; won't start splunkd

Rather than keep this issue open, I think I'll have another look on my side. I appreciate you taking the time to look this over. There appears to be a timing issue with the cluster master and multisite setup, and it becomes more apparent under different load. The HEC task is definitely not the problem.