splunk / splunk-platform-automator

Ansible framework providing a fast and simple way to spin up complex Splunk environments.
Apache License 2.0

Installing/Upgrade to Splunk v8.0 with Splunkenizer #10

Closed aleoliva closed 4 years ago

aleoliva commented 4 years ago

Describe the bug: The installation of Splunk v8.0.0 with Splunkenizer was unsuccessful.

1. When we tried a fresh installation with ansible/deploy_site.yml, the playbook finished successfully; however, the indexers were not registered.

2. When we tried an upgrade with ansible/upgrade_splunk.yml, the playbook failed because it couldn't restart splunkd.

Expected behavior: In Case 1, we expected this output:

[splunk@splp0cm000 ~]$ splunk show cluster-status

 Replication factor met
 Search factor met
 All data is searchable
 Indexing Ready YES

 splp0ix000.splunk.sbb.ch    024D3D82-2928-4DE0-8B9C-63265B58A66C    site1
     Searchable YES
     Status  Up
     Bucket Count=124

 splp0ix002.splunk.sbb.ch    6DA75D2C-6156-4FEC-B79F-63D18C3AC313    site1
     Searchable YES
     Status  Up
     Bucket Count=128

 splp0ix003.splunk.sbb.ch    92E8BE2A-1AD0-4B43-B392-4323E236F497    site2
     Searchable YES
     Status  Up
     Bucket Count=126

 splp0ix001.splunk.sbb.ch    B5E84637-93EA-4D0C-91B1-78B496326A52    site2
     Searchable YES
     Status  Up
     Bucket Count=132

However, we received this instead:

[splunk@splp0cm000 ~]$ splunk show cluster-status

 Replication factor not met
 Search factor not met
 All data is not searchable
 Indexing Ready NO
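
The two outputs above differ only in their four summary lines, so this check can be scripted after a deployment. A minimal Python sketch, assuming the output of `splunk show cluster-status` has already been captured as text; the function name `cluster_is_healthy` is hypothetical and not part of Splunkenizer:

```python
# Hypothetical helper: decides whether captured `splunk show cluster-status`
# output reports a healthy cluster, based on the four summary lines shown
# in this issue. The summary wording is taken from the outputs above.

HEALTHY_SUMMARY = (
    "Replication factor met",
    "Search factor met",
    "All data is searchable",
    "Indexing Ready YES",
)


def cluster_is_healthy(status_output):
    """Return True only if every healthy summary line appears verbatim."""
    # Compare whole stripped lines so that "Replication factor not met"
    # is never mistaken for "Replication factor met".
    lines = {line.strip() for line in status_output.splitlines()}
    return all(marker in lines for marker in HEALTHY_SUMMARY)
```

Matching whole lines rather than substrings is deliberate, since a failing line such as "All data is not searchable" shares most of its text with the healthy one.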

In Case 2, the playbook is interrupted with this error on all target servers:

...
TASK [splunk_software : start splunk] ********************************************
2019-10-29 16:06:34,951 p=828 u=aleoliva |  fatal: [splp0cm000.splunk.sbb.ch]: FAILED! => {"changed": false, "msg": "Unable to start service splunk: Job for splunk.service failed because the control process exited with error code. See \"systemctl status splunk.service\" and \"journalctl -xe\" for details.\n"}
...

Inside the servers, we can get this information:

[linux@splp0cm000 ~]$ sudo systemctl status splunk
· splunk.service - Systemd service file for Splunk, generated by 'splunk enable boot-start'
   Loaded: loaded (/etc/systemd/system/splunk.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Wed 2019-10-30 09:32:10 CET; 1min 25s ago
  Process: 30167 ExecStartPost=/bin/bash -c chown -R 1001:1001 /sys/fs/cgroup/memory/system.slice/%n (code=exited, status=0/SUCCESS)
  Process: 30165 ExecStartPost=/bin/bash -c chown -R 1001:1001 /sys/fs/cgroup/cpu/system.slice/%n (code=exited, status=0/SUCCESS)
  Process: 30164 ExecStart=/opt/splunk/bin/splunk _internal_launch_under_systemd --accept-license --answer-yes --no-prompt (code=exited, status=1/FAILURE)
 Main PID: 30164 (code=exited, status=1/FAILURE)

Oct 30 09:32:10 splp0cm000.splunk.sbb.ch systemd[1]: Failed to start Systemd service file for Splunk, generated by 'splunk enable boot-start'.
Oct 30 09:32:10 splp0cm000.splunk.sbb.ch systemd[1]: Unit splunk.service entered failed state.
Oct 30 09:32:10 splp0cm000.splunk.sbb.ch systemd[1]: splunk.service failed.
Oct 30 09:32:10 splp0cm000.splunk.sbb.ch systemd[1]: splunk.service holdoff time over, scheduling restart.
Oct 30 09:32:10 splp0cm000.splunk.sbb.ch systemd[1]: Stopped Systemd service file for Splunk, generated by 'splunk enable boot-start'.
Oct 30 09:32:10 splp0cm000.splunk.sbb.ch systemd[1]: start request repeated too quickly for splunk.service
Oct 30 09:32:10 splp0cm000.splunk.sbb.ch systemd[1]: Failed to start Systemd service file for Splunk, generated by 'splunk enable boot-start'.
Oct 30 09:32:10 splp0cm000.splunk.sbb.ch systemd[1]: Unit splunk.service entered failed state.
Oct 30 09:32:10 splp0cm000.splunk.sbb.ch systemd[1]: splunk.service failed.

[linux@splp0cm000 ~]$ sudo journalctl -xe
-- Unit splunk.service has failed.
--
-- The result is failed.
Oct 30 09:32:10 splp0cm000.splunk.sbb.ch systemd[1]: Unit splunk.service entered failed state.
Oct 30 09:32:10 splp0cm000.splunk.sbb.ch systemd[1]: splunk.service failed.
Oct 30 09:32:10 splp0cm000.splunk.sbb.ch systemd[1]: splunk.service holdoff time over, scheduling restart.
Oct 30 09:32:10 splp0cm000.splunk.sbb.ch systemd[1]: Stopped Systemd service file for Splunk, generated by 'splunk enable boot-start'.
-- Subject: Unit splunk.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit splunk.service has finished shutting down.
Oct 30 09:32:10 splp0cm000.splunk.sbb.ch systemd[1]: start request repeated too quickly for splunk.service
Oct 30 09:32:10 splp0cm000.splunk.sbb.ch systemd[1]: Failed to start Systemd service file for Splunk, generated by 'splunk enable boot-start'.
-- Subject: Unit splunk.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit splunk.service has failed.
--
-- The result is failed.
Oct 30 09:32:10 splp0cm000.splunk.sbb.ch systemd[1]: Unit splunk.service entered failed state.
Oct 30 09:32:10 splp0cm000.splunk.sbb.ch systemd[1]: splunk.service failed.
Oct 30 09:32:15 splp0cm000.splunk.sbb.ch sudo[30170]:    linux : TTY=pts/0 ; PWD=/home/linux ; USER=root ; COMMAND=/bin/journalctl -xe
Oct 30 09:32:15 splp0cm000.splunk.sbb.ch sudo[30170]: pam_unix(sudo:session): session opened for user root by linux(uid=0)

Workaround: Case 2 can be resolved, and Splunk runs without issues, after running these commands:

$ splunk version --accept-license --answer-yes
$ sudo /opt/splunk/bin/splunk disable boot-start
$ sudo /opt/splunk/bin/splunk enable boot-start -user splunk -systemd-managed 1 -systemd-unit-file-name splunk
$ sudo systemctl start splunk

For Case 2, logs were attached to the original issue.

splunkenizer commented 4 years ago

  1. Install Splunk 8.0: Splunk 8.0 uses a new internal index (_metrics), which must be defined in the base config's indexes.conf; otherwise the cluster does not work. The base config app (org_all_indexes) has been updated, so please download it again before doing the Splunk 8 installation.

  2. Upgrade to Splunk 8.0: The upgrade playbook needs to be changed. Upgrades to 8.0 require a special procedure when Splunk runs under systemd; see "Upgrade considerations for systemd".
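
For point 1, a quick way to check whether a downloaded copy of the base config already defines the new index is to look for a `[_metrics]` stanza in its indexes.conf. A minimal Python sketch, assuming the file content has been read into a string; the helper names are hypothetical, and the check only looks at stanza headers, not at whether the stanza's settings are correct:

```python
import re

# Matches an .conf stanza header such as "[_metrics]" on a line of its own.
STANZA_RE = re.compile(r"^\[([^\]]+)\]$")


def defined_stanzas(conf_text):
    """Collect all stanza names found in the text of an indexes.conf file."""
    names = set()
    for line in conf_text.splitlines():
        match = STANZA_RE.match(line.strip())
        if match:
            # Note: this also picks up non-index stanzas like [default].
            names.add(match.group(1))
    return names


def missing_indexes(conf_text, required=("_metrics",)):
    """Return the required index names that have no stanza in the file."""
    present = defined_stanzas(conf_text)
    return [name for name in required if name not in present]
```

An empty list from `missing_indexes` means the stanza is present; a result of `["_metrics"]` suggests the older base config is still in use.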

aleoliva commented 4 years ago

After downloading the new org_all_indexes, a fresh installation of Splunk v8.0 looks successful. Point 1 looks solved.

splunkenizer commented 4 years ago

I have updated the upgrade playbook to support Splunk 8.x now. Please test.

aleoliva commented 4 years ago

When I tested, the task:

failed for one (and only one) of the servers, a search head (splp0sh000.splunk.sbb.ch).

Attached ansible.log.gz, with further details.

splunkenizer commented 4 years ago

First of all, I need to clarify the usage of the upgrade.yml playbook. This playbook is not intended to be run globally against all nodes at the same time. The upgrade procedure for the different node types needs to be followed according to the docs, and the playbook does not take care of that. It only handles the upgrade of the Splunk software on an individual node, not the upgrade order, the maintenance state, or similar concerns. An upgrade scenario could be handled as documented here: Dist Upgrade Example
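
The node-by-node approach described above could be scripted roughly as follows. This is only a sketch under several assumptions: the role names and the tier order (cluster master, then search heads, then indexers) are illustrative and should be checked against the Splunk upgrade docs, the playbook path is the one mentioned earlier in this issue, and restricting a run with Ansible's standard `--limit` option is assumed to work for this project. The code only builds the command lines; it does not run them and does not handle maintenance mode.

```python
# Assumed tier order for a clustered deployment; verify against the
# official upgrade documentation before relying on it.
UPGRADE_ORDER = ("cluster_master", "search_heads", "indexers")


def build_upgrade_commands(hosts_by_role, playbook="ansible/upgrade_splunk.yml"):
    """Return one ansible-playbook command per host, in tier order."""
    commands = []
    for role in UPGRADE_ORDER:
        for host in hosts_by_role.get(role, []):
            # --limit restricts the playbook run to a single host.
            commands.append(["ansible-playbook", playbook, "--limit", host])
    return commands


if __name__ == "__main__":
    # Host names taken from this thread; the role mapping is hypothetical.
    example = {
        "indexers": ["splp0ix000.splunk.sbb.ch", "splp0ix001.splunk.sbb.ch"],
        "cluster_master": ["splp0cm000.splunk.sbb.ch"],
        "search_heads": ["splp0sh000.splunk.sbb.ch"],
    }
    for command in build_upgrade_commands(example):
        print(" ".join(command))
```

Generating the commands as a list first makes it easy to review the order, or to add manual steps between tiers, before anything is executed.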

As for the error, it's not quite clear to me what is broken. The only thing I could see is that the systemd configuration might not have been set up correctly during installation, since the playbook does not detect systemd usage for this particular host:

2019-11-12 16:10:22,556 p=28626 u=ue60876 |  TASK [splunk_common : set use_splunk_systemd] ****************************************************************************************************************************
2019-11-12 16:10:22,877 p=28626 u=ue60876 |  skipping: [splp0sh000.splunk.sbb.ch]
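
To find such skipped hosts in a larger ansible.log, the log can be scanned for the task header and the `skipping:` lines that follow it. A minimal Python sketch, assuming the log format shown in the two lines above; the function name `skipped_hosts` is hypothetical:

```python
import re

# Patterns based on the log excerpt in this issue.
TASK_RE = re.compile(r"TASK \[(?P<task>[^\]]+)\]")
SKIP_RE = re.compile(r"skipping: \[(?P<host>[^\]]+)\]")


def skipped_hosts(log_text, task_name):
    """Return the hosts for which the named task was skipped."""
    current_task = None
    hosts = []
    for line in log_text.splitlines():
        task_match = TASK_RE.search(line)
        if task_match:
            # A new TASK header starts; later result lines belong to it.
            current_task = task_match.group("task")
            continue
        skip_match = SKIP_RE.search(line)
        if skip_match and current_task == task_name:
            hosts.append(skip_match.group("host"))
    return hosts
```

For the excerpt above, calling it with the task name `splunk_common : set use_splunk_systemd` returns a list containing only splp0sh000.splunk.sbb.ch.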

aleoliva commented 4 years ago

Thanks for the upgrade scenario. Fortunately, we are in a development phase without load, so we can run upgrades in parallel without worrying about service disruptions.

As for the error specifically, splp0sh000 certainly had systemd set up for Splunk. We can try to reproduce the error (not sure we can). Which information would be useful for you?

Regards

aleoliva commented 4 years ago

The error reported in the previous comment (#issuecomment-552940073) never reappeared.

We consider it an isolated case and suggest closing this issue, since everything now works well with version 8.0.