splunk / docker-splunk

Splunk Docker GitHub Repository

Splunk Upgrade Failing / Constant Container "Restarting" #499

Open josephnoctum opened 3 years ago

josephnoctum commented 3 years ago

I have an old Splunk image running 7.2.5 as a heavy forwarder that I'm trying to upgrade to the latest image. Docker is running in swarm mode as the orchestrator. After starting the new image, the container never stays up and is stuck in a restart loop. The tail of the logs shows:

2021-07-07T16:55:00.359683553Z TASK [splunk_common : Check for existing splunk secret] ************************
2021-07-07T16:55:00.360186022Z fatal: [localhost]: FAILED! => {
2021-07-07T16:55:00.360196058Z     "changed": false
2021-07-07T16:55:00.360198998Z }
2021-07-07T16:55:00.360201373Z 
2021-07-07T16:55:00.360203750Z MSG:
2021-07-07T16:55:00.360206217Z 
2021-07-07T16:55:00.360208614Z Permission denied
2021-07-07T16:55:00.364253598Z 
2021-07-07T16:55:00.364270798Z PLAY RECAP *********************************************************************
2021-07-07T16:55:00.364497637Z localhost                  : ok=6    changed=0    unreachable=0    failed=1    skipped=3    rescued=0    ignored=0 

This looks similar to another issue I found here, where the problem was with mounting the existing volumes and the solution was to make some corrections in the Kubernetes YAML. I'm not running Kubernetes, though, and haven't found where to correct this issue yet.
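A quick way to confirm it really is an ownership problem on the mounted data, without starting Splunk at all, is to list the auth directory straight off the volume (a sketch; the volume name splunk-etc is an assumption, substitute whatever your swarm service actually mounts):

# show numeric uid/gid on the files the failing task is touching
docker run --rm -v splunk-etc:/data alpine ls -ln /data/auth
# compare the owner against the user the splunk process runs as in the new image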

josephnoctum commented 3 years ago

Found a workaround in one of the earlier tickets: https://github.com/splunk/docker-splunk/issues/209. Tested the chmod change on splunk/etc/auth and splunk/etc/auth/splunk.secret and that worked, so it seems there is still a hiccup with the user changing when updating a container.

josephnoctum commented 3 years ago

This workaround has actually led to a lot of problems. Beforehand, the owner was polkitd and the group was input for all the objects in the var and etc volumes. With the new container, the owner is glee and the group is 41812. The new container is running, but over half of the inputs and configuration pages I go to in the web interface just show a "loading" screen with a spinning wheel. So we've now broken over half of the apps I have installed, and I can't go back to the old image because it begins to start and then goes to a "Starting Unhealthy" status with nothing in the logs to indicate why.

jnichols3 commented 2 years ago

For this I think you'll need to shell into the container and fix the file ownership for everything related to Splunk. Once shelled in, check which user the splunk process runs as and make sure that user owns all the files, i.e. chown -R splunk:splunk /opt/splunk. Most Kubernetes admins running Splunk will have an init container that just does this every time; persistent volumes and user permissions/ownership are always a pain when moving stuff around.
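A minimal sketch of that fix on a plain Docker host (the container ID and the splunk:splunk owner are assumptions; adjust to whatever the process check actually shows):

# open a root shell in the running container
docker exec -it -u root <container-id> bash

# inside the container: see which user splunkd runs as, then make that user own everything
ps -o user= -p "$(pgrep -o splunkd)"
chown -R splunk:splunk /opt/splunk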

The Dockerfile for Splunk does set the gid/uid explicitly to 41812 (https://github.com/splunk/docker-splunk/blob/develop/splunk/common-files/Dockerfile#L53), but is that user perhaps already in use in your environment?
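One quick way to check that on the Docker host (a sketch; empty getent output means the ID is free, and the image tag is just an example):

# does anything on the host already claim uid/gid 41812?
getent passwd 41812
getent group 41812
# what uid/gid does the splunk user have inside the new image?
docker run --rm --entrypoint id splunk/splunk:latest splunk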

The chmod 777 suggestion in https://github.com/splunk/docker-splunk/issues/209 really should not be done; that makes the splunk secret world-readable and world-writable. The problem is the owner, not the actual permissions on the files.
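As a sketch, the ownership-based alternative inside the container looks roughly like this (paths per the default image layout; the 400 mode is an assumption about how tightly you want the secret locked down):

# fix the owner instead of loosening permissions
chown -R splunk:splunk /opt/splunk/etc/auth
# keep the secret readable only by the splunk user
chmod 400 /opt/splunk/etc/auth/splunk.secret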

josephnoctum commented 1 year ago

At this point, I don't even see the reason to use the Docker container. It's supposed to make administration more manageable, but every time I've tried to update to the latest version of the container I get the same problem: constant restarts. We have a UID of 41812, so can I directly change this in the image? No, of course not, it's Docker. So I write a YAML file and compose up; doesn't help. I try setting the user in the variables when building the container; doesn't help. I've even completely wiped out the persistent volumes, giving up on the idea that I could keep any data around, and it still never actually starts: it comes up for about a minute and then reboots. The logs are completely useless as well:

TASK [splunk_common : Check if /opt/splunk/var/lib/splunk/kvstore/mongo/splunk.key exists] ***
ok: [localhost]
Tuesday 27 September 2022  00:10:09 +0000 (0:00:00.328)       0:00:10.821 ***
FAILED - RETRYING: Start Splunk via CLI (5 retries left).
FAILED - RETRYING: Start Splunk via CLI (4 retries left).

I've never actually seen a solution for anyone who is having this issue in any of the threads, and I'm tired of dealing with it. It's way more of a hassle than it's worth. I'm just going to use Cribl for heavy forwarders now.
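For reference, the host-side equivalent of the ownership fix suggested earlier in this thread is roughly the following, run before starting the new container (a sketch; the named volumes splunk-etc and splunk-var are assumptions, substitute whatever the compose file actually mounts):

# chown the mounted data to the uid/gid the image expects (41812 per the Dockerfile),
# without starting Splunk itself
docker run --rm -v splunk-etc:/opt/splunk/etc -v splunk-var:/opt/splunk/var \
  alpine chown -R 41812:41812 /opt/splunk/etc /opt/splunk/var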