splunk / splunk-connect-for-syslog

Splunk Connect for Syslog

SC4S failed to start because there is another container with same NAME #2475

Closed guillerg86 closed 4 days ago

guillerg86 commented 1 month ago

Was the issue replicated by support? No

What is the sc4s version? Latest, but the issue is not tied to a specific version

Which operating system (including its version) are you using for hosting SC4S? Linux Debian Based

Which runtime (Docker, Podman, Docker Swarm, BYOE, MicroK8s) are you using for SC4S? Docker / Podman

Is there a pcap available? If so, would you prefer to attach it to this issue or send it to Splunk support? No

Is the issue related to the environment of the customer or Software related issue? No

Is it related to Data loss, please explain ? Protocol? Hardware specs? Yes because

Last chance index/Fallback index? No index related

Is the issue related to local customization? No

Do we have all the default indexes created? Not index related

Describe the bug

Our company manages several Docker-based SC4S deployments for the different organizations we support.

Lately, on several of them, SC4S has failed to start: running systemctl start sc4s.service fails with an error saying that a container named SC4S already exists.

The workaround is to run:

docker container rm SC4S

systemctl start sc4s.service
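
On hosts where this recurs, the two manual commands above could be wrapped in a small script. This is a minimal sketch, not part of SC4S: the function name recover_sc4s and the DOCKER_BIN and SYSTEMCTL_BIN override variables are hypothetical, added only so the logic can be tried without touching a real host. It assumes the container is named SC4S and the unit is sc4s.service, as in this report.

```shell
#!/bin/sh
# Sketch: clear a leftover SC4S container, then start the service.
# DOCKER_BIN and SYSTEMCTL_BIN are hypothetical overrides for dry runs.

recover_sc4s() {
    docker_bin="${DOCKER_BIN:-docker}"
    systemctl_bin="${SYSTEMCTL_BIN:-systemctl}"

    # `docker container inspect` exits non-zero when no such container exists,
    # so the removal only runs when a leftover container is actually present.
    if "$docker_bin" container inspect SC4S >/dev/null 2>&1; then
        "$docker_bin" container rm SC4S || return 1
    fi

    "$systemctl_bin" start sc4s.service
}
```

Calling recover_sc4s with no overrides set uses the real docker and systemctl binaries. Unlike a blanket "|| /bin/true", this variant still reports a removal that fails for some other reason, for example when the container is still running.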

This has always happened after power problems or data center (CPD) downtime caused the virtual machine to go down abruptly, so the container was not removed as it normally is when the sc4s service is stopped.

Our workaround is to run those commands, but we have also considered adding the following line to the sc4s.service file, before the docker run line:

ExecStartPre=sh -c 'docker container rm SC4S || /bin/true'

[Unit]
Description=SC4S Container
Wants=NetworkManager.service network-online.target docker.service
After=NetworkManager.service network-online.target docker.service
Requires=docker.service

[Install]
WantedBy=multi-user.target

[Service]
Environment="SC4S_IMAGE=ghcr.io/splunk/splunk-connect-for-syslog/container3:latest"

# Required mount point for syslog-ng persist data (including disk buffer)
Environment="SC4S_PERSIST_MOUNT=splunk-sc4s-var:/var/lib/syslog-ng"

# Optional mount point for local overrides and configurations; see notes in docs
Environment="SC4S_LOCAL_MOUNT=/opt/sc4s/local:/etc/syslog-ng/conf.d/local:z"

# Optional mount point for local disk archive (EWMM output) files
Environment="SC4S_ARCHIVE_MOUNT=/opt/sc4s/archive:/var/lib/syslog-ng/archive:z"

# Optional mount point for custom TLS files
Environment="SC4S_TLS_MOUNT=/opt/sc4s/tls:/etc/syslog-ng/tls:z"

TimeoutStartSec=0

ExecStartPre=/usr/bin/docker pull $SC4S_IMAGE

# Note: /usr/bin/bash will not be valid path for all OS
# when startup fails on running bash check if the path is correct
ExecStartPre=/usr/bin/bash -c "/usr/bin/systemctl set-environment SC4SHOST=$(hostname -s)"

ExecStartPre=sh -c 'docker container rm SC4S || /bin/true'
ExecStart=/usr/bin/docker run \
        -e "SC4S_CONTAINER_HOST=${SC4SHOST}" \
        -v "$SC4S_PERSIST_MOUNT" \
        -v "$SC4S_LOCAL_MOUNT" \
        -v "$SC4S_ARCHIVE_MOUNT" \
        -v "$SC4S_TLS_MOUNT" \
        --env-file=/opt/sc4s/env_file \
        --network host \
        --name SC4S \
        --rm $SC4S_IMAGE

Restart=on-abnormal

To Reproduce

Steps to reproduce the behavior:

  1. Create a container named SC4S, either with docker run or by starting sc4s.service, then power off the VM abruptly (without a clean shutdown).
  2. Try to start or restart sc4s.service. It fails because the name SC4S is already in use.
  3. Modify sc4s.service and add this line before docker run:
    ExecStartPre=sh -c 'docker container rm SC4S || /bin/true'
  4. Start the service again and see that the container starts fine.

ikheifets-splunk commented 1 month ago

@guillerg86 thanks for reporting. It would probably be better to use ExecStop and ExecStopPost instead of ExecStartPre, because with ExecStartPre the first run produces an error about removing a container that doesn't exist, which I think is very strange.

I can propose this solution:

ExecStop=/usr/bin/docker stop SC4S
ExecStopPost=/usr/bin/docker rm SC4S

Alternatively, we can use the restart flag (--restart) for docker run.

guillerg86 commented 1 month ago

The problem with ExecStop is that it never runs when the VM or machine goes down without a clean stop (power outage, etc.). That's why I thought it better to do the removal in ExecStartPre with || /bin/true.

If the container doesn't exist, the || /bin/true fallback lets the startup continue.

I've tested this solution (ExecStartPre) on 5 clients and it works fine. If the VM stops suddenly or because of an outage, the container is not removed on stop, but before executing docker run the service removes the existing container if there is one.
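
The || in the proposed ExecStartPre line is the shell's OR operator: the right-hand command runs only when the left-hand one fails, and the exit status of the whole list is then that of the right-hand command. Below is a minimal sketch of that behavior. The function remove_container is a hypothetical stand-in for docker container rm SC4S so the example runs anywhere, and the builtin true is used in place of /bin/true.

```shell
#!/bin/sh
# Stand-in for `docker container rm SC4S` (hypothetical helper for illustration):
# succeeds only when the "container" is marked as present.
remove_container() {
    [ "$CONTAINER_PRESENT" = "yes" ]
}

# Same shape as the ExecStartPre line: a failed removal is turned into success.
tolerant_remove() {
    remove_container || true
}

CONTAINER_PRESENT=no

# Plain removal fails when nothing exists.
if remove_container; then status_plain=0; else status_plain=$?; fi

# With `|| true` the failure is absorbed and the status is zero.
if tolerant_remove; then status_tolerant=0; else status_tolerant=$?; fi

echo "plain=$status_plain tolerant=$status_tolerant"
```

One consequence of this pattern is that it absorbs every failure of the removal, not only "no such container". systemd also supports a "-" prefix on the executable path in Exec lines, which treats a failing command as success without the sh -c wrapper. Whether that reads better here is a judgment call, and older systemd versions may additionally expect an absolute path to the executable.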

ikheifets-splunk commented 1 month ago

@guillerg86 okay, I will check the abrupt VM stop scenario.

ikheifets-splunk commented 3 weeks ago

@guillerg86 I merged your solution into develop for testing.

guillerg86 commented 3 weeks ago

Thanks!

rjha-splunk commented 3 weeks ago

The solution is excellent and can be added to main @ikheifets-splunk

rjha-splunk commented 4 days ago

It will be added to main with the next release.