influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Default unix socket fails to initialize on modern Linux installations #24343

Open bijwaard opened 1 year ago

bijwaard commented 1 year ago

Steps to reproduce:

  1. Log in to an Ubuntu or Armbian Linux box
  2. Set unix-socket-enabled to true
  3. Restart influxdb

Expected behavior: I expected influxdb to start and listen on the socket. I also expected the socket to be opened r/w for user and group only, not r/w for others.

Actual behavior: The unix server socket could not be initialized, since /var/run is not writable by the influxdb user for security reasons. The rest of influxdb also stops functioning, while the startup wrapper keeps retrying to connect. When a socket is opened, it is opened r/w for user, group and others. It would be more secure to disable write access for others, so that access to the socket can be controlled by adding users to the influxdb group in /etc/group.

Normal practice is to use a subfolder of /var/run, e.g. /var/run/influxdb, and have the startup wrapper (systemd or init.d) create it owned by user:group influxdb:influxdb.

When I configure influxdb to use this folder, the socket is opened r/w for others, which gives all local users access to the socket:

% ls -lsa /var/run/influxdb
total 0
0 drwxr-xr-x  2 influxdb influxdb  60 Aug 16 09:54 .
0 drwxr-xr-x 27 root     root     800 Aug 16 09:33 ..
0 srwxrwxrwx  1 influxdb influxdb   0 Aug 16 09:54 influxdb.sock
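
The world-accessible mode in the listing above can also be detected from a script. The following is a minimal sketch, assuming GNU stat (the -c option) is available; the helper name others_can_access is hypothetical and not part of the influxdb package:

```shell
#!/bin/sh
# Hypothetical helper: succeed (return 0) if "others" have any permission
# bit set on the given path. Relies on GNU stat's -c '%a' (octal mode).
others_can_access() {
  mode=$(stat -c '%a' "$1") || return 2
  # The last octal digit holds the permission bits for "others".
  last=${mode#"${mode%?}"}
  [ "$last" != "0" ]
}

# Example, not executed here:
#   others_can_access /var/run/influxdb/influxdb.sock && echo "socket is open to all local users"
```

As a stopgap, a chmod 660 on the socket after startup would restrict it to user and group, but it would likely have to be repeated after every restart, since the socket is created again when influxdb binds to it.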

The path /var/run/influxdb.sock appears to be the default in the configuration as well as in the code.

Environment info:

Config:

influxdb.conf (default values are commented out with #):

  # Enable http service over unix domain socket
  # unix-socket-enabled = false
  unix-socket-enabled = true
  # The path of the unix domain socket.
  # bind-socket = "/var/run/influxdb.sock"

Logs:

Aug 16 07:13:38 bullseyeTestCT influxd-systemd-start.sh[205724]: ts=2023-08-16T07:13:38.950565Z lvl=info msg="Starting HTTP service" log_id=0jgIvj6G000 service=httpd authentication=false
Aug 16 07:13:38 bullseyeTestCT influxd-systemd-start.sh[205724]: ts=2023-08-16T07:13:38.951209Z lvl=info msg="Listening on HTTP" log_id=0jgIvj6G000 service=httpd addr=[::]:8086 https=false
Aug 16 07:13:38 bullseyeTestCT influxd-systemd-start.sh[205724]: run: open server: open service: listen unix /var/run/influxdb.sock: bind: permission denied
Aug 16 07:13:39 bullseyeTestCT influxd-systemd-start.sh[205723]: InfluxDB API unavailable after 6 attempts...
Aug 16 07:13:40 bullseyeTestCT influxd-systemd-start.sh[205723]: InfluxDB API unavailable after 7 attempts...
Aug 16 07:13:41 bullseyeTestCT influxd-systemd-start.sh[205723]: InfluxDB API unavailable after 8 attempts...
Aug 16 07:13:42 bullseyeTestCT influxd-systemd-start.sh[205723]: InfluxDB API unavailable after 9 attempts...
Aug 16 07:13:43 bullseyeTestCT telegraf[1368]: 2023-08-16T07:13:43Z E! [outputs.influxdb] When writing to [http://localhost:8086]: failed doing req: Post "http://localhost:8086/write?db=telegraf": dial tcp [::1]:8086: connect: connection refused
Aug 16 07:13:43 bullseyeTestCT telegraf[1368]: 2023-08-16T07:13:43Z E! [agent] Error writing to outputs.influxdb: could not write any address
Aug 16 07:13:43 bullseyeTestCT influxd-systemd-start.sh[205723]: InfluxDB API unavailable after 10 attempts...
Aug 16 07:13:44 bullseyeTestCT influxd-systemd-start.sh[205723]: InfluxDB API unavailable after 11 attempts...
Aug 16 07:13:45 bullseyeTestCT influxd-systemd-start.sh[205723]: InfluxDB API unavailable after 12 attempts...
Aug 16 07:13:46 bullseyeTestCT influxd-systemd-start.sh[205723]: InfluxDB API unavailable after 13 attempts...
Aug 16 07:13:47 bullseyeTestCT influxd-systemd-start.sh[205723]: InfluxDB API unavailable after 14 attempts...
Aug 16 07:13:48 bullseyeTestCT influxd-systemd-start.sh[205723]: InfluxDB API unavailable after 15 attempts...
Aug 16 07:13:50 bullseyeTestCT influxd-systemd-start.sh[205723]: InfluxDB API unavailable after 16 attempts...
Aug 16 07:13:51 bullseyeTestCT influxd-systemd-start.sh[205723]: InfluxDB API unavailable after 17 attempts...
Aug 16 07:13:52 bullseyeTestCT influxd-systemd-start.sh[205723]: InfluxDB API unavailable after 18 attempts...
Aug 16 07:13:53 bullseyeTestCT telegraf[1368]: 2023-08-16T07:13:53Z E! [outputs.influxdb] When writing to [http://localhost:8086]: failed doing req: Post "http://localhost:8086/write?db=telegraf": dial tcp [::1]:8086: connect: connection refused
Aug 16 07:13:53 bullseyeTestCT telegraf[1368]: 2023-08-16T07:13:53Z E! [agent] Error writing to outputs.influxdb: could not write any address
Aug 16 07:13:54 bullseyeTestCT influxd-systemd-start.sh[205723]: InfluxDB API unavailable after 20 attempts...
Aug 16 07:13:55 bullseyeTestCT influxd-systemd-start.sh[205723]: InfluxDB API unavailable after 21 attempts...
Aug 16 07:13:56 bullseyeTestCT influxd-systemd-start.sh[205723]: InfluxDB API unavailable after 22 attempts...
Aug 16 07:13:57 bullseyeTestCT influxd-systemd-start.sh[205723]: InfluxDB API unavailable after 23 attempts...

$ sudo systemctl restart influxdb.service

Aug 16 07:13:57 bullseyeTestCT sudo[205851]:   dennis : TTY=pts/0 ; PWD=/home/dennis ; USER=root ; COMMAND=/usr/bin/systemctl restart influxdb.service
Aug 16 07:13:57 bullseyeTestCT sudo[205851]: pam_unix(sudo:session): session opened for user root(uid=0) by dennis(uid=1000)
Aug 16 07:13:57 bullseyeTestCT systemd[1]: influxdb.service: Succeeded.
Aug 16 07:13:57 bullseyeTestCT systemd[1]: Stopped InfluxDB is an open-source, distributed, time series database.
Aug 16 07:13:57 bullseyeTestCT systemd[1]: influxdb.service: Consumed 9.592s CPU time.
Aug 16 07:13:57 bullseyeTestCT systemd[1]: Starting InfluxDB is an open-source, distributed, time series database...
Aug 16 07:13:57 bullseyeTestCT influxd-systemd-start.sh[205855]: ts=2023-08-16T07:13:57.949734Z lvl=info msg="InfluxDB starting" log_id=0jgIxEkG000 version=1.8.10 branch=1.8 commit=688e697c51fd
Aug 16 07:13:57 bullseyeTestCT influxd-systemd-start.sh[205855]: ts=2023-08-16T07:13:57.949992Z lvl=info msg="Go runtime" log_id=0jgIxEkG000 version=go1.13.8 maxprocs=2
Aug 16 07:13:57 bullseyeTestCT influxd-systemd-start.sh[205857]: Merging with configuration at: /etc/influxdb/influxdb.conf
Aug 16 07:13:58 bullseyeTestCT influxd-systemd-start.sh[205855]: ts=2023-08-16T07:13:58.056049Z lvl=info msg="Using data dir" log_id=0jgIxEkG000 service=store path=/var/lib/influxdb/data
Aug 16 07:13:58 bullseyeTestCT influxd-systemd-start.sh[205855]: ts=2023-08-16T07:13:58.056273Z lvl=info msg="Compaction settings" log_id=0jgIxEkG000 service=store max_concurrent_compactions=1 throughput_bytes_per_second=1048576 throughput_bytes_per_second_burst=10485760
Aug 16 07:13:58 bullseyeTestCT influxd-systemd-start.sh[205855]: ts=2023-08-16T07:13:58.056401Z lvl=info msg="Open store (start)" log_id=0jgIxEkG000 service=store trace_id=0jgIxFA0000 op_name=tsdb_open op_event=start
Aug 16 07:13:58 bullseyeTestCT influxd-systemd-start.sh[205871]: Merging with configuration at: /etc/influxdb/influxdb.conf
Aug 16 07:13:58 bullseyeTestCT influxd-systemd-start.sh[205854]: InfluxDB API unavailable after 1 attempts...
... snapshots e.g.
Aug 16 07:14:03 bullseyeTestCT influxd-systemd-start.sh[205855]: ts=2023-08-16T07:14:03.994739Z lvl=info msg="Starting snapshot service" log_id=0jgIxEkG000 service=snapshot
Aug 16 07:14:03 bullseyeTestCT influxd-systemd-start.sh[205855]: ts=2023-08-16T07:14:03.995485Z lvl=info msg="Starting continuous query service" log_id=0jgIxEkG000 service=continuous_querier
Aug 16 07:14:03 bullseyeTestCT influxd-systemd-start.sh[205855]: ts=2023-08-16T07:14:03.995772Z lvl=info msg="Starting HTTP service" log_id=0jgIxEkG000 service=httpd authentication=false
Aug 16 07:14:03 bullseyeTestCT influxd-systemd-start.sh[205855]: ts=2023-08-16T07:14:03.996670Z lvl=info msg="Listening on HTTP" log_id=0jgIxEkG000 service=httpd addr=[::]:8086 https=false
Aug 16 07:14:04 bullseyeTestCT influxd-systemd-start.sh[205854]: InfluxDB API unavailable after 6 attempts...
Aug 16 07:14:05 bullseyeTestCT influxd-systemd-start.sh[205854]: InfluxDB API unavailable after 7 attempts...
Aug 16 07:14:06 bullseyeTestCT influxd-systemd-start.sh[205854]: InfluxDB API unavailable after 8 attempts...
Aug 16 07:14:07 bullseyeTestCT influxd-systemd-start.sh[205854]: InfluxDB API unavailable after 9 attempts...
Aug 16 07:14:08 bullseyeTestCT influxd-systemd-start.sh[205854]: InfluxDB API unavailable after 10 attempts...
Aug 16 07:14:09 bullseyeTestCT influxd-systemd-start.sh[205854]: InfluxDB API unavailable after 11 attempts...
Aug 16 07:14:10 bullseyeTestCT influxd-systemd-start.sh[205854]: InfluxDB API unavailable after 12 attempts...
...
Aug 16 07:14:04 bullseyeTestCT influxd-systemd-start.sh[205855]: run: open server: open service: listen unix /var/run/influxdb.sock: bind: permission denied
Aug 16 07:15:34 bullseyeTestCT influxd-systemd-start.sh[206227]: run: open server: open service: listen unix /var/run/influxdb.sock: bind: permission denied

bijwaard commented 1 year ago

It looks like the /var/run/influxdb folder is created in the /etc/init.d/influxdb start script, but for some reason that code is not reached during startup. I moved it up to the beginning of the script, just after USER and GROUP are initialized:

USER=influxdb
GROUP=influxdb

# pid file for the daemon
pidfile=/var/run/influxdb/influxd.pid
piddir=$(dirname $pidfile)

if [ ! -d "$piddir" ]; then
    mkdir -p $piddir
    chown $USER:$GROUP $piddir
fi

The socket file is now created when running /etc/init.d/influxdb start, but the timeout is still exceeded, since the startup test only verifies the HTTP socket, not the unix socket:

% sudo /etc/init.d/influxdb start 
Starting influxdb (via systemctl): influxdb.serviceJob for influxdb.service failed because a timeout was exceeded.
See "systemctl status influxdb.service" and "journalctl -xe" for details.
% systemctl status influxd
● influxdb.service - InfluxDB is an open-source, distributed, time series datab>
     Loaded: loaded (/lib/systemd/system/influxdb.service; enabled; vendor pres>
     Active: active (running) since Mon 2023-09-25 08:40:41 UTC; 24s ago
       Docs: https://docs.influxdata.com/influxdb/
    Process: 23700 ExecStart=/usr/lib/influxdb/scripts/influxd-systemd-start.sh>
   Main PID: 23701 (influxd)
      Tasks: 10 (limit: 999)
     Memory: 124.0M
        CPU: 9.286s
     CGroup: /system.slice/influxdb.service
             └─23701 /usr/bin/influxd -config /etc/influxdb/influxdb.conf

Unfortunately, this does not work properly when running systemctl start influxd. The creation of the /var/run/influxdb folder needs to be configured the systemd way in /etc/systemd/system/influxd.service, using RuntimeDirectory:

[Service]
User=influxdb
Group=influxdb
LimitNOFILE=65536
EnvironmentFile=-/etc/default/influxdb
ExecStart=/usr/lib/influxdb/scripts/influxd-systemd-start.sh
KillMode=control-group
Restart=on-failure
Type=forking
PIDFile=/var/lib/influxdb/influxd.pid
RuntimeDirectory=influxdb

I guess the PIDFile should preferably also be in the /var/run/influxdb folder (see also #22564), since otherwise it could still be present in /var/lib/influxdb after a (forced) reboot.

Unfortunately, the service is still killed because the startup timeout is exceeded, so the startup check needs to be fixed as well.
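
The RuntimeDirectory setting above could also be applied through a drop-in instead of editing the unit file itself. The following is a minimal sketch; the helper name write_runtime_dir_dropin, the file name runtime-dir.conf and the 0750 mode are assumptions, not package defaults. RuntimeDirectoryMode=0750 keeps other local users out of the folder, and with it away from the socket, regardless of the socket's own mode:

```shell
#!/bin/sh
# Hypothetical helper: write a systemd drop-in that lets systemd create
# /run/influxdb, owned by the unit's User/Group, on every service start.
# The file name and the 0750 mode are assumptions chosen for illustration.
write_runtime_dir_dropin() {
  dropin_dir=$1
  mkdir -p "$dropin_dir" || return 1
  cat > "$dropin_dir/runtime-dir.conf" <<'EOF'
[Service]
RuntimeDirectory=influxdb
RuntimeDirectoryMode=0750
EOF
}

# Example (as root), not executed here:
#   write_runtime_dir_dropin /etc/systemd/system/influxdb.service.d
#   systemctl daemon-reload
```

Local users that need the socket would then have to be members of the influxdb group to get into the folder.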

bijwaard commented 1 year ago

It seems to work with the following change to /usr/lib/influxdb/scripts/influxd-systemd-start.sh, which also tests for the availability of the socket file:

socket="/var/run/influxdb/influxdb.sock"
# Keep polling until the socket file exists and the HTTP API answers 20x or 40x.
while [ ! -S "$socket" ] || { [ "${result:0:2}" != "20" ] && [ "${result:0:2}" != "40" ]; }; do
  attempts=$((attempts+1))
  echo "InfluxDB API unavailable after $attempts attempts..."
  sleep 1
  result=$(curl -k -s -o /dev/null "$url" -w '%{http_code}')
done
echo "InfluxDB started"
echo "InfluxDB started"

Alternatively, the health could be checked on the unix socket with:

% curl --unix-socket /var/run/influxdb/influxdb.sock -k -s -o /dev/null http://localhost/health -w %{http_code} 
200%
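
The socket-based health check above could be turned into a bounded wait loop for the startup script. The following is a minimal sketch, assuming curl is built with --unix-socket support; the function name wait_for_influx_socket and the default of 30 attempts are hypothetical. Like the loop in the start script, it accepts 20x and 40x status codes:

```shell
#!/bin/sh
# Hypothetical helper: poll /health over the unix socket until it answers
# with a 20x or 40x status, or give up after max_attempts tries.
wait_for_influx_socket() {
  socket=$1
  max_attempts=${2:-30}
  attempts=0
  while [ "$attempts" -lt "$max_attempts" ]; do
    if [ -S "$socket" ]; then
      # When the request fails, treat it as "no answer yet".
      code=$(curl --unix-socket "$socket" -s -o /dev/null \
        -w '%{http_code}' http://localhost/health) || code=000
      case $code in
        20?|40?) return 0 ;;
      esac
    fi
    attempts=$((attempts + 1))
    echo "InfluxDB socket unavailable after $attempts attempts..."
    sleep 1
  done
  return 1
}

# Example, not executed here:
#   wait_for_influx_socket /var/run/influxdb/influxdb.sock 30 && echo "InfluxDB started"
```

Unlike the unbounded while loop, this version returns a non-zero status once the attempts are used up, so the caller can decide how to fail.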