influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Regression: systemd startup script broken for ping-authenticated HTTP endpoint (and 10s race condition) #22110

Open brevilo opened 3 years ago

brevilo commented 3 years ago

It seems c8de72ddbc broke the systemd startup script for authenticated HTTP endpoints.

Steps to reproduce:

  1. Set [http] auth-enabled = true and ping-auth-enabled = true
  2. Run curl against the HTTP endpoint to see if it returns 200. With authentication enabled, an unauthenticated request doesn't.
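
The check in step 2 can be reproduced by hand. The following is a minimal sketch, not the packaged startup script: `http_status` and `explain_status` are hypothetical helper names, and the mapping of 401/403 to "auth-required" is an assumption about how a rejected unauthenticated probe shows up.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical, not part of InfluxDB.

# Print the HTTP status code for a URL (curl prints 000 if it cannot connect).
http_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Translate a status code into what it means for a readiness check.
explain_status() {
    case "$1" in
        200|204) echo "ready" ;;
        401|403) echo "auth-required" ;;   # server is up, probe lacked credentials
        000)     echo "unreachable" ;;     # nothing answered on that address yet
        *)       echo "unexpected" ;;
    esac
}
```

For example, `explain_status "$(http_status http://localhost:8086/health)"` probes the URL quoted later in this thread. A check that only accepts 200 cannot tell "auth-required" apart from "unreachable", which is the failure described here.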

Expected behavior: influxd launching without failure

Actual behavior:

Environment info: InfluxDB 1.8.9

Potential fixes and improvements:

brevilo commented 3 years ago

Also, the 10-second check timeout is a race condition and thus can and does fail on slower systems, e.g. startup during boot on a Raspberry Pi. If this check is really necessary, the timeout should be increased.

twaymouth commented 3 years ago

> Also, the 10-second check timeout is a race condition and thus can and does fail on slower systems, e.g. startup during boot on a Raspberry Pi. If this check is really necessary, the timeout should be increased.

I noticed this exact issue after upgrading InfluxDB on my Raspberry Pi and had to increase the timeout in order for it to start. It definitely needs to be increased.

erkexzcx commented 3 years ago

I've come to this issue because of a systemd service restart loop (caused by systemctl restart influxdb), followed by the message:

"Failed to reach influxdb http endpoint" at http://localhost:8086/health

After systemctl kill influxdb followed by systemctl start influxdb, InfluxDB started successfully.

Apollon77 commented 3 years ago

Exactly, I ran into the 10s issue. My DB is rather big. I have now changed the sleep to 10s, which helped me. I would propose such a change.

bolausson commented 3 years ago

> Exactly, I ran into the 10s issue. My DB is rather big. I have now changed the sleep to 10s, which helped me. I would propose such a change.

Same here. My database is not huge, but it runs on a somewhat overloaded Raspberry Pi 4. My workaround was changing the sleep time in the while [ "$result" != "200" ]; do loop from 1 second to 30 seconds (I didn't do any scientific evaluation of how long startup actually takes).

This happened when updating from 1.8.7 to 1.8.9.

I would argue that adding a "time-based determination" of whether the service started successfully is at least questionable, as the many comments from people running into this issue suggest. It is up to the user/admin to decide how long they are willing to wait for a service to come online. If a timeout exists at all, it should default to a ridiculously large value, and it must be a user-defined variable that does not get overwritten by the next upgrade.
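
A user-defined wait along those lines might look like the minimal sketch below. It is not the packaged script: `STARTUP_TIMEOUT` and `wait_until` are hypothetical names, and the 600-second default is an arbitrary placeholder for "a large value".

```shell
#!/bin/sh
# Sketch only: STARTUP_TIMEOUT and wait_until are hypothetical names,
# not part of the packaged InfluxDB startup script.

# Take the limit from the environment; fall back to a generous default.
STARTUP_TIMEOUT="${STARTUP_TIMEOUT:-600}"

# wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Re-run COMMAND every INTERVAL seconds until it succeeds (returns 0),
# or give up once about TIMEOUT seconds of sleeping have accumulated (returns 1).
wait_until() {
    wu_timeout=$1
    wu_interval=$2
    shift 2
    wu_elapsed=0
    until "$@"; do
        if [ "$wu_elapsed" -ge "$wu_timeout" ]; then
            return 1
        fi
        sleep "$wu_interval"
        wu_elapsed=$((wu_elapsed + wu_interval))
    done
    return 0
}
```

A call such as `wait_until "$STARTUP_TIMEOUT" 1 curl -sf -o /dev/null http://localhost:8086/health` would then poll once per second. Because the limit is read from the environment rather than hard-coded in the script, a package upgrade that replaces the script would not reset it, provided the variable is set somewhere the upgrade does not touch.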

brevilo commented 3 years ago

I guess I should have opened a separate issue for the race condition ;) I'll keep it here now because of the comments so far. Changed the title accordingly...