Server problems, NWBib downtime

hbz / lobid

Linking Open Bibliographic Data

https://lobid.org/

Eclipse Public License 2.0

Server problems, NWBib downtime #318

Closed acka47 closed 7 years ago

acka47 commented 7 years ago

This is the second time in two days that we have received this for all our servers:

Check disk - ALL
State: CRITICAL

Additional Info:

CHECK_NRPE: Socket timeout after 10 seconds.

The last one affected is emphytos and the last notification read:

Notification Type: PROBLEM

Service: Procs httpd
Host: emphytos-lobid
Address: 193.30.112.187
State: CRITICAL

Date/Time: Wed Oct 12 18:39:42 CEST 2016

Additional Info:

PROCS CRITICAL: 89 processes with command name 'httpd2-prefork'

The warnings came in today between 17:55 and 18:39. I just tried our services at 18:45: lobid was OK, NWBib was down (and is now up again). I guess the recovery notifications will arrive soon. Nonetheless, we will probably have to do something about this.

acka47 commented 7 years ago

@dr0i assumes that #291 might prevent similar problems in the future.

dr0i commented 7 years ago

According to the file server-tuning.conf, and given that we use the prefork MPM, MaxClients is set to 500, so the mentioned 89 processes shouldn't be critical. But maybe 500 is way too much for a 1 GB machine. The new machine has 4 GB, so it is expected to run better.
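
As a rough illustration of the sizing concern above: with the prefork MPM every client is served by its own process, so the worker limit should fit into RAM. The sketch below estimates how many workers fit on a 1 GB and a 4 GB machine. The per-worker size (20 MB) and the memory reserved for the rest of the system (256 MB) are made-up figures, not measurements from our servers, and the function name is hypothetical.

```python
def max_prefork_workers(total_ram_mb, reserved_mb, per_worker_mb):
    """Rough upper bound on prefork workers that fit in RAM without swapping."""
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    # Memory left for Apache after the OS and other services take their share.
    usable_mb = max(total_ram_mb - reserved_mb, 0)
    return usable_mb // per_worker_mb


# Assumed figures for illustration only (not measured on emphytos).
PER_WORKER_MB = 20   # hypothetical resident size of one httpd2-prefork process
RESERVED_MB = 256    # hypothetical memory kept free for everything else

for ram_mb in (1024, 4096):
    limit = max_prefork_workers(ram_mb, RESERVED_MB, PER_WORKER_MB)
    print(f"{ram_mb} MB RAM: about {limit} workers fit")
```

Under these assumed figures, neither machine could hold anywhere near 500 workers in RAM, so the real per-process size would have to be measured before picking a MaxClients value.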

dr0i commented 7 years ago

Since migrating to the stronger server the CRITICAL state wasn't experienced anymore. Reopen this issue if it happens again. For now: closing.

acka47 commented 7 years ago

Since migrating to the stronger server the CRITICAL state wasn't experienced anymore.

This is not true. I received a Nagios mail on 2016-11-03, 09:42:

* Nagios *

Notification Type: PROBLEM

Service: Procs httpd
Host: emphytos-lobid
Address: 193.30.112.187
State: CRITICAL

Date/Time: Thu Nov 3 09:42:46 CET 2016

Additional Info:

NRPE: Command 'check_procs_apache' not defined

Nonetheless, I am fine with keeping this issue closed.

dr0i commented 7 years ago

See https://github.com/hbz/lobid/issues/291#issuecomment-258197891, which was closed 15 days ago (that is, on 2016-11-03). The server booted on that day at 09:13. Some modifications to the configs were made. I am pretty sure that the particular log entry you cited was just teething trouble.