varnish / hitch

A scalable TLS proxy by Varnish Software.
https://www.varnish-software.com/

CRITICAL: Hitch crashed production server because of one faulty certificate pem file #369

Open dgaastra opened 2 years ago

dgaastra commented 2 years ago

Expected Behavior

Expected Hitch to just ignore the faulty pem certificate and run happily.

Current Behavior

Mar 17 12:46:36 web2 hitch[2813]: 20220317T124636.810693 [ 2813] {core} hitch 1.6.1 starting
Mar 17 12:46:36 web2 hitch[2813]: 20220317T124636.812323 [ 2813] {core} Loading certificate pem files (11)
Mar 17 12:46:36 web2 systemd[1]: hitch.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ An ExecStart= process belonging to unit hitch.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Mar 17 12:46:36 web2 systemd[1]: hitch.service: Failed with result 'exit-code'.

Possible Solution

Just ignore the faulty pem file but keep on running with the correct ones.

Steps to Reproduce (for bugs)

Put a bogus pem file in the directory that the pem files are read from.

settings in conf file:

pem-dir = "/lego/certificates"
pem-dir-glob = "*.pem"

Context

Very nasty; all production websites down for a while.

Your Environment

Debian; everything fairly up to date. hitch 1.6.1 (installed with: sudo apt install hitch)

If this was fixed after version 1.6.1, we sincerely apologise for this bug report, and, as such, hope Debian will have its packages more up-to-date

Thanks for making such a great piece of software, Dennis Gaastra

gquintard commented 2 years ago

ping @daghf, @Dridi

Keeline commented 2 years ago

We have had this issue from time to time. A partially-created or missing pem file will cause hitch to crash upon restart. Usually this is followed by a scramble to identify the offending line from the service hitch status and comment it out of the hitch.conf and restart hitch.

We have other servers where SSL is terminated with nginx. An nginx -t is fairly robust to check the configuration files and will report on missing or flawed files before we attempt to restart nginx.

The equivalent hitch -t only seems to check that the hitch.conf is syntactically correct. This is only part of the issue. It certainly knows there is a problem when it attempts to restart. Why not some kind of dry run option to prevent problems?

I wrote a small script to at least check and see that the file mentioned in the pem lines exists.

James D. Keeline


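#!/bin/bash

HITCH=/etc/hitch/hitch.conf
ERR=0

# Check that hitch.conf parses.
hitch -t || ERR=1

# Check that every file named on a pem-file line exists.
# (pem-dir and pem-dir-glob lines are skipped; they do not name a single file.)
for PEM in $(grep '^pem-file' "$HITCH" | awk -F'"' '{print $2}')
do
    if [ ! -f "$PEM" ]; then
        echo "$PEM missing"
        ERR=2
    fi
done

if [ $ERR -gt 0 ]; then
    echo "Errors found [$ERR]. Do not restart hitch."
    exit 1
else
    echo "Scan of $HITCH done. It should be OK to restart hitch."
fi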

dgaastra commented 1 year ago

Thanks for the script, but we really need the hitch developers to "Just ignore the faulty pem file but keep on running with the correct ones."

daghf commented 1 year ago

Apologies for taking my time in getting back to you here.

I'm sorry to say I'm struggling to reproduce this - even when trying 1.6.1. Adding bogus files to a pem-dir or adding a pem-file entry pointing at a missing file just yields Config reload failed with the service still running on the previous config.

Any way you could come up with a reproducer?

dgaastra commented 1 year ago

Hi Dag, thanks for looking into this. We have

pem-dir = "/htdocs/admin/lego/certificates"
pem-dir-glob = "*.pem"

Our PEMs are typically in the following format:

-----BEGIN CERTIFICATE-----
C1...
-----END CERTIFICATE-----

-----BEGIN CERTIFICATE-----
C2...
-----END CERTIFICATE-----

-----BEGIN CERTIFICATE-----
C3...
-----END CERTIFICATE-----
-----BEGIN RSA PRIVATE KEY-----
P1
-----END RSA PRIVATE KEY-----
-----BEGIN DH PARAMETERS-----
D1
-----END DH PARAMETERS-----
-----BEGIN DH PARAMETERS-----
D2
-----END DH PARAMETERS-----

Try leaving out one or more of the sections C1-C3, P1, or D1-D2 and see what happens. I don't remember the bogus PEM in detail, but I will take note of it the next time it happens. Maybe try leaving P1 out first.

Thanks so kindly, Dennis
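For anyone using a pem-dir setup like the one above, here is a minimal sketch of a check that reports which blocks are absent from each PEM file before hitch is restarted. The function names pem_missing_sections and check_pem_dir are hypothetical. The check only looks for BEGIN markers, it does not validate the certificate or key contents, and it treats DH parameters as optional, which is an assumption.

```shell
#!/bin/bash
# Sketch only: report which expected PEM blocks are absent from a file.
# Function names are hypothetical; contents are not validated.

pem_missing_sections() {
    local pem="$1" missing=""
    # At least one certificate block is expected.
    grep -q -- '-----BEGIN CERTIFICATE-----' "$pem" 2>/dev/null \
        || missing="$missing certificate"
    # Accept RSA, EC, ENCRYPTED, or plain PKCS#8 private key headers.
    grep -Eq -- '-----BEGIN ([A-Z]+ )?PRIVATE KEY-----' "$pem" 2>/dev/null \
        || missing="$missing private-key"
    # Print the missing section names without the leading space.
    echo "${missing# }"
}

check_pem_dir() {
    # $1: directory to scan, $2: glob (assumed default: *.pem)
    local dir="$1" glob="${2:-*.pem}" pem missing status=0
    for pem in "$dir"/$glob; do
        [ -e "$pem" ] || continue   # the glob matched nothing
        missing="$(pem_missing_sections "$pem")"
        if [ -n "$missing" ]; then
            echo "$pem: missing $missing"
            status=1
        fi
    done
    return $status
}
```

Calling check_pem_dir /htdocs/admin/lego/certificates "*.pem" would print one line per incomplete file and return a non-zero status if any file lacks a certificate or private key block.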

iammeken commented 1 year ago

Normally, I will run

hitch -t --config=/etc/hitch/hitch.conf

to check all certs before reload/restart
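A minimal sketch of wiring that test into the reload step, so the reload only runs when the test command succeeds. The function name reload_if_config_ok, the HITCH_CONF variable, and the default reload command (systemctl reload hitch) are assumptions, not something confirmed in this thread. Whether hitch -t actually loads the certificate files is disputed earlier in the thread, so this gate is only as strong as that test.

```shell
#!/bin/bash
# Sketch only: reload hitch when, and only when, the config test passes.
# HITCH_CONF and the default reload command are assumptions; adjust per host.

HITCH_CONF="${HITCH_CONF:-/etc/hitch/hitch.conf}"

reload_if_config_ok() {
    # $1: command that tests the config (default: hitch -t --config=...)
    # $2: command that reloads the service (assumed default below)
    local test_cmd="${1:-hitch -t --config=$HITCH_CONF}"
    local reload_cmd="${2:-systemctl reload hitch}"
    if $test_cmd; then
        $reload_cmd
    else
        echo "Config test failed; not reloading hitch." >&2
        return 1
    fi
}
```

Running reload_if_config_ok with no arguments uses the defaults; both commands can be passed explicitly on hosts where the service is managed differently.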