Apache configuration is broken on letsencrypt certificate auto renewal

tr3pan commented 4 years ago

I maintain a virtualmin server that host almost 1900 sites and we use letsencrypt to issue certificates. Sometimes the apache process dies, because of configuration errors.

After further investigation and looking in the virtualmin code it seems the problem is race condition. I can't reproduce it all the time, but this is what I think happens.

In the past there were problems with issuing letencrypt certificates with "catch all" redirect on HTTP vhosts. The solution was to remove the redirect during letsencrypt domain verification and add it again back after the certificate is issued. This works fine for one domain. Later automatic renewal was added so the certificates close to expire are renewed automatically.

The problem comes that the automatic renewal is executed when collectinfo.pl script is called. Because sometimes we have a lot of domains to be renewed, the collectinfo.pl script doesn't finish. On next invocation of collectinfo.pl the auto renew starts again and problems arise. Each collectinfo.pl script modifies the apache config and then reloads the process. I notice the problem when apache config test says VirtualHost tag is not closed. Sometimes several lines are missing and the redirect is put back outside of the VirtualHost which makes the redirect global and you can't issue certificate, because of this redirect.

Once I found collectinfo.pl script every 5 minutes by default I set it to 1440 (1 day). Later I found out you can't set it more than 60 minutes. If you set it "Never" the auto renew is not called.

Is it possible to separate autorenew functionality in separate webmin cron or I am missing something obvious for auto renew?

jcameron commented 4 years ago

The next release of Virtualmin should address this, by not attempting multiple renewals at the same time. However, it sounds like that we also need a limit on the number of renewals to attempt per collectinfo run, to avoid a large number stacking up at the same time. I assume this happens on runs just after midnight, when the expiry day for multiple certs arrives?

tr3pan commented 4 years ago

I suppose it is like that, but I can't confirm. It is hard to catch it, because you find out it is already bad when a certificate can't be issued, because of "catch all" redirect placed outside of VirtualHost or the configuration is so bad that is syntactically invalid.

There is another problem that is not directly connected with virtualmin - Let's encrypt rate limits.

For users of the ACME v2 API you can create a maximum of 300 New Orders per account per 3 hours

Since certbot is using only one account for all domains on one server, you could hit this limit.

When certbot package is installed, the package installs cron(systemd.timer) and systemd.service that calls certbot renew. It is highly advisable to stop and disable both of them. If you have "catch all" redirects, the service/cron doesn't do the removal of the redirect that virtualmin script does and the acme challenge fails. This adds up to the "New Orders" quota and very fast you can hit the limit when you have a lot failures. I am not sure how the certificate are transfered to the correct place if they are issued by cron/service.

Another thing that could be improved and again is not connected to virtualmin, but somehow affects it is Let's encrypt certificate requests in /etc/letsencrypt/csr. When you have a lot of domains the number of CSRs grow very fast and this slows down the issuing of the certificate, because certbot is tryinng to generate unique filename by incrementing a number and checking if the file exists. If you have 1000s of CSRs in the directory it takes some time.

virtualmin / virtualmin-gpl

Apache configuration is broken on letsencrypt certificate auto renewal #171