icing / mod_md

Let's Encrypt (ACME) in Apache httpd
https://icing.github.io/mod_md/
Apache License 2.0

Problems when using mod_md "clustered" with a shared storage directory on NFS #292

Closed: moschlar closed this issue 21 hours ago

moschlar commented 2 years ago

We keep seeing problems with the renewal of certificates on our four-node Apache httpd "cluster" using mod_md (with ~1000 managed domains) and a shared storage directory on NFSv3.

The node -01 is configured with

export MDRenewWindow="33%"
export MDRenewMode="auto"

so it could be called "primary".

And the other nodes are configured with

export MDRenewWindow="30%"
export MDRenewMode="manual"
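
Assuming these exported variables are simply expanded into the corresponding mod_md directives in the httpd configuration (an assumption about this particular setup), the effective per-node settings are roughly:

    # node -01 ("primary"): renews certificates itself when inside the renewal window
    MDRenewWindow 33%
    MDRenewMode auto

    # the other three nodes ("secondary"): never renew on their own, only activate staged certs on reload
    MDRenewWindow 30%
    MDRenewMode manual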

I suspect that it happens when one of the secondary nodes reloads/restarts httpd first (I think maybe due to logrotate), before the primary node gets a chance to, and the state somehow gets messed up. (Or it might be that the logrotate runs happen almost simultaneously...)

From the Apache error logs, taking the domain https://crt.sh/?q=infosys.informatik.uni-mainz.de as an example:

Node -01:

/var/log/apache2/old/error.log-20220710.gz:[Sat Jul 09 17:32:50.738944 2022] [md:notice] [pid 3548953:tid 140380633847552] AH10059: The Managed Domain infosys.informatik.uni-mainz.de has been setup and changes will be activated on next (graceful) server restart.

Node -02:

/var/log/apache2/old/error.log-20220711:[Sun Jul 10 00:04:45.042386 2022] [md:error] [pid 4077496:tid 139884661714240] (17)File exists: AH10069: infosys.informatik.uni-mainz.de: error loading staged set

Now to come to the actual issue/request: this does not show up on the md-status monitoring page; there, all domains are listed and seem fine...

Or do you have any other ideas, or shall I look for additional clues?

icing commented 2 years ago

In such a setup, with a shared fs, you are asking for trouble when you reload 2 or more cluster nodes at the same time. All reloading instances will try to activate the newly staged certificates and stumble over each other. Now, when I say "stumble", this means error messages like the one you see in the log.

The activation of the new cert set, from staging to domains, is done as atomically as possible, using directory moves in the file system. If both nodes try this at the same time, one might fail and log the error you reported. However, the directory in domains should still be fine and have the correct content. That is probably why you do not see any errors on the md-status page.

I'd recommend reloading one cluster node first and then the others. It does not matter which one. Or you use MDMessageCmd to trigger a job with sufficient privileges to move the directories yourself and then reload the cluster.

Some people also live without a shared filesystem and use MDMessageCmd to copy files on renewal, and even prevent renewals from happening on all but a single cluster node.
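
A minimal sketch of that idea with a hypothetical helper script; the script path, host names and store paths are illustrative, not part of mod_md:

    # httpd.conf on the renewing node:
    MDMessageCmd /usr/local/bin/md-event.sh

    #!/bin/bash
    # /usr/local/bin/md-event.sh -- mod_md appends the event reason and the
    # domain name as arguments (see the MDMessageCmd docs for the exact calling convention).
    reason="$1"
    domain="$2"
    case "$reason" in
      renewed)
        # push the freshly staged certificate data to the other cluster nodes
        for host in node-02 node-03 node-04; do
          rsync -a "/etc/apache2/md/staging/$domain/" "$host:/etc/apache2/md/staging/$domain/"
        done
        ;;
    esac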

moschlar commented 2 years ago

Yeah, I actually already took many measures to spread out reloads in general; just when writing this up, it occurred to me that it might be the logrotate job that triggers these reloads so close to one another. I've spread them out now.

However, the directories in domains were really not ok. I've already seen ones that only had the fallback key and cert files, and today some still had the proper pubcert.pem but without the corresponding key. Sometimes there is no job.json, but given that in this state the final state according to job.json really doesn't match reality anyway, that's forgivable.

So maybe there is something you could tweak there after all ;-)

I don't really like the solutions that (ab)use MDMessageCmd for something other than notifications, though I'm not quite sure what my issue with that really is. I sense that this would be your recommendation for building a stable clustered solution, or would you go a totally different way?

icing commented 2 years ago

Looking at my code in this light again, the overall strategy on a start/reload is:

  1. look if there is a staging/mydomain with all data needed
  2. copy over all data to tmp/mydomain by reading (parsing) and writing
  3. move domains/mydomain to archive/mydomain.n
  4. move tmp/mydomain to domains/mydomain

This is all fine on a single host and survives aborted restarts quite well.
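
Expressed as a rough shell analogue (just the file-system effect, not the actual module code; the archive number is picked by the module, here hard-coded):

    # Sketch of what a start/reload effectively does for one managed domain
    MD=mydomain
    [ -d "staging/$MD" ] || exit 0            # 1. is there a complete staged set?
    rm -rf "tmp/$MD" && mkdir -p "tmp/$MD"
    cp -a "staging/$MD/." "tmp/$MD/"          # 2. rewrite the staged data into tmp
    mv "domains/$MD" "archive/$MD.1"          # 3. move the active set to the archive
    mv "tmp/$MD" "domains/$MD"                # 4. activate the new set with a rename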

However, on a cluster with a shared file system, several nodes may be working on the same tmp/mydomain, and when one node moves it, it may be incomplete because another node is still messing with it.

The best approach here, without some cluster-wide locking, is probably to have tmp/mydomain on a local file system. That may still give trouble if steps 1-4 are interleaved on several cluster nodes, but at least the produced domains/mydomain directory would be complete.

The only safe way I can think of would require some cluster sync, like holding some lock while processing a staged domain. Which leads us back to MDMessageCmd with a new pre-install action.

There is file locking in the Apache runtime, but I do not know if/how that translates to your NFSv3. Any ideas?
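
One way to probe that from the shell, if it helps (the lock file path is just an example):

    # On node -01: take an exclusive lock on a file in the shared store and hold it
    flock -x /etc/apache2/md/test.lock -c 'echo locked; sleep 30'

    # Meanwhile on node -02: with -n this fails immediately if the lock
    # is actually visible across the NFS mount
    flock -xn /etc/apache2/md/test.lock -c 'echo got the lock' || echo 'lock is held elsewhere'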

icing commented 2 years ago

It would be nice if you could try v2.4.18 with the new MDStoreLocks directive. If that works nicely in your setup, maybe we could also add such locking for renewal attempts.
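
For illustration, minimal use of the directive might look like this (the wait time is just an example value):

    # Lock the (shared) MD store while a starting/reloading server activates
    # staged certificates; wait up to 5 seconds to obtain the lock
    MDStoreLocks 5s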

icing commented 2 years ago

@moschlar maybe this escaped your notice. Could you test if the new version addresses the restarting in your cluster?

moschlar commented 2 years ago

Hey @icing,

On Friday, 29 Jul 2022 at 02:40 -0700, Stefan Eissing wrote:

> @moschlar maybe this escaped your notice. Could you test if the new version addresses the restarting in your cluster?

Sorry, no. I did see it, but there have been many other projects that needed attention, especially with summer vacation time...

I explicitly left the GitHub notification open so I can come back to it when I have the time.

whereisaaron commented 2 years ago

Apache 2.4.54 only ships with v2.4.17, but I would love to test MDStoreLocks when it drops. I am assuming the approach used is compatible with NFSv4's native file locking? In that I gather flock() in newer kernels supports NFSv4 file locks?

Q: From the docs I take it the MDStoreLocks time will potentially block a graceful restart, such that new requests will be blocked for (up to) that time? But that only affects simultaneously restarting nodes that do not gain the lock first?

Q: Would/should nodes that were not restarting, and thus did not activate the staged certificate, notice that the domains directory has a new cert they are not using (or notice a readable timestamp file), and issue an MDMessageCmd to that effect? The user could then implement some variety of random back-off to locally restart the node. Or, where there are only 2-3 nodes, each could just restart immediately, since one node has already finished restarting.

icing commented 2 years ago

> Apache 2.4.54 only ships with v2.4.17, but I would love to test MDStoreLocks when it drops. I am assuming the approach used is compatible with NFSv4's native file locking? In that I gather flock() in newer kernels supports NFSv4 file locks?

It sounds like it, but I cannot verify.

> Q: From the docs I take it the MDStoreLocks time will potentially block a graceful restart, such that new requests will be blocked for (up to) that time? But that only affects simultaneously restarting nodes that do not gain the lock first?

Yes, that is how it is intended to work.

> Q: Would/should nodes that were not restarting, and thus did not activate the staged certificate, notice that the domains directory has a new cert they are not using (or notice a readable timestamp file), and issue an MDMessageCmd to that effect? The user could then implement some variety of random back-off to locally restart the node. Or, where there are only 2-3 nodes, each could just restart immediately, since one node has already finished restarting.

I do not see a nice way to make that happen. The node has already read domains, and to detect that some other node changed it, it would need to assess it again (but when exactly?).

If you want a way to detect that your nodes are not all using the same certificate, you might want to assess the md-status handler on each node. That can give you JSON data with information about the certificate used.
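
For illustration, exposing and querying the handler per node could look roughly like this (the location path and access rule are just examples):

    # httpd.conf on each node: expose mod_md's status handler, restricted to admins
    <Location "/md-status">
        SetHandler md-status
        Require ip 192.0.2.0/24
    </Location>

    # from a monitoring host, ask each node directly which certificates it serves:
    curl -s https://node-01.example.org/md-status
    curl -s https://node-02.example.org/md-status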

whereisaaron commented 2 years ago

Thank you for the answers @icing! I'm hoping for a way for nodes to detect and restart themselves purely by observing the shared filesystem, without knowledge of or network access to the other nodes.

I was thinking that if the node installing a new certificate also touched a world-readable empty file somewhere in the shared filesystem (e.g. /etc/apache2/md/last_install), then mod_md on the other nodes could periodically check whether the date/time of that world-readable file is newer than their own start time and, if so, invoke MDMessageCmd restart-required. The command invoked could do the same as it would for MDMessageCmd renewed; it just wouldn't know which domain(s) were renewed.
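
A rough sketch of that idea as a cron job (entirely hypothetical; neither the marker file nor a restart-required event exist in mod_md today, and the paths are made up):

    #!/bin/bash
    # Runs periodically on every node: reload httpd if the shared marker file
    # (touched by whichever node installed new certificates) is newer than the
    # local reload stamp.
    MARKER=/etc/apache2/md/last_install
    STAMP=/var/run/apache2/md-reload.stamp
    [ -f "$MARKER" ] || exit 0
    if [ ! -f "$STAMP" ] || [ "$MARKER" -nt "$STAMP" ]; then
        sleep $((RANDOM % 60))              # crude random back-off between nodes
        systemctl reload apache2 && touch "$STAMP"
    fi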

BTW, is MDMessageCmd installed called during the MDStoreLocks locked period, or only after the lock is released?

icing commented 21 hours ago

Closed as being stale.