ibotty / openshift-letsencrypt

MIT License
59 stars, 31 forks

Reset dehydrated lockfile (get_certificate.pid) upon container startup #24

Closed · jtcressy closed this 7 years ago

jtcressy commented 7 years ago

Hi again! Your application has been working perfectly on two OpenShift clusters I maintain, with one small quirk that I've been able to work around until now.

The get_certificate.pid file generated by dehydrated often cannot be updated upon container startup when the underlying volume is mounted over NFS. The container functions normally after manually deleting the file, so this change automatically clears the stale lock at startup. The cron container still abides by the watcher's status after the lockfile is recreated.

I discovered this problem when restarting or recreating pods. The lock file would not be deleted, and both the cron and watcher containers would fail while processing each route with:

/usr/libexec/letsencrypt-container/get_certificate: line 79: /var/lib/letsencrypt-container/get_certificate.pid: Invalid argument

I have done some testing on my end, and after repeatedly restarting/recreating the containers with this edit the problem goes away.
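The work-around itself is only a few lines of shell. A minimal sketch of the startup cleanup (not the PR's exact code; the demo below uses a temp directory instead of the real /var/lib/letsencrypt-container path so it can run anywhere):

```shell
#!/bin/sh
# Sketch of the startup cleanup: remove a stale dehydrated lockfile left
# over from a previous container run, then let the watcher/cron start.
# In the container this would target
# /var/lib/letsencrypt-container/get_certificate.pid; a temp dir is used
# here so the demo is self-contained.
LOCKFILE="$(mktemp -d)/get_certificate.pid"
echo 12345 > "$LOCKFILE"    # simulate a stale pid left by a dead container

if [ -e "$LOCKFILE" ]; then
    rm -f -- "$LOCKFILE"
    echo "removed stale lockfile"
fi
```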

ibotty commented 7 years ago

Why do you mount the directory from NFS? It does not need to be persistent: the state is contained entirely in the routes and the ACME account secret.

Having said that, I don't oppose changing to something more robust. As far as I can tell, flock does not work reliably on NFS. Maybe the lock file should live on a tmpfs? That is, if you still want to keep persisting the libdir.
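The tmpfs idea maps naturally onto a memory-backed emptyDir in the pod spec. A hypothetical fragment (volume names, claim name, and mount paths are all made up for illustration):

```yaml
# Hypothetical pod-spec fragment: keep only the lock directory on a
# tmpfs-backed emptyDir while the certificate libdir stays on the
# persistent NFS volume.
volumes:
  - name: letsencrypt-data          # existing NFS-backed PVC (name assumed)
    persistentVolumeClaim:
      claimName: letsencrypt-pvc
  - name: lock-dir                  # tmpfs; vanishes with the pod, no stale locks
    emptyDir:
      medium: Memory
containers:
  - name: watcher
    volumeMounts:
      - name: letsencrypt-data
        mountPath: /var/lib/letsencrypt-container
      - name: lock-dir
        mountPath: /var/run/letsencrypt-lock
```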

jtcressy commented 7 years ago

I need to persist the certificates by storing them somewhere other than an emptyDir. If I don't persist them and a route gets deleted, I completely lose those certs if I need to recreate the route. Of course, the bot doesn't re-apply the certificates if they already exist, so I've had to manually copy/paste the certificate into recreated routes by reading from the files on the persistent volume. But that's a different issue for another time.

If you would like to change it so that get_certificate.pid lives on its own emptyDir, that would be another way to fix the issue. I've simply scripted my work-around for the flock problems on NFS.

ibotty commented 7 years ago

Ah, the use case of deleting the routes and then adding them again. That had not crossed my mind yet. Does that work? I would have thought that it would not update the routes, because the certificate is still valid. Did you test that?

Re flock on NFS: as far as I can tell, you are lucky it worked at all!

There is also a race condition when starting up containers: one of the three containers (say, cron) could already be getting a new certificate, holding the lock as intended, while another (say, the watcher) deletes the lock.
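An illustration of why flock on a local filesystem sidesteps that race: the lock is tied to an open file descriptor, so a second process fails to acquire it cleanly instead of racing to delete a pid file. This sketch assumes the flock(1) utility from util-linux:

```shell
#!/bin/sh
# Illustrative only: advisory flock on a local (non-NFS) filesystem.
LOCK="$(mktemp)"   # stand-in lock file

RESULT="$(
    (
        flock -n 9 || exit 1            # holder: exclusive lock on fd 9
        echo "holder: acquired lock"
        # a second attempt while the lock is held fails immediately,
        # rather than deleting anything out from under the holder:
        flock -n "$LOCK" -c 'echo "second: acquired"' \
            || echo "second: lock busy, backing off"
    ) 9>"$LOCK"
)"
echo "$RESULT"
```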

Would you like to describe your use case in the README in a separate pull request? If you'd like to have a go at changing the implementation to lock in another directory, I would very much appreciate it. I am on holiday for the next week and will not get to it.

jtcressy commented 7 years ago

Yes, recreating routes with valid certs in the datadir results in the container logging that the certificate is still valid, but it does not re-apply the certificate data to the route. I have to manually edit the route to add the certificate. I assume it works normally when the certificate actually needs to be renewed.

I'm thinking of adding a check for whether the route contains the correct certificate, and forcibly updating it if not.
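A rough sketch of such a check, with the real `oc` lookups left as comments (route name, paths, and the sample certificates are all hypothetical):

```shell
#!/bin/sh
# Hypothetical check: re-apply the on-disk certificate when the route's
# spec.tls.certificate no longer matches it. The real lookups would be
# something along the lines of:
#   route_cert=$(oc get route "$ROUTE" -o jsonpath='{.spec.tls.certificate}')
#   disk_cert=$(cat "$DATADIR/$DOMAIN/cert.pem")
# Sample data stands in here so the comparison itself is runnable.
route_cert="-----BEGIN CERTIFICATE-----
old
-----END CERTIFICATE-----"
disk_cert="-----BEGIN CERTIFICATE-----
new
-----END CERTIFICATE-----"

if [ "$route_cert" != "$disk_cert" ]; then
    echo "route certificate differs from datadir; would re-apply"
    # an `oc patch route ...` call would go here
else
    echo "route certificate up to date"
fi
```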

(NFS has been the largest PITA I've ever had to deal with, but I can't come up with a better solution to support dynamic provisioning of PVCs. I got lucky with it this time.)

As for the possible race condition, I can look at telling dehydrated to put its lock file somewhere else, backed by an emptyDir that the cron container can also see. I'll close this PR for now and investigate.
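If memory serves, dehydrated's config file exposes a lock location via a `LOCKFILE` setting (worth verifying against the dehydrated version in the image), so the relocation could be as small as a config fragment pointing at the shared emptyDir mount (the path below is an assumption):

```shell
# Hypothetical dehydrated config fragment: point the lock at a directory
# backed by an emptyDir shared between the cron and watcher containers.
LOCKFILE="/var/run/letsencrypt-lock/lock"
```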