Open rstupek opened 1 year ago
Do you have more context what you did and when this error message occurred? How many concurrent Caddy servers are running, what traffic they have to handle how many certificates?
However, it can happen that there is a problem releasing a Consul distributed lock without any further issues.
I have a 3 consul cluster setup and the consul plugin does appear to be storing data however every new certificate that gets requested results in the above failure. Below is an example ` 2022/09/26 16:25:46.080 INFO tls.on_demand obtaining new certificate {"remote_ip": "99.110.94.85", "remote_port": "50979", "server_name": "test.madeupcompany.com"} 2022/09/26 16:25:46.112 INFO caddy.storage.consul deleting key rw_test_82319838895603462 from Consul 2022/09/26 16:25:46.136 INFO tls.obtain acquiring lock {"identifier": "test.madeupcompany.com"} 2022/09/26 16:25:46.136 INFO caddy.storage.consul creating Consul lock for issue_cert_test.madeupcompany.com 2022/09/26 16:25:46.191 INFO tls.obtain lock acquired {"identifier": "test.madeupcompany.com"} 2022/09/26 16:25:46.196 INFO tls.obtain obtaining certificate {"identifier": "test.madeupcompany.com"} 2022/09/26 16:25:46.207 INFO http waiting on internal rate limiter {"identifiers": ["test.madeupcompany.com"], "ca": "https://acme-v02.api.letsencrypt.org/directory", "account": "support@ezot.com"} 2022/09/26 16:25:46.207 INFO http done waiting on internal rate limiter {"identifiers": ["test.madeupcompany.com"], "ca": "https://acme-v02.api.letsencrypt.org/directory", "account": "support@ezot.com"} 2022/09/26 16:25:46.658 INFO http.acme_client trying to solve challenge {"identifier": "test.madeupcompany.com", "challenge_type": "http-01", "ca": "https://acme-v02.api.letsencrypt.org/directory"} 2022/09/26 16:25:46.870 INFO http served key authentication {"identifier": "test.madeupcompany.com", "challenge": "http-01", "remote": "18.118.122.75:56800", "distributed": false} 2022/09/26 16:25:46.919 INFO http served key authentication {"identifier": "test.madeupcompany.com", "challenge": "http-01", "remote": "23.178.112.208:35094", "distributed": false} 2022/09/26 16:25:47.061 INFO http served key authentication {"identifier": "test.madeupcompany.com", "challenge": "http-01", "remote": "34.219.117.149:61766", "distributed": false} 2022/09/26 16:25:47.085 INFO http served key authentication {"identifier": "test.madeupcompany.com", "challenge": "http-01", "remote": "3.120.149.9:43226", "distributed": false} 2022/09/26 16:25:47.346 INFO caddy.storage.consul deleting key acme/acme-v02.api.letsencrypt.org-directory/challenge_tokens/test.madeupcompany.com.json from Consul 2022/09/26 16:25:47.358 INFO http.acme_client authorization finalized {"identifier": "test.madeupcompany.com", "authz_status": "valid"} 2022/09/26 16:25:47.358 INFO http.acme_client validations succeeded; finalizing order {"order": "https://acme-v02.api.letsencrypt.org/acme/order/743838387/129091539597"} 2022/09/26 16:25:48.531 INFO http.acme_client successfully downloaded available certificate chains {"count": 2, "first_url": "https://acme-v02.api.letsencrypt.org/acme/cert/04b35be8294d5d7e3156cbdbbc116bf6fda3"} 2022/09/26 16:25:48.563 INFO tls.obtain certificate obtained successfully {"identifier": "test.madeupcompany.com"} 2022/09/26 16:25:48.563 INFO tls.obtain releasing lock {"identifier": "test.madeupcompany.com"} 2022/09/26 16:25:48.573 ERROR caddy.storage.consul failed to release lock: lock caddytls/issue_cert_test.madeupcompany.com not found '
It seems that it created the lock without the prefix but attempts to release it with the prefix?
consul kv does appear to have the correct thing setup
{
"key": "caddytls/issue_cert_test.madeupcompany.com",
"flags": 3304740253564472344,
"value": ""
},
I came across this while working on https://github.com/pteich/caddy-tlsconsul/pull/28 --- you can see this if you run go test --tags=consul -v
. It doesn't have anything to do with the prefix: the message 2022/09/26 16:25:46.136 INFO caddy.storage.consul creating Consul lock for issue_cert_test.madeupcompany.com
comes from
which doesn't log the prefix
, while the error you are seeing comes from
which logs the error returned by Consul, which does happen to include the full path the plugin asked for.
Rather, I think that error is coming from the logic in that goroutine to cleanup in the event a lock is lost. lockActive
comes from
Per the Consul SDK documentation, this is "... a channel that is closed if our lock is lost or an error." (https://pkg.go.dev/github.com/hashicorp/consul/api#Lock). It also seems that channel is closed if the original locker also unlocked the lock.
So I think what is happening is:
cs.Unlock
is called, which makes the Consul SDK Unlock
call and removes the locklockActive
cs.Unlock
again from the goroutine, but of course we've just unlocked the lock so Consul returns an error to usI think that bit of logic needs to be rethought and a couple things done:
cs.Unlock
close that second channel and have it just exit the goroutinecs.Unlock
if we've lost the lock --- we've lost it, so we're going to get an error if we tell Consul to release the lock we no longer have. I think all we really need to do is remove it from the cs.locks
mapDo you know if the issue would cause a problem with renewing the certificate in the 90 day renewal period if consul kv still has the setting?
I don't think this will cause a direct problem with renewing a certificate. When Caddy calls Unlock
to actually unlock, if there had been a problem the error you would have gotten would have been
You don't have that, so Caddy successfully asked Consul to release the lock, and Consul did. That, however, also triggers the goroutine
where it tries to Unlock
again. This time it isn't found because the first call to Unlock
, after it asked Consul to unlock, removed the lock info from the internal cs.locks
list and that's the error we get back:
which when returned to
gives you failed to release lock: lock <lockname> not found
, and because <lockname>
is logged as cs.prefixKey(key)
you get the full K/V path.
So in fact, I don't think we're actually asking Consul to unlock a second time, because when Unlock
is called in that goroutine Caddy already thinks we don't have that lock, and that's the error you get above.
As further proof, when you look up that K/V entry you may see it there, but since there's no session
field, no current session actually holds a lock on it. Looking at the code here, it doesn't look like we delete the K/V entry, we simply release the lock, and all you'll see on the K/V side of things is that session
field disappearing.
As @kula points out it is more or less an unnecessary double unlock. But it should and will be no problem with renewing. We use it for several years and many hundreds of domains without any issue regarding renewal.
Is this the expected behavior for thus plugin?