pteich / caddy-tlsconsul

🔒 Consul K/V storage for Caddy Web Server / Certmagic TLS data
Apache License 2.0

"failed to release lock" #27

Open rstupek opened 1 year ago

rstupek commented 1 year ago

Is this the expected behavior for this plugin?

pteich commented 1 year ago

Do you have more context on what you did and when this error message occurred? How many concurrent Caddy servers are running, what traffic do they handle, and how many certificates are involved?

That said, it can happen that releasing a Consul distributed lock fails without causing any further issues.

rstupek commented 1 year ago

I have a 3-node Consul cluster set up, and the Consul plugin does appear to be storing data. However, every new certificate that gets requested results in the above failure. Below is an example:

    2022/09/26 16:25:46.080 INFO tls.on_demand obtaining new certificate {"remote_ip": "99.110.94.85", "remote_port": "50979", "server_name": "test.madeupcompany.com"}
    2022/09/26 16:25:46.112 INFO caddy.storage.consul deleting key rw_test_82319838895603462 from Consul
    2022/09/26 16:25:46.136 INFO tls.obtain acquiring lock {"identifier": "test.madeupcompany.com"}
    2022/09/26 16:25:46.136 INFO caddy.storage.consul creating Consul lock for issue_cert_test.madeupcompany.com
    2022/09/26 16:25:46.191 INFO tls.obtain lock acquired {"identifier": "test.madeupcompany.com"}
    2022/09/26 16:25:46.196 INFO tls.obtain obtaining certificate {"identifier": "test.madeupcompany.com"}
    2022/09/26 16:25:46.207 INFO http waiting on internal rate limiter {"identifiers": ["test.madeupcompany.com"], "ca": "https://acme-v02.api.letsencrypt.org/directory", "account": "support@ezot.com"}
    2022/09/26 16:25:46.207 INFO http done waiting on internal rate limiter {"identifiers": ["test.madeupcompany.com"], "ca": "https://acme-v02.api.letsencrypt.org/directory", "account": "support@ezot.com"}
    2022/09/26 16:25:46.658 INFO http.acme_client trying to solve challenge {"identifier": "test.madeupcompany.com", "challenge_type": "http-01", "ca": "https://acme-v02.api.letsencrypt.org/directory"}
    2022/09/26 16:25:46.870 INFO http served key authentication {"identifier": "test.madeupcompany.com", "challenge": "http-01", "remote": "18.118.122.75:56800", "distributed": false}
    2022/09/26 16:25:46.919 INFO http served key authentication {"identifier": "test.madeupcompany.com", "challenge": "http-01", "remote": "23.178.112.208:35094", "distributed": false}
    2022/09/26 16:25:47.061 INFO http served key authentication {"identifier": "test.madeupcompany.com", "challenge": "http-01", "remote": "34.219.117.149:61766", "distributed": false}
    2022/09/26 16:25:47.085 INFO http served key authentication {"identifier": "test.madeupcompany.com", "challenge": "http-01", "remote": "3.120.149.9:43226", "distributed": false}
    2022/09/26 16:25:47.346 INFO caddy.storage.consul deleting key acme/acme-v02.api.letsencrypt.org-directory/challenge_tokens/test.madeupcompany.com.json from Consul
    2022/09/26 16:25:47.358 INFO http.acme_client authorization finalized {"identifier": "test.madeupcompany.com", "authz_status": "valid"}
    2022/09/26 16:25:47.358 INFO http.acme_client validations succeeded; finalizing order {"order": "https://acme-v02.api.letsencrypt.org/acme/order/743838387/129091539597"}
    2022/09/26 16:25:48.531 INFO http.acme_client successfully downloaded available certificate chains {"count": 2, "first_url": "https://acme-v02.api.letsencrypt.org/acme/cert/04b35be8294d5d7e3156cbdbbc116bf6fda3"}
    2022/09/26 16:25:48.563 INFO tls.obtain certificate obtained successfully {"identifier": "test.madeupcompany.com"}
    2022/09/26 16:25:48.563 INFO tls.obtain releasing lock {"identifier": "test.madeupcompany.com"}
    2022/09/26 16:25:48.573 ERROR caddy.storage.consul failed to release lock: lock caddytls/issue_cert_test.madeupcompany.com not found

It seems that it creates the lock without the prefix but attempts to release it with the prefix?

Consul K/V does appear to have the correct entry set up:

    {
        "key": "caddytls/issue_cert_test.madeupcompany.com",
        "flags": 3304740253564472344,
        "value": ""
    },
kula commented 1 year ago

I came across this while working on https://github.com/pteich/caddy-tlsconsul/pull/28; you can see it if you run `go test --tags=consul -v`. It doesn't have anything to do with the prefix: the message `2022/09/26 16:25:46.136 INFO caddy.storage.consul creating Consul lock for issue_cert_test.madeupcompany.com` comes from

https://github.com/pteich/caddy-tlsconsul/blob/3811ba667582f042ca05f6d894757ab33d972542/storage.go#L67

which doesn't log the prefix, while the error you are seeing comes from

https://github.com/pteich/caddy-tlsconsul/blob/3811ba667582f042ca05f6d894757ab33d972542/storage.go#L84-L90

which logs the error returned by Consul, which does happen to include the full path the plugin asked for.

Rather, I think that error is coming from the logic in that goroutine that cleans up in the event a lock is lost. lockActive comes from

https://github.com/pteich/caddy-tlsconsul/blob/3811ba667582f042ca05f6d894757ab33d972542/storage.go#L78

Per the Consul SDK documentation, this is "... a channel that is closed if our lock is lost or an error [occurs]" (https://pkg.go.dev/github.com/hashicorp/consul/api#Lock). It also seems that channel is closed when the original locker itself unlocks the lock.
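For illustration, here is a minimal, runnable sketch (not the plugin's code) of that SDK behavior; it assumes a Consul agent reachable via the default config, and the key name is just an example mirroring the log above:

```go
package main

import (
	"fmt"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Assumes a local Consul agent reachable via the default config.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}

	// Example key mirroring the log above; not the plugin's actual call site.
	lock, err := client.LockKey("caddytls/issue_cert_test.madeupcompany.com")
	if err != nil {
		panic(err)
	}

	// Lock blocks until the lock is held; lostCh is the channel the docs
	// describe: closed if our lock is lost or an error occurs.
	lostCh, err := lock.Lock(nil)
	if err != nil {
		panic(err)
	}

	// Releasing the lock ourselves also ends up closing lostCh, which is
	// exactly what re-triggers the plugin's cleanup goroutine.
	if err := lock.Unlock(); err != nil {
		panic(err)
	}

	<-lostCh // returns promptly: closed as a consequence of our own Unlock
	fmt.Println("lostCh closed after our own Unlock")
}
```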

So I think what is happening is the following (a runnable toy model of this sequence follows the list):

  1. cs.Unlock is called, which makes the Consul SDK Unlock call and removes the lock
  2. That triggers the closing of lockActive
  3. Which then calls cs.Unlock again from the goroutine, but of course we've just unlocked the lock, so Consul returns an error to us
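Here is that sequence as a toy model, with no Consul involved; the store type and its unlock method are illustrative stand-ins for the plugin's bookkeeping, not the actual code in storage.go:

```go
package main

import (
	"fmt"
	"sync"
)

// store models only the bookkeeping relevant here: a map of held locks.
type store struct {
	mu    sync.Mutex
	locks map[string]chan struct{}
}

// unlock mimics the described behavior: error if the lock is unknown,
// otherwise close its channel (as the Consul SDK effectively does on
// Unlock) and forget it.
func (s *store) unlock(key string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch, ok := s.locks[key]
	if !ok {
		return fmt.Errorf("lock %s not found", key)
	}
	close(ch)
	delete(s.locks, key)
	return nil
}

func main() {
	key := "caddytls/issue_cert_test.madeupcompany.com"
	lockActive := make(chan struct{})
	s := &store{locks: map[string]chan struct{}{key: lockActive}}

	done := make(chan struct{})
	go func() {
		defer close(done)
		<-lockActive                          // step 2: closed by our own unlock
		if err := s.unlock(key); err != nil { // step 3: lock is already gone
			fmt.Println("failed to release lock:", err)
		}
	}()

	if err := s.unlock(key); err != nil { // step 1: the deliberate unlock
		fmt.Println(err)
	}
	<-done
}
```

Running this prints `failed to release lock: lock caddytls/issue_cert_test.madeupcompany.com not found`, the same shape of error as in the log above.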

I think that bit of logic needs to be rethought, with a couple of things done:

  1. Something like having a second channel and a select in that goroutine: if we call cs.Unlock, close that second channel and have the goroutine just exit (see the sketch after this list)
  2. I don't think the goroutine should call cs.Unlock if we've lost the lock: we've lost it, so we're going to get an error if we tell Consul to release a lock we no longer hold. I think all we really need to do is remove it from the cs.locks map
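A rough, runnable sketch of suggestions 1 and 2 together; all names here are illustrative, this is not a patch against storage.go:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	// Illustrative stand-ins for the plugin's bookkeeping; none of these
	// names come from storage.go.
	key := "issue_cert_test.madeupcompany.com"
	var mu sync.Mutex
	locks := map[string]struct{}{key: {}}

	lockActive := make(chan struct{}) // closed by the SDK if the lock is lost
	released := make(chan struct{})   // suggestion 1: closed by our own Unlock path
	done := make(chan struct{})

	go func() {
		defer close(done)
		select {
		case <-released:
			// Deliberate Unlock: the normal path already cleaned up; just exit.
			return
		case <-lockActive:
			// Lock genuinely lost: per suggestion 2, don't ask Consul to
			// release a lock we no longer hold; just drop the map entry.
			mu.Lock()
			delete(locks, key)
			mu.Unlock()
		}
	}()

	// Normal Unlock path: release in Consul (elided here), remove our
	// bookkeeping, then tell the watchdog goroutine to exit.
	mu.Lock()
	delete(locks, key)
	mu.Unlock()
	close(released)

	<-done
	fmt.Println("clean shutdown, no spurious second unlock")
}
```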
rstupek commented 1 year ago

Do you know if this issue would cause a problem with renewing the certificate within the 90-day renewal period if Consul K/V still has the entry?

kula commented 1 year ago

I don't think this will cause a direct problem with renewing a certificate. When Caddy calls Unlock to actually unlock, if there had been a problem, the error you would have gotten is this one:

https://github.com/pteich/caddy-tlsconsul/blob/3811ba667582f042ca05f6d894757ab33d972542/storage.go#L122

You don't have that, so Caddy successfully asked Consul to release the lock, and Consul did. That, however, also triggers the goroutine

https://github.com/pteich/caddy-tlsconsul/blob/3811ba667582f042ca05f6d894757ab33d972542/storage.go#L83-L90

where it tries to Unlock again. This time the lock isn't found, because the first call to Unlock, after asking Consul to release, removed the lock info from the internal cs.locks list, and that's the error we get back:

https://github.com/pteich/caddy-tlsconsul/blob/3811ba667582f042ca05f6d894757ab33d972542/storage.go#L115-L118

which when returned to

https://github.com/pteich/caddy-tlsconsul/blob/3811ba667582f042ca05f6d894757ab33d972542/storage.go#L86-L88

gives you `failed to release lock: lock <lockname> not found`, and because `<lockname>` is logged as `cs.prefixKey(key)`, you get the full K/V path.

So in fact, I don't think we're actually asking Consul to unlock a second time, because when Unlock is called in that goroutine Caddy already thinks we don't have that lock, and that's the error you get above.

As further proof: when you look up that K/V entry you may still see it there, but since there is no session field, no current session actually holds a lock on it. Looking at the code, it doesn't appear that we ever delete the K/V entry; we simply release the lock, and all you'll see on the K/V side is the session field disappearing.
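If you want to check this yourself, here is a small sketch (assuming a local Consul agent) that inspects the entry's Session field via the official SDK; Session names the session holding the lock and is empty once the lock is released:

```go
package main

import (
	"fmt"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}

	// Example key from the log above.
	pair, _, err := client.KV().Get("caddytls/issue_cert_test.madeupcompany.com", nil)
	if err != nil {
		panic(err)
	}
	if pair == nil {
		fmt.Println("no such key")
		return
	}

	if pair.Session == "" {
		// The entry still exists, but no session holds a lock on it.
		fmt.Println("entry present, lock released")
	} else {
		fmt.Println("locked by session:", pair.Session)
	}
}
```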

pteich commented 1 year ago

As @kula points out, it is more or less an unnecessary double unlock. But it should not, and will not, cause any problem with renewing. We have used this plugin for several years, with many hundreds of domains, without any issue regarding renewal.