Closed dazoot closed 3 years ago
Any hint on this one? It's occurring daily.
Now it happened when getting a new certificate for the first time:
{"level":"info","ts":1624962005.2610888,"logger":"tls","msg":"served key authentication certificate","server_name":"domain.org","challenge":"tls-alpn-01","remote":"3.142.122.14:55728","distributed":true}
{"level":"info","ts":1624962005.6160824,"logger":"tls","msg":"served key authentication certificate","server_name":"domain.org","challenge":"tls-alpn-01","remote":"34.221.186.243:58126","distributed":true}
{"level":"info","ts":1624962006.940489,"logger":"tls","msg":"served key authentication certificate","server_name":"domain.org","challenge":"tls-alpn-01","remote":"66.133.109.36:56752","distributed":true}
{"level":"info","ts":1624962008.757574,"logger":"tls.obtain","msg":"acquiring lock","identifier":"domain.org"}
{"level":"info","ts":1624962008.8545318,"logger":"tls.obtain","msg":"lock acquired","identifier":"domain.org"}
{"level":"info","ts":1624962008.8653884,"logger":"tls.obtain","msg":"certificate already exists in storage","identifier":"domain.org"}
{"level":"info","ts":1624962008.8654132,"logger":"tls.obtain","msg":"releasing lock","identifier":"domain.org"}
fatal error: concurrent map read and map write
goroutine 1084603 [running]:
runtime.throw(0x18a199d, 0x21)
runtime/panic.go:1117 +0x72 fp=0xc0009b9520 sp=0xc0009b94f0 pc=0x438652
runtime.mapaccess2_faststr(0x1641de0, 0xc001ca0930, 0xc005394ae0, 0x1c, 0xc001a15e60, 0xc00173f618)
runtime/map_faststr.go:116 +0x4a5 fp=0xc0009b9590 sp=0xc0009b9520 pc=0x414505
github.com/pteich/caddy-tlsconsul.ConsulStorage.Unlock(0x0, 0x0, 0xc00126b680, 0xc00008e270, 0xc001ca0930, 0xc0011548a0, 0x19, 0x0, 0x0, 0xa, ...)
github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:87 +0x66 fp=0xc0009b9670 sp=0xc0009b9590 pc=0x14f70a6
github.com/pteich/caddy-tlsconsul.ConsulStorage.Lock.func1(0xc00173f5c0, 0xc002152d80, 0xc005394ae0, 0x1c)
github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:75 +0xb8 fp=0xc0009b97c0 sp=0xc0009b9670 pc=0x14fa3d8
runtime.goexit()
runtime/asm_amd64.s:1371 +0x1 fp=0xc0009b97c8 sp=0xc0009b97c0 pc=0x4728a1
created by github.com/pteich/caddy-tlsconsul.ConsulStorage.Lock
github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:73 +0x68e
goroutine 1 [select (no cases), 2384 minutes]:
github.com/caddyserver/caddy/v2/cmd.cmdRun(0xc0001a97a0, 0x0, 0x0, 0x0)
github.com/caddyserver/caddy/v2@v2.4.3/cmd/commandfuncs.go:276 +0x1395
github.com/caddyserver/caddy/v2/cmd.Main()
github.com/caddyserver/caddy/v2@v2.4.3/cmd/main.go:85 +0x25b
main.main()
caddy/main.go:15 +0x25
goroutine 24 [select, 2384 minutes]:
github.com/caddyserver/certmagic.(*RingBufferRateLimiter).permit(0xc0001b7d60)
github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:216 +0xb2
github.com/caddyserver/certmagic.(*RingBufferRateLimiter).loop(0xc0001b7d60)
github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:89 +0xa8
created by github.com/caddyserver/certmagic.NewRateLimiter
github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:45 +0x148
@pteich can we sponsor this issue?
Hi @dazoot - I'm really sorry, but I probably missed the issue notification in all the noise 😢 That should not happen. I'll have a look to see whether it's an issue I can fix with a mutex or whether it's something upstream (Consul lib) related.
@dazoot Thanks for reporting this problem. There was indeed a map that was not protected against concurrent reads and writes. This is fixed now. I'll create a new version, v1.3.3, that includes this fix.
It seems the TLS register/renew certificate process is now stuck acquiring the lock:
{"level":"info","ts":1625564182.286606,"logger":"tls.obtain","msg":"acquiring lock","identifier":"nl.domain.ro"}
Strange, it worked on my local setup but I'll check it out on a larger installation.
@dazoot I introduced a fix in another new version, v1.3.4, which I've now had running for over 12 hours and which has successfully received new certificates. I now consider this bug finally resolved.
After a couple of weeks I have noticed that we still have stuck certificates. After checking the nodes, I see that they get stuck waiting for the lock.
{"level":"info","ts":1626567854.1713824,"logger":"tls.renew","msg":"acquiring lock","identifier":"cdn.domain.ro"}
Can this lock acquisition be set to time out after a while?
This is a log message from Caddy (or rather Certmagic), but you don't see any errors or messages after it appears? I probably have to add some debug logs to get an impression of whether the Consul plugin is even called and where it breaks. By now I have both a local lock, for quickly checking whether the instance already holds the lock (this is the one with the concurrency problems before), and the real distributed Consul locks. I can add a lock wait time for getting this Consul lock. Maybe in some situations it just takes too long to get this lock, or Consul is not responsive and everything gets stuck.
I'll add this. Maybe as a new config option. And I'll also add debug logs to get more helpful messages in such cases.
{"level":"info","ts":1626603361.6669,"logger":"tls.cache.maintenance","msg":"attempting certificate renewal","identifiers":["nl.dom.ro"],"remaining":2588554.333102271}
{"level":"info","ts":1626603361.858531,"logger":"tls.renew","msg":"acquiring lock","identifier":"nl.dom.ro"}
I have set renew_interval very high (e.g. 15d) on all Caddy nodes except one, so just one node actually renews certificates. Even when a single node is used, the lock is never acquired. Stuck in limbo :)
{"level":"info","ts":1626596162.0471478,"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal","identifiers":["img.dom.ro"],"remaining":2591886.952852568}
{"level":"info","ts":1626596162.0472286,"logger":"tls.cache.maintenance","msg":"attempting certificate renewal","identifiers":["img.dom.ro"],"remaining":2591886.952771762}
{"level":"info","ts":1626596162.3396833,"logger":"tls.renew","msg":"acquiring lock","identifier":"img.dom.ro"}
{"level":"info","ts":1626599762.3353257,"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal","identifiers":["img.dom.ro"],"remaining":2588286.66467504}
{"level":"info","ts":1626603361.6667585,"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal","identifiers":["img.dom.ro"],"remaining":2584687.333243155}
{"level":"info","ts":1626606961.9950764,"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal","identifiers":["img.dom.ro"],"remaining":2581087.004924088}
{"level":"info","ts":1626607212.4525259,"logger":"tls.renew","msg":"acquiring lock","identifier":"img.dom.ro"}
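For reference, the renew_interval mentioned above goes in the Caddyfile's global options block. A sketch (assuming Caddy's duration syntax, which accepts a `d` unit; check the Caddyfile global options docs for your version):

```
{
    renew_interval 15d
}
```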
Is the real distributed Consul lock required? Maybe we can make it optional.
To my understanding it is needed in a Caddy cluster to share the lock across instances, so that only one instance can renew or apply for new certificates.
I've changed the code in master to try only once to acquire the lock in Consul and otherwise fail. I think that is enough for this use case. If it gets the lock, everything is OK. If not, another Caddy instance probably has it (or an error occurred), and therefore it's fine to return an error. There is no need to wait forever.
The code now also uses the timeout in seconds that can already be configured for lock wait timeouts:
storage consul {
    // ...
    timeout 10
    // ...
}
Great. Can you create a release, please?
Done!
Out of curiosity: how many domains and requests does your cluster roughly serve? I have nearly 2.5 million hits/day with a 4-node Caddy cluster (but only <100 domains) and never ran into similar problems. So thanks for your interesting findings that helped improve things ;)
We have about 4000 domains, far fewer requests (mostly link rewriting), and about 7 nodes. Will try it out soon. Thanks.
Still hangs. We have ~50 certs to renew. After a restart it renews about 5, then I see a couple of "tls.renew","msg":"acquiring lock" log messages and nothing. It does not seem to reach the Consul locking part (e.g. the timeout or the logging you have added).
Strange. I've added some simple debug messages in this version that should at least show whether it reaches my code. But I'm not sure how to enable debug logging in Caddy.
Could it be the local locking, and not the Consul locking, that blocks?
And I think the sink logging in Caddy uses INFO as the log level. I did not find a way to change it. Can we change the module's logging from DEBUG to INFO? In general there should not be much noise from this module, so INFO would be fine.
I'll change it to info logging. Hopefully we can at least see where it got stuck.
I did some local tests (changed Debugf to Infof). It does not get past the local locking; it never reaches the Consul part.
cat /var/log/caddyserver.log | grep caddy.storage | grep -v "loading data from" | grep scuba
{"level":"info","ts":1626716859.5458074,"logger":"caddy.storage.consul","msg":"trying lock for issue_cert_nl.scubadomain.tld"}
There should next be an attempt to create the Consul lock:
func (cs *ConsulStorage) Lock(ctx context.Context, key string) error {
    cs.logger.Infof("trying lock for %s", key)
    if _, isLocked := cs.GetLock(key); isLocked {
        return nil
    }

    // prepare the distributed lock
    cs.logger.Infof("creating Consul lock for %s", key)
I'm pretty sure it got stuck because another process holds the mutex. The reason could be the unlock function.
I've changed the code for this and created a new release. I also switched from debug to info, as you did locally.
New release?
I've already created v1.3.6
Seems OK now. All certs were renewed, but with ZeroSSL (the fallback); Let's Encrypt was down a while ago.
So now for some hosts I have 2 certs: one from LE which expires in 20 days, and a fresh one from ZeroSSL.
Is this handled by Caddy or by this module?
This is handled by Certmagic inside Caddy and should be no problem (at least I saw the same some time ago with some domains). This module just loads and saves the data on request.
So the expiring cert will eventually be deleted?
Exactly. The Certmagic Storage interface contains a Delete function for this, and this module implements it.
One of our Caddy instances was renewing ~5 certificates and we got this error in the log:
The whole Caddy process died. Any hint about what could have gone wrong?