pteich / caddy-tlsconsul

🔒 Consul K/V storage for Caddy Web Server / Certmagic TLS data
Apache License 2.0
96 stars 17 forks

fatal error: concurrent map read and map write #18

Closed · dazoot closed this issue 3 years ago

dazoot commented 3 years ago

One of our caddy instances was renewing ~5 certificates and we got this error in the log:

{"level":"info","ts":1624549442.7298193,"logger":"tls.renew","msg":"acquiring lock","identifier":"dom1.eu"}
{"level":"info","ts":1624549442.745405,"logger":"tls.renew","msg":"acquiring lock","identifier":"dom2.eu"}
{"level":"info","ts":1624549442.7781668,"logger":"tls.renew","msg":"acquiring lock","identifier":"dom3.eu"}
{"level":"info","ts":1624549442.7857873,"logger":"tls.renew","msg":"acquiring lock","identifier":"dom4.eu"}
{"level":"info","ts":1624549442.7933903,"logger":"tls.renew","msg":"acquiring lock","identifier":"dom5.eu"}
{"level":"info","ts":1624549443.8955014,"logger":"tls","msg":"served key authentication certificate","server_name":"dom2.eu","challenge":"tls-alpn-01","remote":"18.197.97.115:13872","distributed":true}
{"level":"info","ts":1624549444.227819,"logger":"tls","msg":"served key authentication certificate","server_name":"dom1.eu","challenge":"tls-alpn-01","remote":"18.197.97.115:13964","distributed":true}
{"level":"info","ts":1624549444.2303605,"logger":"tls","msg":"served key authentication certificate","server_name":"dom3.eu","challenge":"tls-alpn-01","remote":"3.120.130.29:35008","distributed":true}
{"level":"info","ts":1624549444.2698698,"logger":"tls","msg":"served key authentication certificate","server_name":"dom2.eu","challenge":"tls-alpn-01","remote":"66.133.109.36:30662","distributed":true}
{"level":"info","ts":1624549444.3561459,"logger":"tls","msg":"served key authentication certificate","server_name":"dom5.eu","challenge":"tls-alpn-01","remote":"18.197.97.115:13990","distributed":true}
{"level":"info","ts":1624549444.4526017,"logger":"tls","msg":"served key authentication certificate","server_name":"dom2.eu","challenge":"tls-alpn-01","remote":"18.116.86.117:42258","distributed":true}
{"level":"info","ts":1624549444.5201306,"logger":"tls","msg":"served key authentication certificate","server_name":"dom4.eu","challenge":"tls-alpn-01","remote":"18.197.97.115:14026","distributed":true}
{"level":"info","ts":1624549444.5312564,"logger":"tls","msg":"served key authentication certificate","server_name":"dom2.eu","challenge":"tls-alpn-01","remote":"34.221.186.243:63866","distributed":true}
{"level":"info","ts":1624549445.189281,"logger":"tls","msg":"served key authentication certificate","server_name":"dom1.eu","challenge":"tls-alpn-01","remote":"18.116.86.117:42442","distributed":true}
{"level":"info","ts":1624549445.2701666,"logger":"tls","msg":"served key authentication certificate","server_name":"dom3.eu","challenge":"tls-alpn-01","remote":"3.142.122.14:16940","distributed":true}
{"level":"info","ts":1624549445.3078008,"logger":"tls","msg":"served key authentication certificate","server_name":"dom4.eu","challenge":"tls-alpn-01","remote":"3.142.122.14:16962","distributed":true}
{"level":"info","ts":1624549445.4730475,"logger":"tls","msg":"served key authentication certificate","server_name":"dom3.eu","challenge":"tls-alpn-01","remote":"66.133.109.36:30916","distributed":true}
{"level":"info","ts":1624549445.698484,"logger":"tls.renew","msg":"lock acquired","identifier":"dom2.eu"}
{"level":"info","ts":1624549445.7244895,"logger":"tls.renew","msg":"certificate appears to have been renewed already","identifier":"dom2.eu","remaining":7772397.275513487}
{"level":"info","ts":1624549445.72452,"logger":"tls.renew","msg":"releasing lock","identifier":"dom2.eu"}

fatal error: concurrent map read and map write

goroutine 942994 [running]:
runtime.throw(0x18a199d, 0x21)
        runtime/panic.go:1117 +0x72 fp=0xc000067d20 sp=0xc000067cf0 pc=0x438652
runtime.mapaccess2_faststr(0x1641de0, 0xc004ab5e90, 0xc005faaae0, 0x1c, 0xc0017b5980, 0xc005024118)
        runtime/map_faststr.go:116 +0x4a5 fp=0xc000067d90 sp=0xc000067d20 pc=0x414505
github.com/pteich/caddy-tlsconsul.ConsulStorage.Unlock(0x0, 0x0, 0xc0043410e0, 0xc001aa2688, 0xc004ab5e90, 0xc004bfb920, 0x19, 0x0, 0x0, 0xa, ...)
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:87 +0x66 fp=0xc000067e70 sp=0xc000067d90 pc=0x14f70a6
github.com/pteich/caddy-tlsconsul.ConsulStorage.Lock.func1(0xc0050240c0, 0xc0015d2120, 0xc005faaae0, 0x1c)
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:75 +0xb8 fp=0xc000067fc0 sp=0xc000067e70 pc=0x14fa3d8
runtime.goexit()
        runtime/asm_amd64.s:1371 +0x1 fp=0xc000067fc8 sp=0xc000067fc0 pc=0x4728a1
created by github.com/pteich/caddy-tlsconsul.ConsulStorage.Lock
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:73 +0x68e

goroutine 1 [select (no cases), 335 minutes]:
github.com/caddyserver/caddy/v2/cmd.cmdRun(0xc00003a240, 0x0, 0x0, 0x0)
        github.com/caddyserver/caddy/v2@v2.4.3/cmd/commandfuncs.go:276 +0x1395
github.com/caddyserver/caddy/v2/cmd.Main()
        github.com/caddyserver/caddy/v2@v2.4.3/cmd/main.go:85 +0x25b
main.main()
        caddy/main.go:15 +0x25

goroutine 9 [select, 335 minutes]:
github.com/caddyserver/certmagic.(*RingBufferRateLimiter).permit(0xc000097180)
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:216 +0xb2
github.com/caddyserver/certmagic.(*RingBufferRateLimiter).loop(0xc000097180)
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:89 +0xa8
created by github.com/caddyserver/certmagic.NewRateLimiter
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:45 +0x148

goroutine 49 [chan receive, 335 minutes]:
github.com/caddyserver/caddy/v2.trapSignalsCrossPlatform.func1()
        github.com/caddyserver/caddy/v2@v2.4.3/sigtrap.go:42 +0x129
created by github.com/caddyserver/caddy/v2.trapSignalsCrossPlatform
        github.com/caddyserver/caddy/v2@v2.4.3/sigtrap.go:37 +0x35

goroutine 50 [chan receive, 335 minutes]:
github.com/caddyserver/caddy/v2.trapSignalsPosix.func1()
        github.com/caddyserver/caddy/v2@v2.4.3/sigtrap_posix.go:34 +0x139
created by github.com/caddyserver/caddy/v2.trapSignalsPosix

The whole Caddy process died. Any hint as to what could have gone wrong?

dazoot commented 3 years ago

Any hint on this one? It's occurring daily.

dazoot commented 3 years ago

Now it happened while getting a new certificate for the first time:

{"level":"info","ts":1624962005.2610888,"logger":"tls","msg":"served key authentication certificate","server_name":"domain.org","challenge":"tls-alpn-01","remote":"3.142.122.14:55728","distributed":true}
{"level":"info","ts":1624962005.6160824,"logger":"tls","msg":"served key authentication certificate","server_name":"domain.org","challenge":"tls-alpn-01","remote":"34.221.186.243:58126","distributed":true}
{"level":"info","ts":1624962006.940489,"logger":"tls","msg":"served key authentication certificate","server_name":"domain.org","challenge":"tls-alpn-01","remote":"66.133.109.36:56752","distributed":true}
{"level":"info","ts":1624962008.757574,"logger":"tls.obtain","msg":"acquiring lock","identifier":"domain.org"}
{"level":"info","ts":1624962008.8545318,"logger":"tls.obtain","msg":"lock acquired","identifier":"domain.org"}
{"level":"info","ts":1624962008.8653884,"logger":"tls.obtain","msg":"certificate already exists in storage","identifier":"domain.org"}
{"level":"info","ts":1624962008.8654132,"logger":"tls.obtain","msg":"releasing lock","identifier":"domain.org"}
fatal error: concurrent map read and map write

goroutine 1084603 [running]:
runtime.throw(0x18a199d, 0x21)
        runtime/panic.go:1117 +0x72 fp=0xc0009b9520 sp=0xc0009b94f0 pc=0x438652
runtime.mapaccess2_faststr(0x1641de0, 0xc001ca0930, 0xc005394ae0, 0x1c, 0xc001a15e60, 0xc00173f618)
        runtime/map_faststr.go:116 +0x4a5 fp=0xc0009b9590 sp=0xc0009b9520 pc=0x414505
github.com/pteich/caddy-tlsconsul.ConsulStorage.Unlock(0x0, 0x0, 0xc00126b680, 0xc00008e270, 0xc001ca0930, 0xc0011548a0, 0x19, 0x0, 0x0, 0xa, ...)
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:87 +0x66 fp=0xc0009b9670 sp=0xc0009b9590 pc=0x14f70a6
github.com/pteich/caddy-tlsconsul.ConsulStorage.Lock.func1(0xc00173f5c0, 0xc002152d80, 0xc005394ae0, 0x1c)
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:75 +0xb8 fp=0xc0009b97c0 sp=0xc0009b9670 pc=0x14fa3d8
runtime.goexit()
        runtime/asm_amd64.s:1371 +0x1 fp=0xc0009b97c8 sp=0xc0009b97c0 pc=0x4728a1
created by github.com/pteich/caddy-tlsconsul.ConsulStorage.Lock
        github.com/pteich/caddy-tlsconsul@v1.3.2/storage.go:73 +0x68e

goroutine 1 [select (no cases), 2384 minutes]:
github.com/caddyserver/caddy/v2/cmd.cmdRun(0xc0001a97a0, 0x0, 0x0, 0x0)
        github.com/caddyserver/caddy/v2@v2.4.3/cmd/commandfuncs.go:276 +0x1395
github.com/caddyserver/caddy/v2/cmd.Main()
        github.com/caddyserver/caddy/v2@v2.4.3/cmd/main.go:85 +0x25b
main.main()
        caddy/main.go:15 +0x25

goroutine 24 [select, 2384 minutes]:
github.com/caddyserver/certmagic.(*RingBufferRateLimiter).permit(0xc0001b7d60)
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:216 +0xb2
github.com/caddyserver/certmagic.(*RingBufferRateLimiter).loop(0xc0001b7d60)
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:89 +0xa8
created by github.com/caddyserver/certmagic.NewRateLimiter
        github.com/caddyserver/certmagic@v0.14.0/ratelimiter.go:45 +0x148
dazoot commented 3 years ago

@pteich can we sponsor this issue?

pteich commented 3 years ago

Hi @dazoot - I'm really sorry, but I probably missed the issue notification in all the noise 😢 That should not happen. I'll have a look at whether it's just an issue I can fix with a mutex or whether it's something upstream (Consul lib) related.

pteich commented 3 years ago

@dazoot Thanks for reporting this problem. There was indeed a map that was not protected against concurrent reads and writes. This is fixed now. I'll create a new version, v1.3.3, that includes this fix.

dazoot commented 3 years ago

It now seems the TLS obtain/renew process is stuck acquiring the lock:

{"level":"info","ts":1625564182.286606,"logger":"tls.obtain","msg":"acquiring lock","identifier":"nl.domain.ro"}
pteich commented 3 years ago

Strange, it worked in my local setup, but I'll check it out on a larger installation.

pteich commented 3 years ago

@dazoot I've introduced a fix in another new version, v1.3.4, which I have now had running for over 12 hours while successfully receiving new certificates. I now consider this bug finally resolved.

dazoot commented 2 years ago

After a couple of weeks I have noticed that we still have stuck certificates. After checking the nodes, I see that they get stuck waiting for the lock.

{"level":"info","ts":1626567854.1713824,"logger":"tls.renew","msg":"acquiring lock","identifier":"cdn.domain.ro"}

Can this lock acquisition be set to time out after a while?

pteich commented 2 years ago

This is a log message from Caddy (or rather certmagic), but you don't see any errors or messages after it appears? I'll probably have to add some debug logs to see whether the Consul plugin is even called and where it breaks. By now there are two locks: a local lock for quickly checking whether the instance already holds the lock (this is the one with the concurrency problems before), and the real distributed Consul lock. I can add a lock wait time for acquiring the Consul lock. Maybe in some situations it just takes too long, or Consul is unresponsive and everything gets stuck.

I'll add this, maybe as a new config option. I'll also add debug logs to produce more helpful messages in such cases.

dazoot commented 2 years ago
    {"level":"info","ts":1626603361.6669,"logger":"tls.cache.maintenance","msg":"attempting certificate renewal","identifiers":["nl.dom.ro"],"remaining":2588554.333102271}
    {"level":"info","ts":1626603361.858531,"logger":"tls.renew","msg":"acquiring lock","identifier":"nl.dom.ro"}

I have set renew_interval very high (e.g. 15d) on all Caddy nodes except one, so just one node actually handles certificate renewals.

Even with a single node doing the work, the lock acquisition never completes. Stuck in limbo :)

{"level":"info","ts":1626596162.0471478,"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal","identifiers":["img.dom.ro"],"remaining":2591886.952852568}
{"level":"info","ts":1626596162.0472286,"logger":"tls.cache.maintenance","msg":"attempting certificate renewal","identifiers":["img.dom.ro"],"remaining":2591886.952771762}
{"level":"info","ts":1626596162.3396833,"logger":"tls.renew","msg":"acquiring lock","identifier":"img.dom.ro"}
{"level":"info","ts":1626599762.3353257,"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal","identifiers":["img.dom.ro"],"remaining":2588286.66467504}
{"level":"info","ts":1626603361.6667585,"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal","identifiers":["img.dom.ro"],"remaining":2584687.333243155}
{"level":"info","ts":1626606961.9950764,"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal","identifiers":["img.dom.ro"],"remaining":2581087.004924088}
{"level":"info","ts":1626607212.4525259,"logger":"tls.renew","msg":"acquiring lock","identifier":"img.dom.ro"}
dazoot commented 2 years ago

Is the real distributed Consul lock required? Maybe we can make it optional.

pteich commented 2 years ago

To my understanding, it is needed in a Caddy cluster to share the lock across the instances, so that only one instance can renew or apply for new certificates.

I've changed the code in master to try only once to acquire the lock in Consul and otherwise just fail. I think that is enough for this use case. If it gets the lock, everything is OK. If not, another Caddy instance probably holds it (or an error occurred), and therefore it's fine to return an error. There is no need to wait forever.

The code now also uses the timeout in seconds that can already be configured for lock wait timeouts:

storage consul {
    # ...
    timeout 10
    # ...
}
dazoot commented 2 years ago

Great. Can you create a release, please?

pteich commented 2 years ago

Done!

Out of curiosity: how many domains and requests does your cluster roughly serve? I see nearly 2.5 million hits/day on a 4-node Caddy cluster (but fewer than 100 domains) and never ran into similar problems. So thanks for your interesting findings that helped improve things ;)

dazoot commented 2 years ago

We have about 4000 domains, far fewer requests (link rewriting mostly), and about 7 nodes. Will try it out soon. Thanks.

dazoot commented 2 years ago

Still hangs. We have ~50 certs to renew. After a restart it renews about 5, then I see a couple of "tls.renew","msg":"acquiring lock" log messages and nothing more.

It does not seem to reach the Consul locking part (e.g. the timeout or the logging you added).

pteich commented 2 years ago

Strange. I've added some simple debug messages in this version that at least show whether it reaches my code. But I'm not sure how to enable debug logging in Caddy.

dazoot commented 2 years ago

Could it be the local locking, and not the Consul locking, that blocks?

dazoot commented 2 years ago

And I think the sink logging in Caddy uses INFO as the log level; I did not find a way to change it. Can we change it in the module from DEBUG to INFO? In general there should not be that much noise from this module, so INFO would be OK.

pteich commented 2 years ago

I'll change it to info logging. Hopefully we can at least see where it got stuck.

dazoot commented 2 years ago

I did some local tests (changed Debugf to Infof).

It does not get past the local locking; it never reaches the Consul part.

cat /var/log/caddyserver.log | grep caddy.storage | grep -v "loading data from"  | grep scuba
{"level":"info","ts":1626716859.5458074,"logger":"caddy.storage.consul","msg":"trying lock for issue_cert_nl.scubadomain.tld"}

Next, there should be an attempt to create the Consul lock:

func (cs *ConsulStorage) Lock(ctx context.Context, key string) error {
    cs.logger.Infof("trying lock for %s", key)

    if _, isLocked := cs.GetLock(key); isLocked {
        return nil
    }

    // prepare the distributed lock
    cs.logger.Infof("creating Consul lock for %s", key)
pteich commented 2 years ago

I'm pretty sure it got stuck because another caller still holds the mutex. The reason could be the unlock function.

I've changed the code for this and created a new release. I also switched from debug to info logging, as you did locally.

dazoot commented 2 years ago

New release?

pteich commented 2 years ago

I've already created v1.3.6

dazoot commented 2 years ago

Seems OK now. All certs were renewed, but via ZeroSSL (the fallback). Let's Encrypt was down a while ago.

So for some hosts I now have 2 certs: one from LE, which expires in 20 days, and a fresh one from ZeroSSL.

Is this handled by Caddy or by this module?

pteich commented 2 years ago

This is handled by Certmagic inside Caddy and should be no problem (at least I saw the same thing some time ago with some domains). This module just loads and saves the data on request.

dazoot commented 2 years ago

So the expired cert will eventually be deleted?

pteich commented 2 years ago

Exactly, the Certmagic Storage interface contains a Delete function for this, and this module implements it.