nigoroll / libvmod-dynamic

The Varnish dns/named director continued
BSD 2-Clause "Simplified" License
95 stars 34 forks source link

Deadlock on service lookup when all contexts in use #78

Closed jake-dog closed 2 years ago

jake-dog commented 2 years ago

Observed on b72c723acff5b2ef46c9de8cef036cee3a380a64 (6.0 branch).

A deadlock can occur when looking up a service record with a vmod_dynamic.resolver that has run out of contexts (parallel argument to resolver constructor). One thread will wait infinitely for a context so it may resolve the domain returned by the SRV record, and the other thread will wait infinitely for the dom->resolve condition to be set before it releases the context.

Per the following thread stack traces: thread 957 is waiting for the condition dom->resolve while holding a resolver context; thread 972 is waiting for a resolver context so it can set the condition dom->resolve.

Thread 957 (Thread 0x7fcf717ea700 (LWP 350411)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x0000000000436bf6 in Lck_CondWait ()
#2  0x00007fd1b4df21c9 in service_doms (prios=0x7fd1b0631080, obj=0x7fd1c341a440, ctx=0x7fcf717e9e10) at vmod_dynamic_service.c:266
#3  service_update (res=0x7fd1b4ffb360 <res_getdns>, now=1642054171.002723, priv=<optimized out>, srv=0x7fd16a822300) at vmod_dynamic_service.c:424
#4  service_lookup_thread (priv=0x7fd16a822300) at vmod_dynamic_service.c:496
#5  0x00007fd1c405eea5 in start_thread (arg=0x7fcf717ea700) at pthread_create.c:307
#6  0x00007fd1c3d879fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 972 (Thread 0x7fcf6a7dc700 (LWP 350426)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fd1b4df5063 in dyn_getdns_get_context (r=r@entry=0x7fd1c341da80) at dyn_getdns.c:64
#2  0x00007fd1b4df5f16 in getdns_lookup (r=0x7fd1c341da80, node=0x7fd1c34d5dc0 "redacted.example.svc.cluster.local.", service=0x7fd1c34d7120 "42667", priv=0x7fcf6a7dbc70) at dyn_resolver_getdns.c:183
#3  0x00007fd1b4ded90e in dynamic_lookup_thread (priv=0x7fd1c35c6700) at vmod_dynamic.c:490
#4  0x00007fd1c405eea5 in start_thread (arg=0x7fcf6a7dc700) at pthread_create.c:307
#5  0x00007fd1c3d879fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

The possibility of a deadlock is the result of the context not being released until after the service is updated in the function service_lookup_thread.

  1. A resolver context is obtained https://github.com/nigoroll/libvmod-dynamic/blob/b72c723acff5b2ef46c9de8cef036cee3a380a64/src/vmod_dynamic_service.c#L488

  2. Update service, which requires condition be set by other thread(s) attaining resolver context https://github.com/nigoroll/libvmod-dynamic/blob/b72c723acff5b2ef46c9de8cef036cee3a380a64/src/vmod_dynamic_service.c#L496

  3. Release context obtained in step 1 https://github.com/nigoroll/libvmod-dynamic/blob/b72c723acff5b2ef46c9de8cef036cee3a380a64/src/vmod_dynamic_service.c#L517

To eliminate the possibility of a deadlock the resolver context needs to be released (res->srv_fini(&res_priv) from line 517) prior to locking/waiting on the dom->resolve condition. This could probably be done most easily by moving the srv_fini() call to line 423, just before the call to service_doms(), and adding another srv_fini() to handle the else condition.

A possible workaround for anyone impacted by this issue is set a larger value for parallel (defaut: 16) in the resolver constructor.

nigoroll commented 2 years ago

This is an excellent report, thank you. I will look into it.