Observed on b72c723acff5b2ef46c9de8cef036cee3a380a64 (6.0 branch).
A deadlock can occur when looking up a service record with a vmod_dynamic.resolver that has run out of contexts (parallel argument to resolver constructor). One thread will wait infinitely for a context so it may resolve the domain returned by the SRV record, and the other thread will wait infinitely for the dom->resolve condition to be set before it releases the context.
Per the following thread stack traces: thread 957 is waiting for the condition dom->resolve while holding a resolver context; thread 972 is waiting for a resolver context so it can set the condition dom->resolve.
Thread 957 (Thread 0x7fcf717ea700 (LWP 350411)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x0000000000436bf6 in Lck_CondWait ()
#2 0x00007fd1b4df21c9 in service_doms (prios=0x7fd1b0631080, obj=0x7fd1c341a440, ctx=0x7fcf717e9e10) at vmod_dynamic_service.c:266
#3 service_update (res=0x7fd1b4ffb360 <res_getdns>, now=1642054171.002723, priv=<optimized out>, srv=0x7fd16a822300) at vmod_dynamic_service.c:424
#4 service_lookup_thread (priv=0x7fd16a822300) at vmod_dynamic_service.c:496
#5 0x00007fd1c405eea5 in start_thread (arg=0x7fcf717ea700) at pthread_create.c:307
#6 0x00007fd1c3d879fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Thread 972 (Thread 0x7fcf6a7dc700 (LWP 350426)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fd1b4df5063 in dyn_getdns_get_context (r=r@entry=0x7fd1c341da80) at dyn_getdns.c:64
#2 0x00007fd1b4df5f16 in getdns_lookup (r=0x7fd1c341da80, node=0x7fd1c34d5dc0 "redacted.example.svc.cluster.local.", service=0x7fd1c34d7120 "42667", priv=0x7fcf6a7dbc70) at dyn_resolver_getdns.c:183
#3 0x00007fd1b4ded90e in dynamic_lookup_thread (priv=0x7fd1c35c6700) at vmod_dynamic.c:490
#4 0x00007fd1c405eea5 in start_thread (arg=0x7fcf6a7dc700) at pthread_create.c:307
#5 0x00007fd1c3d879fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
The possibility of a deadlock is the result of the context not being released until after the service is updated in the function service_lookup_thread.
To eliminate the possibility of a deadlock the resolver context needs to be released (res->srv_fini(&res_priv) from line 517) prior to locking/waiting on the dom->resolve condition. This could probably be done most easily by moving the srv_fini() call to line 423, just before the call to service_doms(), and adding another srv_fini() to handle the else condition.
A possible workaround for anyone impacted by this issue is set a larger value for parallel (defaut: 16) in the resolver constructor.
Observed on b72c723acff5b2ef46c9de8cef036cee3a380a64 (
6.0
branch).A deadlock can occur when looking up a service record with a
vmod_dynamic.resolver
that has run out of contexts (parallel
argument to resolver constructor). One thread will wait infinitely for a context so it may resolve the domain returned by the SRV record, and the other thread will wait infinitely for thedom->resolve
condition to be set before it releases the context.Per the following thread stack traces: thread 957 is waiting for the condition
dom->resolve
while holding a resolver context; thread 972 is waiting for a resolver context so it can set the conditiondom->resolve
.The possibility of a deadlock is the result of the context not being released until after the service is updated in the function
service_lookup_thread
.A resolver context is obtained https://github.com/nigoroll/libvmod-dynamic/blob/b72c723acff5b2ef46c9de8cef036cee3a380a64/src/vmod_dynamic_service.c#L488
Update service, which requires condition be set by other thread(s) attaining resolver context https://github.com/nigoroll/libvmod-dynamic/blob/b72c723acff5b2ef46c9de8cef036cee3a380a64/src/vmod_dynamic_service.c#L496
Release context obtained in step 1 https://github.com/nigoroll/libvmod-dynamic/blob/b72c723acff5b2ef46c9de8cef036cee3a380a64/src/vmod_dynamic_service.c#L517
To eliminate the possibility of a deadlock the resolver context needs to be released (
res->srv_fini(&res_priv)
from line 517) prior to locking/waiting on thedom->resolve
condition. This could probably be done most easily by moving thesrv_fini()
call to line 423, just before the call toservice_doms()
, and adding anothersrv_fini()
to handle theelse
condition.A possible workaround for anyone impacted by this issue is set a larger value for
parallel
(defaut: 16) in the resolver constructor.