fix `dynamic_domain` timeout behavior for `dynamic_domains` linked to `dynamic_services`
## Why
There are several scenarios to consider when fixing #81. A `dynamic_domain` could be linked to zero or more `dynamic_services`, and/or a `dynamic_domain` linked to a `dynamic_service` could be called directly via `d.backend`. A simple reference counter accommodates all of these possibilities and prevents all service/domain timeout-related data races. In a word, it is bombproof.
Timeout-initiated purges will be blocked for all `dynamic_domain` structs referenced by `dynamic_services`. Only after a `dynamic_service` is purged will the corresponding `dynamic_domain` ref counter(s) be decremented.
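As a rough illustration of the scheme, here is a minimal sketch in C. It is not the actual vmod source; the struct layout and the names `dyn_domain`, `service_refs`, `domain_ref`, `domain_unref`, and `domain_may_purge` are all hypothetical.

```c
#include <pthread.h>
#include <assert.h>

/* Hypothetical domain struct: each domain counts the services that
 * currently reference it. */
struct dyn_domain {
	pthread_mutex_t	mtx;
	unsigned	service_refs;	/* services linked to this domain */
	double		last_used;	/* timestamp of last lookup */
};

/* Taken when a service links this domain ... */
static void
domain_ref(struct dyn_domain *dom)
{
	pthread_mutex_lock(&dom->mtx);
	dom->service_refs++;
	pthread_mutex_unlock(&dom->mtx);
}

/* ... and released only when the service itself is purged. */
static void
domain_unref(struct dyn_domain *dom)
{
	pthread_mutex_lock(&dom->mtx);
	assert(dom->service_refs > 0);
	dom->service_refs--;
	pthread_mutex_unlock(&dom->mtx);
}

/* Timeout-initiated purge check: a referenced domain is never purged,
 * no matter how stale last_used looks. */
static int
domain_may_purge(struct dyn_domain *dom, double now, double usage_tmo)
{
	int purge;

	pthread_mutex_lock(&dom->mtx);
	purge = (dom->service_refs == 0 &&
	    now - dom->last_used > usage_tmo);
	pthread_mutex_unlock(&dom->mtx);
	return (purge);
}
```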
While adding the ref counter logic I discovered two other unexpected behaviors that were linked to each other:

- the SRV resolve TTL was limited to `0.5 * domain_usage_tmo` or `obj->ttl`, whichever was smaller (sketched below)
- domains exclusively linked to services had `last_used` updated only when an SRV resolve succeeded
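The first behavior amounts to a TTL clamp along these lines. This is a sketch under assumed names; `srv_effective_ttl` is not a real function in the module:

```c
/* Hypothetical illustration of the old clamp: the effective deadline
 * for the next SRV re-resolve was the smaller of the record TTL and
 * half the domain usage timeout. */
static double
srv_effective_ttl(double record_ttl, double domain_usage_tmo)
{
	double cap = 0.5 * domain_usage_tmo;

	return (record_ttl < cap ? record_ttl : cap);
}
```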
Together, these behaviors formed a partially protective layer against early purges of service-linked domains. Adding a ref counter to service-linked domains removes the need for these workarounds, fixes all early-purge vulnerabilities, and makes the behavior of `d.backend` and `d.service` consistent and predictable.
This branch passes the `r81.vtc` test and produces the expected pattern of dynamic backends.

## Performance Considerations
When an existing service is resolved and no changes are made, no additional locking is necessary; however, an extra loop over `srv->prios` is still required to determine whether any refcounts changed. Only when an existing service is resolved and domains are removed from it is additional locking required to decrement the refcounts.
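A sketch of that reconciliation, again with hypothetical types and names; `srv_entry`, `service_reconcile`, and the `seen` flag are illustrative, not the module's actual data structures:

```c
struct dyn_domain;			/* see the earlier sketch */
void domain_unref(struct dyn_domain *);	/* drops a ref under the lock */

/* One entry per domain in a service's priority groups. */
struct srv_entry {
	struct dyn_domain	*dom;
	int			seen;	/* set while walking the new result */
	struct srv_entry	*next;
};

static void
service_reconcile(struct srv_entry *prios)
{
	struct srv_entry *e;

	/* Fast path: every domain was seen again, so no refcounts
	 * changed and no locking is needed, although we still had to
	 * loop over the list to find that out. */
	for (e = prios; e != NULL; e = e->next)
		if (!e->seen)
			break;
	if (e == NULL)
		return;

	/* Slow path: some domains dropped out of the service, so
	 * their references are released (locking happens inside
	 * domain_unref()). */
	for (; e != NULL; e = e->next)
		if (!e->seen)
			domain_unref(e->dom);
}
```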
An additional loop over `srv->prios` was also added to each `d.service` call so that `last_used` is updated for all linked domains, keeping domain timeout behavior consistent when the service changes or is removed.
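That per-call refresh could look something like the following sketch, with the same caveat that the types and names are illustrative, simplified from the sketches above:

```c
struct dyn_domain {
	double	last_used;	/* timestamp of last use */
};

struct srv_entry {
	struct dyn_domain	*dom;
	struct srv_entry	*next;
};

/* Run on every d.service call: touch every linked domain so none of
 * them can look idle while the service itself is in use. */
static void
service_touch_domains(struct srv_entry *prios, double now)
{
	struct srv_entry *e;

	for (e = prios; e != NULL; e = e->next)
		e->dom->last_used = now;
}
```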
In testing with a fairly small number of service backends (5-10), each containing only a single A record, I saw no measurable difference in CPU usage at high transaction rates (3k-5k req/s).