jake-dog commented 2 years ago

What's new

fix #81 with a ref counter on dynamic_domains linked to dynamic_services
fix SRV resolve TTL so it isn't limited by 0.5 * domain_usage_tmo
fix dynamic_domain timeout behavior for dynamic_domains linked to dynamic_services

Why

There are several scenarios to consider when fixing #81. A dynamic_domain could be linked to zero or more dynamic_services, and/or a dynamic_domain linked to dynamic_service could be called directly via d.backend. A simple reference counter accommodates all of these possibilities, and prevents all services/domain timeout related data races. In a word, it is bombproof.

Timeout initiated purges will be blocked for all dynamic_domain structs referenced by dynamic_services. Only after a dynamic_service is purged will the corresponding dynamic_domain ref counter(s) be decremented.

While adding the ref counter logic I discovered two other unexpected behaviors which were linked to each other:

SRV resolve TTL is limited to 0.5 * domain_usage_tmo or obj->ttl (whichever is smaller)
domains exclusively linked to services had last_used updated only when SRV resolve succeeded

These behaviors together formed a partially protective layer against early purge of service linked domains. The addition of a ref counter to service linked domains removes the need for previous workarounds, fixes all early purge vulnerabilities, and makes the behavior of d.backend and d.service consistent and predictable.

This branch passed the r81.vtc test, and produces the following expected pattern of dynamic backends:

**** v1    0.6 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Lookup: 1645504100.142287 0.000000 0.000000
**** v1    0.6 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Results: 1645504100.142864 0.000577 0.000577
**** v1    0.6 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Lookup: 1645504100.143254 0.000000 0.000000
**** v1    0.6 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Results: 1645504100.143684 0.000430 0.000430
**** v1    0.6 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Update: 1645504100.143806 0.000552 0.000122
**** v1    0.6 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Update: 1645504100.143882 0.001595 0.001019
**** v1   15.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Lookup: 1645504115.210968 0.000000 0.000000
**** v1   15.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Results: 1645504115.211497 0.000529 0.000529
**** v1   15.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Lookup: 1645504115.211926 0.000000 0.000000
**** v1   15.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Results: 1645504115.212421 0.000495 0.000495
**** v1   15.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Update: 1645504115.212502 0.000576 0.000081
**** v1   15.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Update: 1645504115.212528 0.001560 0.001031
**** v1   18.8 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Done: 1645504118.317811 0.000000 0.000000
**** v1   18.8 vsl|       1039 VCL_Log         c vmod-dynamic: vcl1 d1 _http._tcp.p-test-2.example.svc.cluster.local timeout
**** v1   19.8 vsl|       1041 VCL_Log         c vmod-dynamic: vcl1 d1 _http._tcp.p-test-2.example.svc.cluster.local deleted

Performance Considerations

When an existing service is resolved, and no changes are made, no additional locking is necessary, however extra looping over srv->prios is still required to determine if refcounts changed. Only when an existing service is resolved, and domains are removed from it, is additional locking required to decrement refcounts.

An additional loop of srv->prios was also added to each d.service call to ensure that last_used was updated for all linked domains, ensuring that domain timeout behavior will be consistent when the service changes or is removed.

In testing with a fairly small number of service backends (5-10), each containing only a single A record, I saw no measurable difference in CPU usage at high transaction rates (3k-5k req/s).

nigoroll commented 2 years ago

I have neglected this PR so far for two reasons, just FYI:

work priority issues
we need to check if this should really be solved in dynamic, or rather by using the reference counting added to varnish-cache (finally :) )

nigoroll commented 1 year ago

Hopefully superseded by dc08a31678f5e9ce0f5f2469e6b9bcdc728e1077

nigoroll / libvmod-dynamic

Fixing domain purge data race #82

What's new

Why

Performance Considerations