nigoroll / libvmod-dynamic

The Varnish dns/named director continued
BSD 2-Clause "Simplified" License
95 stars 34 forks source link

Fixing domain purge data race #82

Closed jake-dog closed 1 year ago

jake-dog commented 2 years ago

What's new

Why

There are several scenarios to consider when fixing #81. A dynamic_domain could be linked to zero or more dynamic_services, and/or a dynamic_domain linked to dynamic_service could be called directly via d.backend. A simple reference counter accommodates all of these possibilities, and prevents all services/domain timeout related data races. In a word, it is bombproof.

Timeout initiated purges will be blocked for all dynamic_domain structs referenced by dynamic_services. Only after a dynamic_service is purged will the corresponding dynamic_domain ref counter(s) be decremented.

While adding the ref counter logic I discovered two other unexpected behaviors which were linked to each other:

These behaviors together formed a partially protective layer against early purge of service linked domains. The addition of a ref counter to service linked domains removes the need for previous workarounds, fixes all early purge vulnerabilities, and makes the behavior of d.backend and d.service consistent and predictable.

This branch passed the r81.vtc test, and produces the following expected pattern of dynamic backends:

**** v1    0.6 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Lookup: 1645504100.142287 0.000000 0.000000
**** v1    0.6 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Results: 1645504100.142864 0.000577 0.000577
**** v1    0.6 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Lookup: 1645504100.143254 0.000000 0.000000
**** v1    0.6 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Results: 1645504100.143684 0.000430 0.000430
**** v1    0.6 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Update: 1645504100.143806 0.000552 0.000122
**** v1    0.6 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Update: 1645504100.143882 0.001595 0.001019
**** v1   15.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Lookup: 1645504115.210968 0.000000 0.000000
**** v1   15.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Results: 1645504115.211497 0.000529 0.000529
**** v1   15.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Lookup: 1645504115.211926 0.000000 0.000000
**** v1   15.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Results: 1645504115.212421 0.000495 0.000495
**** v1   15.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Update: 1645504115.212502 0.000576 0.000081
**** v1   15.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Update: 1645504115.212528 0.001560 0.001031
**** v1   18.8 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Done: 1645504118.317811 0.000000 0.000000
**** v1   18.8 vsl|       1039 VCL_Log         c vmod-dynamic: vcl1 d1 _http._tcp.p-test-2.example.svc.cluster.local timeout
**** v1   19.8 vsl|       1041 VCL_Log         c vmod-dynamic: vcl1 d1 _http._tcp.p-test-2.example.svc.cluster.local deleted

Performance Considerations

When an existing service is resolved, and no changes are made, no additional locking is necessary, however extra looping over srv->prios is still required to determine if refcounts changed. Only when an existing service is resolved, and domains are removed from it, is additional locking required to decrement refcounts.

An additional loop of srv->prios was also added to each d.service call to ensure that last_used was updated for all linked domains, ensuring that domain timeout behavior will be consistent when the service changes or is removed.

In testing with a fairly small number of service backends (5-10), each containing only a single A record, I saw no measurable difference in CPU usage at high transaction rates (3k-5k req/s).

nigoroll commented 2 years ago

I have neglected this PR so far for two reasons, just FYI:

nigoroll commented 1 year ago

Hopefully superseded by dc08a31678f5e9ce0f5f2469e6b9bcdc728e1077