Closed jake-dog closed 1 year ago
To whom it may concern,
I've updated the original issue description with simpler reproduction steps, and a better explanation of ways in which the issue can manifest.
I have also updated my proposed solution #82 which fixes several other behaviors associated with service backends.
Thank you, I've noticed this issue, but need to prioritize paid work. Will get back to you.
I let this issue sit open for so long because I knew that we had a use-after-free due to how varnish-cache worked, which we could not safely avoid in this vmod alone. Now that varnish-cache as of 7.3 has gained the proper infrastructure with reference-counted backends, I have implemented the necessary changes (with a lot of ground work to build upon) as of https://github.com/nigoroll/libvmod-dynamic/commit/41f20790aedfc6d8f5cae4b3b8ffb82cd9b302c0. I cannot reproduce the issue from your test case (thank you!), so would you please give it a go? You need varnish 7.3 or later. (edit: grammar, polish)
I regret to report that I was able to reproduce the issue. Hopefully it's user error, but if so, I haven't figured out what I did wrong.

I started a debian:11 container, installed varnish 7.3.0, compiled 41f20790aedfc6d8f5cae4b3b8ffb82cd9b302c0, and re-ran r81.vtc, which returned the following assertion error:
**** v1 CLI RX|Panic at: Sat, 01 Jul 2023 18:31:29 GMT
**** v1 CLI RX|Assert error in service_resolve(), vmod_dynamic_service.c line 159:
**** v1 CLI RX| Condition((t->dom)->magic == 0x1bfe1345) not true.
**** v1 CLI RX|version = varnish-7.3.0 revision 84d79120b6d17b11819a663a93160743f293e63f, vrt api = 17.0
**** v1 CLI RX|ident = Linux,6.2.6-76060206-generic,x86_64,-junix,-sdefault,-sdefault,-hcritbit,epoll
**** v1 CLI RX|now = 65401.230598 (mono), 1688236289.146731 (real)
**** v1 CLI RX|Backtrace:
**** v1 CLI RX| 0x5618eee6cd9c: varnishd(+0x5ad9c) [0x5618eee6cd9c]
**** v1 CLI RX| 0x5618eeeea578: varnishd(VAS_Fail+0x18) [0x5618eeeea578]
**** v1 CLI RX| 0x7f64303b53b9: ./vmod_cache/_vmod_dynamic.99f589bd7d2f24a756baa78d3445c179c9428e09634fc67b82af176f80022348(+0x93b9) [0x7f64303b53b9]
**** v1 CLI RX| 0x5618eee4da93: varnishd(VRT_DirectorResolve+0x43) [0x5618eee4da93]
**** v1 CLI RX| 0x5618eee4db8e: varnishd(+0x3bb8e) [0x5618eee4db8e]
**** v1 CLI RX| 0x5618eee4dcf0: varnishd(VDI_GetHdr+0x60) [0x5618eee4dcf0]
**** v1 CLI RX| 0x5618eee59693: varnishd(+0x47693) [0x5618eee59693]
**** v1 CLI RX| 0x5618eee588d6: varnishd(+0x468d6) [0x5618eee588d6]
**** v1 CLI RX| 0x5618eee94322: varnishd(+0x82322) [0x5618eee94322]
**** v1 CLI RX| 0x5618eee948d4: varnishd(+0x828d4) [0x5618eee948d4]
Looking through the test log, the early purge of the domain `p-test-3.example.svc.cluster.local`, noted in the issue summary, still occurs.
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Lookup: 1688236272.014368 0.000000 0.000000
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Results: 1688236272.014684 0.000317 0.000317
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Lookup: 1688236272.014757 0.000000 0.000000
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Results: 1688236272.014947 0.000190 0.000190
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Update: 1688236272.015109 0.000352 0.000162
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Update: 1688236272.015133 0.000766 0.000449
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Lookup: 1688236273.525239 0.000000 0.000000
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Results: 1688236278.550587 5.025348 5.025348
**** v1 vsl| 0 Error - vmod-dynamic: vcl1 d1 _http._tcp.p-test-3.example.svc.cluster.local getdns 902 (All queries for the name timed out)
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Lookup: 1688236287.040717 0.000000 0.000000
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Results: 1688236287.040961 0.000245 0.000245
**** v1 vsl| 0 VCL_Log - vmod-dynamic: vcl1 d1 p-test-3.example.svc.cluster.local.:8893 timeout
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Done: 1688236287.041001 0.000000 0.000000
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Lookup: 1688236287.041046 0.000000 0.000000
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Results: 1688236287.041254 0.000208 0.000208
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Update: 1688236287.041366 0.000321 0.000113
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Update: 1688236287.041388 0.000672 0.000427
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Lookup: 1688236288.551718 0.000000 0.000000
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Results: 1688236288.552140 0.000422 0.000422
**** v1 vsl| 0 VCL_Log - vmod-dynamic: vcl1 d1 p-test-3.example.svc.cluster.local.:8893 deleted
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Update: 1688236288.552416 0.000698 0.000276
I am still reviewing the code changes in 41f2079 and varnish-cache 7.3, so that I might better understand where things went wrong.
@jake-dog thank you very much, I am going to stare at it again.
See the issue linked above; it appears to me that reference counting cannot work yet.
... but still this issue is independent: services need to take references on domains within vmod_dynamic. I'll have a look.
@jake-dog can you please try dc08a31678f5e9ce0f5f2469e6b9bcdc728e1077 and reopen this ticket if it does not fix the issue? Thank you.
It seems that dc08a31 resolves the use-after-free issue, but introduces unexpected behaviors I attempted to address in #82, notably:

_the `dynamic_domain` referenced by the service is still purged, resulting in `director ... returned no backend`_
At first glance this seems like the correct behavior, but it's inconsistent with the behavior of `dynamic_domain` by itself. For instance, if the nameserver were to become inaccessible, a standalone `dynamic_domain` can continue to be used so long as it keeps being requested. A `dynamic_domain` linked to a `dynamic_service`, on the other hand, will be purged as the result of a nameserver outage even if the `dynamic_service` continues to be used. Further, I would argue that a nameserver being inaccessible/down isn't a good reason to stop attempting to reach already-resolved backends.
_the idiosyncratic behavior of limiting the SRV resolve TTL to 0.5 * `domain_usage_tmo` remains_
I could be wrong but I believe the only reason this behavior exists is as a mitigation for the use-after-free issue resolved in dc08a31.
However, I believe those outstanding issues are better suited to new issues. Great work, and thank you!
@jake-dog yes, it took me a detour to realize it in #102, but the lifetime issue should be fixed for real now. Sorry for noticing your comment so late.
I will look at the TTL now.
2d51da74680e849c8411bcd6d85cd53c149b3052
Haven't had much time to test it, but so far 2d51da7 looks great.

The only oddity I see is that when running r81.vtc it logs that it deleted `p-test-3`, but continues to use it:
**** v1 vsl| 0 VCL_Log - vmod-dynamic vcl1 d1 p-test-3.example.svc.cluster.local.:8893 deleted (expired)
**** v1 vsl| 1007 FetchError b backend d1.p-test-3.example.svc.cluster.local.(127.0.0.1:8893): fail errno 111 (Connection refused)
**** v1 vsl| 1006 ReqMethod c GET
**** v1 vsl| 1006 ReqURL c /
**** v1 vsl| 1006 RespStatus c 503
To be clear: this is the desired behavior for backends; the logging is just misleading.
I will be running more tests in the next few weeks, and will certainly create an issue if I discover anything more nefarious.
Observed on 5c1c68b (w/ varnish 6.0.9) and b72c723 (w/ varnish 6.0.7).
Full panic output from VTC test log.
Reproduction steps
A gist has been created with a VTC that reproduces the panic. Please follow the instructions in the gist, which include installing and configuring dnsmasq.
Other test conditions, such as changing A names in SRV responses, have also succeeded in reproducing the assertion error.
Investigation
The assertion error was not reproducible when `dynamic_domain` usage timeout checks were hardcoded out, blocking timeout purges in the `dynamic_search` function, demonstrating that the bug occurs when a `dynamic_domain` is purged while still referenced by a `dynamic_service`.

The life cycles of `dynamic_domain` structs, and the `dynamic_service` structs which reference them, are not tightly coupled. Under specific conditions data races are possible.

One such scenario, as demonstrated in the provided VTC, involves three components:
- `dynamic_domain` structs exclusively referenced by a `dynamic_service` only have their `last_used` property updated when SRV resolve is successful (code)
- `obj->ttl` when the previous DNS resolve fails (code)
- `dynamic_domain`, any active domain which doesn't match is subject to a timeout purge (code)

With these components a panic can be instigated via a simple sequence of requests and DNS failures:
1. Request `d.service` with DNS running
2. Request the `d.service` from step 1 until a DNS resolve timeout occurs
3. Request a different `d.service` to trigger a timeout purge of the `d.service` from step 1
4. Request the `d.service` from step 1 until panic

This sequence of lookups, and purges, looks like this (from the provided VTC log):
Note that the service `_http._tcp.p-test-3.example.svc.cluster.local` is still active, and has not changed, but the domain it points to, `p-test-3.example.svc.cluster.local`, has timed out and was eventually deleted.

Solutions
Though not completely without the possibility of data races, and not tested, the following should avoid the bulk of the race conditions (based on 5c1c68b).
The problems with the trivial solution include:
I've sent #82 introducing minimal reference-counting logic to block `dynamic_domain` timeout purges while referenced by a `dynamic_service`, thus tightly coupling them.