nigoroll / libvmod-dynamic

The Varnish dns/named director continued
BSD 2-Clause "Simplified" License

Assertion error for service with purged dynamic_domain #81

Closed · jake-dog closed 1 year ago

jake-dog commented 2 years ago

Observed on 5c1c68b (w/ varnish 6.0.9) and b72c723 (w/ varnish 6.0.7).

Panic at: Fri, 28 Jan 2022 04:08:36 GMT
Assert error in service_resolve(), vmod_dynamic_service.c line 150:
  Condition((t->dom)->magic == 0x1bfe1345) not true.
version = varnish-6.0.9 revision 3383d3d02c6ee459f20e1f46cfcb9692726a6387, vrt api = 7.1
ident = Linux,5.4.77-flatcar,x86_64,-junix,-smalloc,-sdefault,-hcritbit,epoll
now = 18260503.813988 (mono), 1643342916.576233 (real)
Backtrace:
  0x43c46e: varnishd() [0x43c46e]
  0x49f272: varnishd(VAS_Fail+0x42) [0x49f272]
  0x7fb8159f13da: ./vmod_cache/_vmod_dynamic.04887395d86555991e34ec9874763b01e23b00135a5367941f17a5e0076a9451(+0x83da) [0x7fb8159f13da]
  0x4216a3: varnishd() [0x4216a3]
  0x421d83: varnishd(VDI_GetHdr+0x33) [0x421d83]
  0x42a859: varnishd() [0x42a859]
  0x45a4db: varnishd() [0x45a4db]
  0x45a9c0: varnishd() [0x45a9c0]
  0x7fb8257d2ea5: /lib64/libpthread.so.0(+0x7ea5) [0x7fb8257d2ea5]
  0x7fb8254fbb0d: /lib64/libc.so.6(clone+0x6d) [0x7fb8254fbb0d]

Full panic output from VTC test log.

Reproduction steps

A gist has been created with a VTC that reproduces the panic. Please follow the instructions inside the gist, which include installing and configuring dnsmasq.

Other test conditions, such as changing A names in SRV responses, have also reproduced the assertion error.

Investigation

The assertion error was not reproducible when the dynamic_domain usage-timeout checks were hardcoded out, blocking timeout purges in the dynamic_search function. This demonstrates that the bug occurs when a dynamic_domain is purged while still referenced by a dynamic_service.

The life cycles of dynamic_domain structs and the dynamic_service structs which reference them are not tightly coupled, so under specific conditions data races are possible.
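
For orientation, here is a minimal sketch of that relationship, reduced to the fields that matter for this bug (the struct and field names come from the assertion message and the patch below; the layouts are otherwise assumed):

struct dynamic_domain {
        unsigned                magic;          /* DYNAMIC_DOMAIN_MAGIC (0x1bfe1345) */
        double                  last_used;      /* sole guard against the usage-timeout purge */
        /* ... */
};

struct service_target {
        unsigned                magic;          /* SERVICE_TARGET_MAGIC */
        struct dynamic_domain   *dom;           /* raw pointer, no reference held */
        /* ... */
};

When the usage-timeout purge frees a dynamic_domain, any service_target still pointing at it is left with a dangling dom pointer, and the next CHECK_OBJ_NOTNULL(t->dom, DYNAMIC_DOMAIN_MAGIC) in service_resolve() trips the magic-number assertion seen in the panic above.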

One such scenario, as demonstrated in the provided VTC, involves three components:

  1. dynamic_domain structs exclusively referenced by a dynamic_service only have their last_used property updated when an SRV resolve succeeds (code)
  2. the delay until the next SRV resolve is set to obj->ttl when the previous DNS resolve fails (code)
  3. when retrieving a specific dynamic_domain, any active domain which doesn't match is subject to a timeout purge (code; see the sketch after this list)
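
Component 3 is the dangerous one. A hedged sketch of the purge check in dynamic_search (approximated from the behavior described above, not the literal source):

        if (obj->domain_usage_tmo > 0 &&
            ctx->now - dom->last_used > obj->domain_usage_tmo) {
                /* dom is considered idle and is purged, even though a
                 * dynamic_service may still hold a pointer to it */
        }

Hardcoding this check to false is what made the panic unreproducible in the investigation above.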

Given these components, a panic can be triggered via a simple sequence of requests and DNS failures:

  1. Send a request for a valid d.service with DNS running
  2. Kill/block DNS
  3. Send several more requests to the same d.service from step 1 until a DNS resolve timeout occurs
  4. Enable/unblock DNS
  5. Send several requests for a different d.service to trigger a timeout purge of the domain backing the d.service from step 1
  6. Continue sending requests to the d.service from step 1 until panic

This sequence of lookups and purges looks like this (from the provided VTC log):

**** v1    0.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Lookup: 1645503346.958248 0.000000 0.000000
**** v1    0.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Results: 1645503346.958876 0.000628 0.000628
**** v1    0.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Lookup: 1645503346.959250 0.000000 0.000000
**** v1    0.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Results: 1645503346.959764 0.000514 0.000514
**** v1    0.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Update: 1645503346.959862 0.000613 0.000099
**** v1    0.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Update: 1645503346.959927 0.001679 0.001051
**** v1    2.2 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Lookup: 1645503348.470227 0.000000 0.000000
**** v1    7.3 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Results: 1645503353.475618 5.005390 5.005390
**** v1    7.3 vsl|          0 Error           - vmod-dynamic: vcl1 d1 _http._tcp.p-test-3.example.svc.cluster.local getdns 902 (All queries for the name timed out)
**** v1   15.8 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Lookup: 1645503362.042588 0.000000 0.000000
**** v1   15.8 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Results: 1645503362.043147 0.000558 0.000558
**** v1   15.8 vsl|          0 VCL_Log         - vmod-dynamic: vcl1 d1 p-test-3.example.svc.cluster.local.:8893 timeout
**** v1   15.8 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Done: 1645503362.043266 0.000000 0.000000
**** v1   15.8 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Lookup: 1645503362.043516 0.000000 0.000000
**** v1   15.8 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Results: 1645503362.043797 0.000281 0.000281
**** v1   15.8 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Update: 1645503362.043895 0.000379 0.000098
**** v1   15.8 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Update: 1645503362.043962 0.001374 0.000815
**** v1   17.3 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Lookup: 1645503363.554317 0.000000 0.000000
**** v1   17.3 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Results: 1645503363.554597 0.000280 0.000280
**** v1   17.3 vsl|          0 VCL_Log         - vmod-dynamic: vcl1 d1 p-test-3.example.svc.cluster.local.:8893 deleted
**** v1   17.3 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Update: 1645503363.554614 0.000297 0.000017

Note that the service _http._tcp.p-test-3.example.svc.cluster.local is still active and has not changed, but the domain it points to, p-test-3.example.svc.cluster.local, has timed out and was eventually deleted. Because the failed SRV resolve left the domain's last_used stale (components 1 and 2), the lookups for _http._tcp.p-test-2 at 15.8s were enough to trip the usage-timeout purge of the p-test-3 domain (component 3).

Solutions

Though untested, and not completely free of the possibility of data races, the following patch, based on 5c1c68b, should avoid the bulk of the race conditions:

diff --git a/src/vmod_dynamic_service.c b/src/vmod_dynamic_service.c
index 2140a2d..c77ca06 100644
--- a/src/vmod_dynamic_service.c
+++ b/src/vmod_dynamic_service.c
@@ -521,3 +521,6 @@ service_lookup_thread(void *priv)
                            res->name, ret, res->strerror(ret));
-                       srv->deadline = results + obj->ttl;
+                       if (obj->ttl < obj->domain_usage_tmo / 2)
+                               srv->deadline = results + obj->ttl;
+                       else
+                               srv->deadline = results + (obj->domain_usage_tmo / 2);
                        dbg_res_details(NULL, srv->obj, res, res_priv);
@@ -737,2 +740,5 @@ vmod_director_service(VRT_CTX, struct VPFX(dynamic_director) *obj,
        struct dynamic_service *srv;
+       const struct service_prios *prios;
+       const struct service_prio *p;
+       const struct service_target *t;
        CHECK_OBJ_NOTNULL(ctx, VRT_CTX_MAGIC);
@@ -750,2 +756,16 @@ vmod_director_service(VRT_CTX, struct VPFX(dynamic_director) *obj,
        srv->last_used = ctx->now;
+
+       VRMB();
+       prios = srv->prios;
+
+       if (prios != NULL) {
+               VTAILQ_FOREACH(p, &prios->head, list) {
+                       CHECK_OBJ_NOTNULL(p, SERVICE_PRIO_MAGIC);
+                       VTAILQ_FOREACH(t, &p->targets, list) {
+                               CHECK_OBJ_NOTNULL(t, SERVICE_TARGET_MAGIC);
+                               CHECK_OBJ_NOTNULL(t->dom, DYNAMIC_DOMAIN_MAGIC);
+                               t->dom->last_used = ctx->now;
+                       }
+               }
+       }
        Lck_Unlock(&obj->mtx);
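
In short: the first hunk caps the retry delay after a failed SRV resolve at half of domain_usage_tmo, so a large obj->ttl can no longer let a referenced domain's last_used fall behind the purge window; the second hunk refreshes last_used on every domain the service references each time the service director is used.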

The trivial solution has problems of its own, however. Instead, I've sent #82 introducing minimal reference-counting logic that blocks dynamic_domain timeout purges while the domain is referenced by a dynamic_service, thus tightly coupling their life cycles.
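
The shape of that reference counting, as a minimal sketch (names here are assumed; see #82 for the actual patch):

static void
service_target_ref_domain(struct service_target *t, struct dynamic_domain *dom)
{
        CHECK_OBJ_NOTNULL(t, SERVICE_TARGET_MAGIC);
        CHECK_OBJ_NOTNULL(dom, DYNAMIC_DOMAIN_MAGIC);
        dom->refcnt++;          /* caller holds obj->mtx */
        t->dom = dom;
}

/* ... and in the purge path, a referenced domain is skipped: */
        if (dom->refcnt > 0)
                continue;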

jake-dog commented 2 years ago

To whom it may concern,

I've updated the original issue description with simpler reproduction steps and a better explanation of the ways in which the issue can manifest.

I have also updated my proposed solution #82, which fixes several other behaviors associated with service backends.

nigoroll commented 2 years ago

Thank you, I've noticed this issue but need to prioritize paid work. I will get back to you.

nigoroll commented 1 year ago

I let this issue sit open for so long because I knew that we had a use-after-free due to how varnish-cache worked, which we could not safely avoid in this vmod alone. Now that varnish-cache, as of 7.3, has gained the proper infrastructure with reference-counted backends, I have implemented the necessary changes (with a lot of groundwork to build upon) as of https://github.com/nigoroll/libvmod-dynamic/commit/41f20790aedfc6d8f5cae4b3b8ffb82cd9b302c0. I cannot reproduce the issue with your test case (thank you!), so would you please give it a go? You need varnish 7.3 or later.
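
For context, the infrastructure referred to here is the reference-counted backend API in varnish-cache 7.3. A minimal sketch of the general usage pattern (dom->dir is a hypothetical field; this is not the vmod's actual code):

        VCL_BACKEND be = NULL;

        /* take a counted reference on a backend */
        VRT_Assign_Backend(&be, dom->dir);
        /* ... be cannot be retired while the reference is held ... */
        VRT_Assign_Backend(&be, NULL);  /* release the reference */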

jake-dog commented 1 year ago

I regret to report that I was able to reproduce the issue. Hopefully it's user error, but if it is, I haven't figured out what I did wrong.

I started a debian:11 container, installed varnish 7.3.0, compiled 41f20790aedfc6d8f5cae4b3b8ffb82cd9b302c0, and re-ran r81.vtc, which returned the following assertion error:

**** v1    CLI RX|Panic at: Sat, 01 Jul 2023 18:31:29 GMT
**** v1    CLI RX|Assert error in service_resolve(), vmod_dynamic_service.c line 159:
**** v1    CLI RX|  Condition((t->dom)->magic == 0x1bfe1345) not true.
**** v1    CLI RX|version = varnish-7.3.0 revision 84d79120b6d17b11819a663a93160743f293e63f, vrt api = 17.0
**** v1    CLI RX|ident = Linux,6.2.6-76060206-generic,x86_64,-junix,-sdefault,-sdefault,-hcritbit,epoll
**** v1    CLI RX|now = 65401.230598 (mono), 1688236289.146731 (real)
**** v1    CLI RX|Backtrace:
**** v1    CLI RX|  0x5618eee6cd9c: varnishd(+0x5ad9c) [0x5618eee6cd9c]
**** v1    CLI RX|  0x5618eeeea578: varnishd(VAS_Fail+0x18) [0x5618eeeea578]
**** v1    CLI RX|  0x7f64303b53b9: ./vmod_cache/_vmod_dynamic.99f589bd7d2f24a756baa78d3445c179c9428e09634fc67b82af176f80022348(+0x93b9) [0x7f64303b53b9]
**** v1    CLI RX|  0x5618eee4da93: varnishd(VRT_DirectorResolve+0x43) [0x5618eee4da93]
**** v1    CLI RX|  0x5618eee4db8e: varnishd(+0x3bb8e) [0x5618eee4db8e]
**** v1    CLI RX|  0x5618eee4dcf0: varnishd(VDI_GetHdr+0x60) [0x5618eee4dcf0]
**** v1    CLI RX|  0x5618eee59693: varnishd(+0x47693) [0x5618eee59693]
**** v1    CLI RX|  0x5618eee588d6: varnishd(+0x468d6) [0x5618eee588d6]
**** v1    CLI RX|  0x5618eee94322: varnishd(+0x82322) [0x5618eee94322]
**** v1    CLI RX|  0x5618eee948d4: varnishd(+0x828d4) [0x5618eee948d4]

Looking through the test log, I see that the early purge of the domain p-test-3.example.svc.cluster.local, noted in the issue summary, still occurs:

**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Lookup: 1688236272.014368 0.000000 0.000000
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Results: 1688236272.014684 0.000317 0.000317
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Lookup: 1688236272.014757 0.000000 0.000000
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Results: 1688236272.014947 0.000190 0.000190
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Update: 1688236272.015109 0.000352 0.000162
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Update: 1688236272.015133 0.000766 0.000449
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Lookup: 1688236273.525239 0.000000 0.000000
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Results: 1688236278.550587 5.025348 5.025348
**** v1    vsl|          0 Error           - vmod-dynamic: vcl1 d1 _http._tcp.p-test-3.example.svc.cluster.local getdns 902 (All queries for the name timed out)
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Lookup: 1688236287.040717 0.000000 0.000000
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Results: 1688236287.040961 0.000245 0.000245
**** v1    vsl|          0 VCL_Log         - vmod-dynamic: vcl1 d1 p-test-3.example.svc.cluster.local.:8893 timeout
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Done: 1688236287.041001 0.000000 0.000000
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Lookup: 1688236287.041046 0.000000 0.000000
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Results: 1688236287.041254 0.000208 0.000208
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Update: 1688236287.041366 0.000321 0.000113
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Update: 1688236287.041388 0.000672 0.000427
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Lookup: 1688236288.551718 0.000000 0.000000
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Results: 1688236288.552140 0.000422 0.000422
**** v1    vsl|          0 VCL_Log         - vmod-dynamic: vcl1 d1 p-test-3.example.svc.cluster.local.:8893 deleted
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Update: 1688236288.552416 0.000698 0.000276

I am still reviewing the code changes in 41f2079 and varnish-cache 7.3 so that I might better understand where things went wrong.

nigoroll commented 1 year ago

@jake-dog thank you very much, I am going to stare at it again.

nigoroll commented 1 year ago

See the issue linked above; it appears to me that reference counting cannot work yet.

nigoroll commented 1 year ago

... but this issue is still independent: services need to take references on domains within vmod_dynamic. I'll have a look.

nigoroll commented 1 year ago

@jake-dog can you please try dc08a31678f5e9ce0f5f2469e6b9bcdc728e1077 and reopen this ticket if it does not fix the issue? Thank you.

jake-dog commented 1 year ago

It seems that dc08a31 resolves the use-after-free issue, but it introduces some unexpected behaviors I attempted to address in #82.

I believe those outstanding issues are better suited to new issues, however. Great work, and thank you!

nigoroll commented 1 year ago

@jake-dog yes, it took me a detour to realize it in #102, but the lifetime issue should be fixed for real now. Sorry for noticing your comment too late.

I will look at the TTL now.

nigoroll commented 1 year ago

2d51da74680e849c8411bcd6d85cd53c149b3052

jake-dog commented 1 year ago

I haven't had much time to test it, but so far 2d51da7 looks great.

The only oddity I see is that, when running r81.vtc, it logs that it deleted p-test-3 but continues to use it:

**** v1    vsl|          0 VCL_Log         - vmod-dynamic vcl1 d1 p-test-3.example.svc.cluster.local.:8893 deleted (expired)
**** v1    vsl|       1007 FetchError      b backend d1.p-test-3.example.svc.cluster.local.(127.0.0.1:8893): fail errno 111 (Connection refused)
**** v1    vsl|       1006 ReqMethod       c GET
**** v1    vsl|       1006 ReqURL          c /
**** v1    vsl|       1006 RespStatus      c 503

To be clear: this is the desired behavior for backends; the logging is just misleading.

I will be running more tests in the next few weeks, and will certainly create an issue if I discover anything more nefarious.