varnishcache / varnish-cache

Varnish Cache source code repository
https://www.varnish-cache.org

Assert in HSH_Lookup with Varnish 7.2.1 #3879

Open jocel1 opened 1 year ago

jocel1 commented 1 year ago

Hi,

I'm using Varnish with a malloc storage backend, plus a file storage backend for static files. The settings are as follows:

ExecStart=/usr/sbin/varnishd -a :80 -p workspace_session=1024 -p workspace_thread=4096 -p workspace_client=131072 -p workspace_backend=131072 -p listen_depth=16383 -p http_resp_hdr_len=65536 -p http_resp_size=98304 -p workspace_backend=131072 -p thread_pool_min=200 -p thread_pool_max=5000 -p feature=+http2 -P %t/%N/varnishd.pid -T localhost:6082 -f /etc/varnish/default.vcl -s malloc,12288m -s static=file,/tmp/varnish_storage.bin,100G
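
In case it's useful, I keep an eye on both storages with the standard varnishstat counters (SMA for malloc, SMF for file):

# One-shot dump of the malloc (SMA) and file (SMF) storage counters
varnishstat -1 -f 'SMA.*' -f 'SMF.*'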

Expected Behavior

No crash

Current Behavior

I noticed Varnish complaining in the systemctl log:

Active: active (running) since lun. 2022-12-05 20:39:57 CET; 1 day 18h ago

déc. 06 21:25:33 varnishd[2813]: Child (8130) Started
déc. 06 21:25:33 varnishd[2813]: Child (8130) said Child starts
déc. 06 21:25:33 varnishd[2813]: Child (8130) said <jemalloc>: Error in munmap(): Invalid argument
déc. 06 21:25:33 varnishd[2813]: Child (8130) said SMF.static mmap'ed 107374182400 bytes of 107374182400
déc. 07 07:59:26 varnishd[2813]: Child (8130) died signal=6
déc. 07 07:59:26 varnishd[2813]: Child (8130) Panic at: Wed, 07 Dec 2022 06:59:26 GMT
Assert error in HSH_Lookup(), cache/cache_hash.c line 426: Condition((oc)->magic == 0x4d301302) not true....

So the first warning appeared about a day after the server started, and the assert followed a few hours later.

This is happening on an ARM architecture (AWS t4g servers) running Amazon Linux 2 with kernel 5.15.75-48.135. I'm using 4 vmods: directors, header, std & xkey. I'm fetching the binary from https://packagecloud.io/install/repositories/varnishcache/varnish72/script.rpm.sh and https://github.com/varnish/varnish-modules/releases/download/0.21.0/varnish-modules-0.21.0.tar.gz

My vcl_hash is not that complicated:

sub vcl_hash {
  # Called after vcl_recv to create a hash value for the request. This is used as a key
  # to look up the object in Varnish.

  # Some generic URL manipulation, useful for all templates that follow
  # First remove the Google Analytics added parameters, useless for our backend
  if (req.url ~ "(\?|&)(mc_[a-z]+|utm_[a-z]+|eff_[a-z0-9]+|fbclid|gclid|cx|ie|cof|siteurl|ranMID|ranEAID|ranSiteID)=") {
    set req.http.hash_url = regsuball(req.url, "(\?|&)(mc_[a-z]+|utm_[a-z]+|eff_[a-z0-9]+|fbclid|gclid|cx|ie|cof|siteurl|ranMID|ranEAID|ranSiteID)=[^&]+", "\1");
    set req.http.hash_url = regsub(req.http.hash_url, "&+$", "");
    set req.http.hash_url = regsub(req.http.hash_url, "&+", "&");
    set req.http.hash_url = regsub(req.http.hash_url, "\?&", "?");
    set req.http.hash_url = regsub(req.http.hash_url, "\?$", "");
  } else {
    set req.http.hash_url = req.url;
  }

  hash_data(req.http.hash_url);

  if (req.http.host) {
    hash_data(req.http.host);
  } else {
    hash_data(server.ip);
  }

  # Keep separate caches for https and http to avoid an infinite loop
  if (req.http.X-Forwarded-Proto) {
     hash_data(req.http.X-Forwarded-Proto);
  }

  if (req.http.X-Requested-With) {
     hash_data(req.http.X-Requested-With);
  }

  return (lookup);
}
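
As a sanity check (not related to the crash itself), I can watch the normalized key at runtime with something like:

# Log the incoming URL next to the computed hash_url header
varnishlog -g request -i ReqURL -I 'ReqHeader:hash_url'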

Any idea what could be wrong?

Best regards, Jocelyn Fournier

dridi commented 1 year ago

No idea what's going on here, but file storage seems to be highly unreliable these days on aarch64 servers.

Could you try replacing your file storage with a 100GB malloc storage instead? Why did you add another 12GB malloc storage?

And how is /tmp mounted on this system?
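
For example, something like this would tell us:

# Show which filesystem /tmp lives on, and its mount options
findmnt -T /tmp
df -h /tmp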

jocel1 commented 1 year ago

Unfortunately I "only" have 16GB of RAM on this machine. The idea was to split the data in two: a malloc storage for the regular content, and the file storage for the (much larger) static files.

/tmp is not in tmpfs, it's just on the standard disk partition.

jocel1 commented 1 year ago

Well, I'm now running without the file storage, using -s malloc,12288m -s static=malloc,1024m

However, it seems I still have random panic issues:

déc. 07 20:13:03 varnishd[2789]: Child launched OK
déc. 07 20:13:03 varnishd[2790]: Child (2802) said Child starts
déc. 07 20:13:03 varnishd[2790]: Child (2802) said <jemalloc>: Error in munmap(): Invalid argument
déc. 07 20:13:03 systemd[1]: Started Varnish Cache, a high-performance HTTP accelerator.
déc. 07 20:52:10 varnishd[2790]: Child (2802) died signal=6
déc. 07 20:52:10 varnishd[2790]: Child (2802) Panic at: Wed, 07 Dec 2022 19:52:10 GMT
                                                                             Wrong turn at cache/cache_main.c:323:
                                                                             Signal 11 (Segmentation fault) received at 0xffff92ffe000 si_code 2...
déc. 07 20:52:10 varnishd[2790]: Child cleanup complete
déc. 07 20:52:10 varnishd[2790]: Child (3704) Started
déc. 07 20:52:10 varnishd[2790]: Child (3704) said Child starts
déc. 07 20:52:10 varnishd[2790]: Child (3704) said <jemalloc>: Error in munmap(): Invalid argument

jocel1 commented 1 year ago

Extra backtrace info:

Child (2802) Panic at: Wed, 07 Dec 2022 19:52:10 GMT
Wrong turn at cache/cache_main.c:323:
Signal 11 (Segmentation fault) received at 0xffff92ffe000 si_code 2
version = varnish-7.2.1 revision 399fa7ed46d68f4c4f42a8fd2748339750d99a8b, vrt api = 16.0
ident = Linux,5.15.75-48.135.amzn2.aarch64,aarch64,-junix,-smalloc,-smalloc,-sdefault,-hcritbit,epoll
now = 2399.653874 (mono), 1670442730.667755 (real)
Backtrace:
  0x449230: /usr/sbin/varnishd() [0x449230]
  0x4c41dc: /usr/sbin/varnishd(VAS_Fail+0x58) [0x4c41dc]
  0x443d48: /usr/sbin/varnishd() [0x443d48]
  0xffffa67df820: linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xffffa67df820]
  0xffffa6472d88: /lib64/libc.so.6(+0x85d88) [0xffffa6472d88]
  0x47fd3c: /usr/sbin/varnishd() [0x47fd3c]
  0x480494: /usr/sbin/varnishd() [0x480494]
  0x437150: /usr/sbin/varnishd(VFP_Suck+0xa4) [0x437150]
  0x432ca4: /usr/sbin/varnishd() [0x432ca4]
  0x435b60: /usr/sbin/varnishd() [0x435b60]
argv = {
  [0] = \"/usr/sbin/varnishd\",
  [1] = \"-a\",
  [2] = \":80\",
  [3] = \"-p\",
  [4] = \"workspace_session=1024\",
  [5] = \"-p\",
  [6] = \"workspace_thread=4096\",
  [7] = \"-p\",
  [8] = \"workspace_client=131072\",
  [9] = \"-p\",
  [10] = \"workspace_backend=131072\",
  [11] = \"-p\",
  [12] = \"listen_depth=16383\",
  [13] = \"-p\",
  [14] = \"http_resp_hdr_len=65536\",
  [15] = \"-p\",
  [16] = \"http_resp_size=98304\",
  [17] = \"-p\",
  [18] = \"workspace_backend=131072\",
  [19] = \"-p\",
  [20] = \"thread_pool_min=200\",
  [21] = \"-p\",
  [22] = \"thread_pool_max=5000\",
  [23] = \"-p\",
  [24] = \"feature=+http2\",
  [25] = \"-P\",
  [26] = \"/run/varnish/varnishd.pid\",
  [27] = \"-T\",
  [28] = \"localhost:6082\",
  [29] = \"-f\",
  [30] = \"/etc/varnish/default.vcl\",
  [31] = \"-s\",
  [32] = \"malloc,12288m\",
  [33] = \"-s\",
  [34] = \"static=malloc,1024m\",
}
pthread.self = 0xffff974671b0
pthread.name = (cache-worker)
pthread.attr = {
  guard = 4096,
  stack_bottom = 0xffff97448000,
  stack_top = 0xffff97468000,
  stack_size = 131072,
}
thr.req = NULL
thr.busyobj = 0xffff16fb0020 {
  end = 0xffff16fd0000,
  retries = 0,
  sess = 0xffff33ffc420 {
    fd = 88, vxid = 2689893,
    t_open = 1670442707.678030,
    t_idle = 1670442730.614282,
    ws = 0xffff33ffc460 {
      id = \"ses\",
      {s, f, r, e} = {0xffff33ffc4b8, +96, (nil), +832},
    },
    transport = HTTP/1 {
      state = HTTP1::Proc
    }
    client = x 50916 :80,
    local.endpoint = :80,
    local.socket = a0,
    local.ip = x:80,
    remote.ip = x:50916,
    server.ip = x:80,
    client.ip = x:50916,
  },
  worker = 0xffff974666c8 {
    ws = 0xffff97466750 {
      id = \"wrk\",
      {s, f, r, e} = {0xffff974655d0, +0, (nil), +4088},
    },
    VCL::method = BACKEND_RESPONSE,
    VCL::methods = {BACKEND_FETCH, BACKEND_RESPONSE},
  },
  vfc = 0xffff16fb4f38 {
    failed = 0,
    req = 0xffff16fb0648,
    resp = 0xffff16fb0ac0,
    wrk = 0xffff974666c8,
    oc = 0xffff6b24c340,
    filters = {
      V1F_STRAIGHT = 0xffff16fcd4b0 {
        priv1 = 0xffff16fb5448,
        priv2 = 117346,
        closed = 0
      },
    },
    obj_flags = 0x0,
  },
  filter_list = \"\",
  ws = 0xffff16fb0060 {
    id = \"bo\",
    {s, f, r, e} = {0xffff16fb4f80, +99984, (nil), +110712},
  },
  ws_bo = 0xffff16fb5430,
  http[bereq] = 0xffff16fb0648 {
    ws = 0xffff16fb0060 {
      [Already dumped, see above]
    },
    hdrs {
      \"GET\",
      \"/2436994-large_default/vintage-multicolor-floral-murano-glass-sommerso-vase-italy-1970s.jpg\",
      \"HTTP/1.1\",
      \"X-Forwarded-Proto: https\",
      \"X-Forwarded-Port: 443\",
      \"X-Amzn-Trace-Id: Root=1-6390eeea-3f4ea97408e44db20ed02120\",
      \"x-default-isocode: US\",
      \"x-default-language: en\",
      \"cf-ray: 775fccd8297cacf8-ATL\",
      \"cf-visitor: {\"scheme\":\"https\"}\",
      \"accept: image/webp,image/png,image/svg+xml,image/*;q=0.8,video/*;q=0.8,*/*;q=0.5\",
      \"user-agent: Mozilla/5.0 (iPhone; CPU iPhone OS 15_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) GSA/238.1.487893381 Mobile/15E148 Safari/604.1\",
      \"accept-language: en-US,en;q=0.9\",
      \"referer: https://www.design-market.us/248098-vintage-multicolor-floral-murano-glass-sommerso-vase-italy-1970s.html?redirected=true\",
      \"cf-connecting-ip: x\",
      \"cf-pseudo-ipv4: x\",
      \"cf-ipcountry: US\",
      \"cdn-loop: cloudflare\",
      \"X-Forwarded-For: x, x, x\",
      \"Via: 1.1 ip-x.eu-west-1.compute.internal (Varnish/7.2)\",
      \"host: www.design-market.us\",
      \"X-ETag: noxetag\",
      \"Accept-Encoding: gzip\",
      \"hash_url: /2436994-large_default/vintage-multicolor-floral-murano-glass-sommerso-vase-italy-1970s.jpg\",
      \"X-Varnish: 1249115\",
    },
  },
  http[beresp] = 0xffff16fb0ac0 {
    ws = 0xffff16fb0060 {
      [Already dumped, see above]
    },
    hdrs {
      \"HTTP/1.1\",
      \"200\",
      \"OK\",
      \"Server: nginx/1.22.0\",
      \"Date: Wed, 07 Dec 2022 19:52:10 GMT\",
      \"Content-Type: image/jpeg\",
      \"Content-Length: 117346\",
      \"Connection: keep-alive\",
      \"Strict-Transport-Security: max-age=15552001; includeSubDomains;\",
      \"Last-Modified: Wed, 12 Oct 2022 18:45:39 GMT\",
      \"Accept-Ranges: bytes\",
      \"Expires: Fri, 06 Jan 2023 19:52:10 GMT\",
      \"X-Content-Type-Options: nosniff\",
      \"X-XSS-Protection: 1; mode=block\",
      \"X-Frame-Options: sameorigin\",
      \"x-storage: static\",
      \"cache-control: public, max-age=2592000\",
      \"static-url: /2436994-large_default/vintage-multicolor-floral-murano-glass-sommerso-vase-italy-1970s.jpg\",
      \"X-UA-Device: \",
      \"X-Device-Type: \",
      \"X-Country-Code: \",
      \"X-ETag: noxetag\",
      \"Accept-Language: en-US,en;q=0.9\",
    },
  },
  objcore[fetch] = 0xffff6b24c340 {
    refcnt = 3,
    flags = {},
    exp_flags = {refd},
    boc = 0xffff4058cf40 {
      refcnt = 3,
      state = stream,
      vary = (nil),
      stevedore_priv = (nil),
    },
    exp = {1670442730.667573, 2592000.000000, 120.000000, 0.000000},
    objhead = 0xffff1205b060,
    stevedore = 0xffffa5bf0160 (malloc static) {
      Simple = 0xffff15d61f00,
      Obj = 0xffff5d156918 {priv=0xffff5d156910, ptr=0xffff15d61f00, len=744, space=744},
      LEN = 0x0...0,
      VXID = 0x00130f5b,
      FLAGS = 0x08,
      GZIPBITS = 0x0...0,
      LASTMODIFIED = 0x41d8d1c2d4c00000,
      VARY = {len=0, ptr=(nil)},
      HEADERS = {len=624, ptr=0xffff15d61f78},
      Body = 0xffff653920e8 {priv=0xffff653920e0, ptr=0xffff92ff0000, len=0, space=58673},
    },
  },
  http_conn = 0xffff16fb5448 {
    fd = 120 (@0xffff46837944),
    doclose = null(Not Closing)
    ws = 0xffff16fb0060 {
      [Already dumped, see above]
    },
    {rxbuf_b, rxbuf_e} = {0xffff16fb54b0, 0xffff16fb5807},
    {pipeline_b, pipeline_e} = {0xffff16fb5807, 0xffff16fcd4b0},
    content_length = 117346,
    body_status = length,
    first_byte_timeout = 300.000000,
    between_bytes_timeout = 300.000000,
  },
  flags = {do_stream},
  director_req = 0xffffa5a40af0 {
    cli_name = boot.front,
    admin_health = probe, changed = 1670440383.896565,
    type = round-robin {
    },
  },
  director_resp = 0xffffa5a40ac0 {
    cli_name = boot.nlb_ip1,
    admin_health = healthy, changed = 1670440383.895000,
    type = backend {
      conn_pool = 0xffffa5af0600 {
        ident = 0xfd480fbdd018b14c3223a30af2cadc2e300897dfdf1263c3d76a6b7fd3d2069c,
        vrt_endpoint = 0xffffa5a62ea0 {
          ipv4 = 127.0.0.1, port = 8080,
        },
      },
      hosthdr = localhost,
      n_conn = 21,
    },
  },
  vcl[vcl] = 0xffffa5af0380 {
    name = \"boot\",
    busy = 103,
    discard = 0,
    state = auto,
    temp = warm,
    conf = 0xffff95b7e650 {
      syntax = \"40\",
      srcname = {
        [0] = \"/etc/varnish/default.vcl\",
        [1] = \"/etc/varnish/backend.vcl\",
        [2] = \"/etc/varnish/devicedetect.vcl\",
        [3] = \"<builtin>\",
      },
      instances = {
        \"front\" = 0xffffa5a72e00,
      },
    },
  },
  vpi = 0xffff974655c0 {
    handling (VCL::return) = 0x0 (none),
    ref = 161,
    vpi_ref = 0xffff95b7dc90 {
      source = 0 (\"/etc/varnish/default.vcl\"),
      offset = 21167,
      line = 525,
      pos = 3,
      src = \"set beresp.http.Accept-Language = bereq.[...]\"
    },
  },
},
thr.worker = 0xffff974666c8 {
  [Already dumped, see above]
},
vmods = {
  std = {0xffffa5be01d0, Varnish 7.2.1 399fa7ed46d68f4c4f42a8fd2748339750d99a8b, 0.0},
  directors = {0xffffa5be0240, Varnish 7.2.1 399fa7ed46d68f4c4f42a8fd2748339750d99a8b, 0.0},
  header = {0xffffa5be02b0, Varnish 7.2.1 399fa7ed46d68f4c4f42a8fd2748339750d99a8b, 16.0},
  xkey = {0xffffa5be0320, Varnish 7.2.1 399fa7ed46d68f4c4f42a8fd2748339750d99a8b, 0.0},
},
pools = {
  pool = 0xffff9de40100 {
    nidle = 173,
    nthr = 200,
    lqueue = 0
  },
  pool = 0xffff9de40740 {
    nidle = 176,
    nthr = 200,
    lqueue = 0
  },
},
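
If it helps, the anonymous frames should be resolvable against the unstripped binary with addr2line, e.g.:

# Resolve two of the unnamed frames from the backtrace above
addr2line -f -e /usr/sbin/varnishd 0x47fd3c 0x480494
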
jocel1 commented 1 year ago

FYI I've just upgraded the jemalloc lib (from 3.6.0 to 5.3.0). I'm monitoring whether it has any impact on this issue (it should at least fix the <jemalloc>: Error in munmap(): Invalid argument error).
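
Quick way to confirm which jemalloc the daemon actually loads:

# Check the jemalloc shared object varnishd is linked against
ldd /usr/sbin/varnishd | grep jemalloc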

dridi commented 1 year ago

I would also be interested to learn how it fares without jemalloc at all, see #3881.

jocel1 commented 1 year ago

Thx, I'll also try without jemalloc at all!

jocel1 commented 1 year ago

FYI Varnish has been running with the original -s malloc,12288m -s static=file,/tmp/varnish_storage.bin,100G and jemalloc 5.3.0 without any issue so far!

dridi commented 1 year ago

Could you please try another build with jemalloc 3.6?

./configure --with-jemalloc --with-unwind --enable-debugging-symbols

And then run varnishd with this jemalloc configuration in your environment:

export MALLOC_CONF="abort:true,redzone:true,junk:true"
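
If varnishd runs under systemd, the same setting can go in a drop-in (assuming the unit is named varnish.service):

# systemctl edit varnish.service, then add:
[Service]
Environment=MALLOC_CONF=abort:true,redzone:true,junk:true
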
jocel1 commented 1 year ago

Hi @Dridi

Just to be sure, release 7.2.1 should be fine for this test, no need to use develop?

jocel1 commented 1 year ago

Hi @Dridi

FYI it's not possible to start Varnish with jemalloc 3.6 & the debug config on ARM because of the munmap issue:

déc. 25 21:42:30 ip-172-29-97-79.eu-west-1.compute.internal varnishd[22282]: Version: varnish-7.2.1 revision NOGIT
déc. 25 21:42:30 ip-172-29-97-79.eu-west-1.compute.internal varnishd[22282]: Platform: Linux,5.10.157-139.675.amzn2.aarch64,aarch64,-junix,-smalloc,-sfile,-sdefault,-hcritbit
déc. 25 21:42:30 ip-172-29-97-79.eu-west-1.compute.internal varnishd[22282]: Child (22295) Started
déc. 25 21:42:30 ip-172-29-97-79.eu-west-1.compute.internal varnishd[22282]: Child (-1) said Child starts
déc. 25 21:42:30 ip-172-29-97-79.eu-west-1.compute.internal varnishd[22282]: Child (-1) said <jemalloc>: Error in munmap(): Invalid argument
déc. 25 21:42:30 ip-172-29-97-79.eu-west-1.compute.internal varnishd[22282]: Child failed on launch

With jemalloc 5.3.0 it starts properly (but there are other issues after a while: instability / high CPU usage).

jocel1 commented 1 year ago

Hi @Dridi

I'm now evaluating the 7.2.1 version compiled without jemalloc (+ varnish-modules) on an ARM AWS Graviton2 processor (compiled with gcc 10 and -mcpu=neoverse-n1 flags). I'll let you know if it shows any instability.
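
Roughly like this, with the patch from #3881 applied (adjust CC to however gcc 10 is named on your system):

# Build without jemalloc, with debug symbols, targeting Graviton2
./configure --without-jemalloc --enable-debugging-symbols CC=gcc10 CFLAGS="-O2 -g -mcpu=neoverse-n1"
make -j$(nproc) && sudo make install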

jocel1 commented 1 year ago

Hi @Dridi

FYI 7.2.1 without jemalloc, using the patch from #3881, seems to work pretty well on AWS ARM. I see a few spikes in CPU usage during aggressive Google crawls that I didn't have before (on x86_64 instances), but response times are good & stable, so it's not really an issue (here it's an x2gd.large with 2 cores and 32GB of RAM).

jocel1 commented 1 year ago

I confirm that after more than a week of use, everything is still really stable on ARM without jemalloc.