varnishcache / varnish-cache

Varnish Cache source code repository
https://www.varnish-cache.org
Other
3.63k stars 372 forks source link

NULL OA_VARY Attribute - Assert error in VRY_Match(), cache/cache_vary.c line 294: Condition((vary) != 0) not true. #2319

Closed obliadp closed 6 years ago

obliadp commented 7 years ago

Expected Behavior

Current Behavior

Occassionaly my varnishes empty themselves and this assert is logged:

Panic at: Fri, 28 Apr 2017 10:24:26 GMT
Assert error in VRY_Match(), cache/cache_vary.c line 294:
  Condition((vary) != 0) not true.
version = varnish-5.1.2 revision 6ece695, vrt api = 6.0
ident = Linux,3.10.0-514.6.2.el7.x86_64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
now = 1993938.449226 (mono), 1493375065.668808 (real)
Backtrace:
  0x439787: /usr/sbin/varnishd() [0x439787]
  0x445494: /usr/sbin/varnishd(VRY_Match+0x2e4) [0x445494]
  0x42edd1: /usr/sbin/varnishd(HSH_Lookup+0x371) [0x42edd1]
  0x43cbbf: /usr/sbin/varnishd() [0x43cbbf]
  0x43df81: /usr/sbin/varnishd(CNT_Request+0x741) [0x43df81]
  0x45be47: /usr/sbin/varnishd() [0x45be47]
  0x453059: /usr/sbin/varnishd() [0x453059]
  0x4534db: /usr/sbin/varnishd() [0x4534db]
  0x7fa5a220bdc5: /lib64/libpthread.so.0(+0x7dc5) [0x7fa5a220bdc5]
  0x7fa5a1f3a73d: /lib64/libc.so.6(clone+0x6d) [0x7fa5a1f3a73d]
thread = (cache-worker)
thr.req = 0x7fa533806020 {
  vxid = 7409533, transport = HTTP/1 {
    state = HTTP1::Proc
  }
  step = R_STP_LOOKUP,
  req_body = R_BODY_NONE,
  restarts = 0, esi_level = 0,
  sp = 0x7fa46a800020 {
    fd = 91, vxid = 9128606,
    t_open = 1493375012.911498,
    t_idle = 1493375054.675609,
    transport = HTTP/1 {
      state = HTTP1::Proc
    }
    client = xxx.xxx.xxx.xxx 57212,
    privs = 0x7fa46a800088 {
    },
  },
  worker = 0x7fa591a1cc30 {
    stack = {0x7fa591a1d000 -> 0x7fa591a11000},
    ws = 0x7fa591a1ce30 {
      id = \"wrk\",
      {s, f, r, e} = {0x7fa591a1c3d0, +0, (nil), +2040},
    },
    VCL::method = BACKEND_ERROR,
    VCL::return = deliver,
    VCL::methods = {},
  },
  ws = 0x7fa533806208 {
    id = \"req\",
    {s, f, r, e} = {0x7fa533808008, +832, +1040368, +1040368},
  },
  http_conn = 0x7fa533806130 {
    fd = 91 (@0x7fa46a800038),
    doclose = NULL,
    ws = 0x7fa533806208 {
      [Already dumped, see above]
    },
    {rxbuf_b, rxbuf_e} = {0x7fa533808008, 0x7fa5338082e8},
    {pipeline_b, pipeline_e} = {(nil), (nil)},
    content_length = -1,
    body_status = none,
    first_byte_timeout = 0.000000,
    between_bytes_timeout = 0.000000,
  },
  http[req] = 0x7fa5338062a0 {
    ws = 0x7fa533806208 {
      [Already dumped, see above]
    },
    hdrs {
      \"GET\",
      \"/[xxx]",
      \"HTTP/1.1\",
      \"Host: [xxx]\",
      \"Accept-Encoding: gzip\",
      \"CF-IPCountry: NO\",
      \"CF-RAY: [xxx]\",
      \"X-Forwarded-Proto: http\",
      \"CF-Visitor: {\"scheme\":\"http\"}\",
      \"Upgrade-Insecure-Requests: 1\",
      \"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\",
      \"User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1\",
      \"Referer: https://www.google.no/\",
      \"DNT: 1\",
      \"Accept-Language: nb-no\",
      \"CF-Connecting-IP: xxx.xxx.xxx.xxx\",
      \"X-Cloud-Trace-Context: [xxx]\",
      \"Via: 1.1 google\",
      \"Connection: Keep-Alive\",
      \"X-Forwarded-For: xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx, xxx.xxx.xxx.xxx\",
    },
  },
  vcl = {
    name = \"boot\",
    busy = 142,
    discard = 0,
    state = auto,
    temp = warm,
    conf = {
      srcname = {
        \"/etc/varnish/default.vcl\",
        \"Builtin\",
        \"/etc/varnish/backends/xxx.vcl\",
        \"/etc/varnish/backends/xxx.vcl\",
        \"/etc/varnish/backends/xxx.vcl\",
        \"/etc/varnish/backends/xxx.vcl\",
        \"/etc/varnish/backends/xxx.vcl\",
        \"/etc/varnish/backends/xxx.vcl\",
        \"/etc/varnish/extensions/xxx.vcl\",
        \"/etc/varnish/handlers/xxx.vcl\",
        \"/etc/varnish/handlers/xxx.vcl\",
        \"/etc/varnish/handlers/xxx.vcl\",
        \"/etc/varnish/handlers/xxx.vcl\",
        \"/etc/varnish/handlers/xxx.vcl\",
        \"/etc/varnish/handlers/xxx.vcl\",
        \"/etc/varnish/handlers.vcl\",
        \"/etc/varnish/errors.vcl\",
        \"/etc/varnish/backend_response.vcl\",
        \"/etc/varnish/deliver.vcl\",
      },
    },
  },
  vmods = {
    directors = {Varnish 5.1.2 6ece695, 6.0},
    std = {Varnish 5.1.2 6ece695, 6.0},
  },
  flags = {
  },
},
thr.busyobj = (nil) {
}

This might be related to the old https://varnish-cache.org/trac/ticket/1304 , but I've tried increasing all workspaces (except thread) to rather alarming sizes:

accept_filter              -
acceptor_sleep_decay       0.9 (default)
acceptor_sleep_incr        0.000 [seconds] (default)
acceptor_sleep_max         0.050 [seconds] (default)
auto_restart               on [bool] (default)
backend_idle_timeout       60.000 [seconds] (default)
ban_cutoff                 0 [bans] (default)
ban_dups                   on [bool] (default)
ban_lurker_age             60.000 [seconds] (default)
ban_lurker_batch           1000 (default)
ban_lurker_holdoff         0.010 [seconds] (default)
ban_lurker_sleep           0.010 [seconds] (default)
between_bytes_timeout      240.000 [seconds]
cc_command                 exec gcc -std=gnu99 -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic -Wall -Werror -Wno-error=unused-result -pthread -fpic -shared -Wl,-x -o %o %s (default)
cli_buffer                 8k [bytes] (default)
cli_limit                  1M [bytes]
cli_timeout                60.000 [seconds] (default)
clock_skew                 10 [seconds] (default)
clock_step                 1.000 [seconds] (default)
connect_timeout            7.000 [seconds]
critbit_cooloff            180.000 [seconds] (default)
debug                      none (default)
default_grace              10.000 [seconds] (default)
default_keep               0.000 [seconds] (default)
default_ttl                120.000 [seconds] (default)
feature                    none (default)
fetch_chunksize            16k [bytes] (default)
fetch_maxchunksize         0.25G [bytes] (default)
first_byte_timeout         240.000 [seconds]
gzip_buffer                32k [bytes] (default)
gzip_level                 6 (default)
gzip_memlevel              8 (default)
h2_rx_window_increment     1M [bytes] (default)
h2_rx_window_low_water     10M [bytes] (default)
http_gzip_support          on [bool] (default)
http_max_hdr               64 [header lines] (default)
http_range_support         on [bool] (default)
http_req_hdr_len           64k [bytes]
http_req_size              32k [bytes] (default)
http_resp_hdr_len          64k [bytes]
http_resp_size             32k [bytes] (default)
idle_send_timeout          60.000 [seconds] (default)
listen_depth               4096 [connections]
lru_interval               2.000 [seconds] (default)
max_esi_depth              5 [levels] (default)
max_restarts               4 [restarts] (default)
max_retries                4 [retries] (default)
nuke_limit                 1000 [allocations]
pcre_match_limit           10000 (default)
pcre_match_limit_recursion 20 (default)
ping_interval              3 [seconds] (default)
pipe_timeout               3600.000 [seconds]
pool_req                   10,100,10 (default)
pool_sess                  10,100,10 (default)
pool_vbo                   10,100,10 (default)
prefer_ipv6                off [bool] (default)
rush_exponent              3 [requests per request] (default)
send_timeout               600.000 [seconds] (default)
shm_reclen                 255b [bytes] (default)
shortlived                 10.000 [seconds] (default)
sigsegv_handler            on [bool] (default)
syslog_cli_traffic         off [bool]
tcp_fastopen               off [bool] (default)
tcp_keepalive_intvl        75.000 [seconds] (default)
tcp_keepalive_probes       9 [probes] (default)
tcp_keepalive_time         7200.000 [seconds] (default)
thread_pool_add_delay      0.000 [seconds] (default)
thread_pool_destroy_delay  1.000 [seconds] (default)
thread_pool_fail_delay     0.200 [seconds] (default)
thread_pool_max            4000 [threads]
thread_pool_min            100 [threads] (default)
thread_pool_reserve        0 [threads] (default)
thread_pool_stack          48k [bytes] (default)
thread_pool_timeout        300.000 [seconds] (default)
thread_pools               2 [pools] (default)
thread_queue_limit         20 (default)
thread_stats_rate          10 [requests] (default)
timeout_idle               5.000 [seconds] (default)
timeout_linger             0.150 [seconds]
vcc_allow_inline_c         off [bool] (default)
vcc_err_unref              off [bool]
vcc_unsafe_path            on [bool] (default)
vcl_cooldown               600.000 [seconds] (default)
vcl_dir                    /etc/varnish:/usr/share/varnish/vcl (default)
vcl_path                   /etc/varnish:/usr/share/varnish/vcl (default)
vmod_dir                   /usr/lib64/varnish/vmods (default)
vmod_path                  /usr/lib64/varnish/vmods (default)
vsl_buffer                 4k [bytes] (default)
vsl_mask                   -VCL_trace,-WorkThread,-Hash,-VfpAcct (default)
vsl_reclen                 255b [bytes] (default)
vsl_space                  80M [bytes] (default)
vsm_free_cooldown          60.000 [seconds] (default)
vsm_space                  1M [bytes] (default)
workspace_backend          1M [bytes]
workspace_client           1M [bytes]
workspace_session          5M [bytes]
workspace_thread           2k [bytes] (default)

Your Environment

fgsch commented 7 years ago

Is this reproducible?

nigoroll commented 7 years ago

there are two places in HSH_Lookup where we call VRY_Match, this is the one where we look into an existing object:

   0x000000000042edab <+843>:   callq  0x437bd0 <ObjHasAttr>
   0x000000000042edb0 <+848>:   test   %eax,%eax
   0x000000000042edb2 <+850>:   je     0x42edd9 <HSH_Lookup+889>
   0x000000000042edb4 <+852>:   xor    %ecx,%ecx
   0x000000000042edb6 <+854>:   mov    $0x5,%edx
   0x000000000042edbb <+859>:   mov    %r15,%rsi
   0x000000000042edbe <+862>:   mov    %r12,%rdi
   0x000000000042edc1 <+865>:   callq  0x437b20 <ObjGetAttr>
   0x000000000042edc6 <+870>:   mov    %rbx,%rdi
   0x000000000042edc9 <+873>:   mov    %rax,%rsi
   0x000000000042edcc <+876>:   callq  0x4451b0 <VRY_Match>
   0x000000000042edd1 <+881>:   test   %eax,%eax

->

        if (ObjHasAttr(wrk, oc, OA_VARY)) {
            vary = ObjGetAttr(wrk, oc, OA_VARY, NULL);
            if (!VRY_Match(req, vary))
                continue;
        }

So the interesting question is why we have a NULL OA_VARY Attribute. The place where we add it is vbf_beresp2obj but there we got asserts that vary != NULL IFF varyl > 0. So while ObjSetAttr does allow setting attributes with NULL values, I don't see (yet) why this is happening. I'd say the issue is genuine, #2130 was basically the same and closed without a real change (but the assert added which we're hitting here).

nigoroll commented 7 years ago

I made no progress staring at the code. @obliadp any chance to get hold of a core dump?

obliadp commented 7 years ago

No core dump available atm I'm afraid, I'll try to produce one. Meanwhile I've collected a panic, relevant varnishlog and varnishgather on filebin. Left the url with Espen as I couldn't find @nigoroll on irc.

obliadp commented 7 years ago

We've reverted to 4.1 for production, but might this still be interesting to look into?

espebra commented 7 years ago

@nigoroll is slink on irc. The url has been delivered.

nigoroll commented 7 years ago

@obliadp yes I definitely want to see this one resolved and I also received the tarball via espen. But unfortunately that does not give me any more clues regarding the root cause. A core dump would still be the most helpful, but I understand that you need to provide a stable service.

nigoroll commented 6 years ago

I'm back at this and think the problem here is that backend_error frees and recreated a storage object combined with the fact that freeobj clears the object, but not oa_present.