varnishcache / varnish-cache

Varnish Cache source code repository
https://www.varnish-cache.org

varnishstat MAIN.bans counter is wrong #2716

Closed: dugwood closed this issue 6 years ago

dugwood commented 6 years ago

Expected Behavior

The MAIN.bans counter should be around 1M.

Current Behavior

Instead, the counter jumps to a huge value:

MAIN.bans                  18446744073709551579          .   Count of bans
MAIN.bans_completed                           0          .   Number of bans marked 'completed'
MAIN.bans_obj              18446744073709551572          .   Number of bans using obj.*
MAIN.bans_req                                 0          .   Number of bans using req.*
MAIN.bans_added                          509494         0.74 Bans added
MAIN.bans_deleted                        509536         0.74 Bans deleted
MAIN.bans_tested                        1268737         1.84 Bans tested against objects (lookup)
MAIN.bans_obj_killed                       6377         0.01 Objects killed by bans (lookup)
MAIN.bans_lurker_tested             19320875202     27991.41 Bans tested against objects (lurker)
MAIN.bans_tests_tested                303786326       440.12 Ban tests tested against objects (lookup)
MAIN.bans_lurker_tests_tested       19320879749     27991.42 Ban tests tested against objects (lurker)
MAIN.bans_lurker_obj_killed              143565         0.21 Objects killed by bans (lurker)
MAIN.bans_lurker_obj_killed_cutoff            0         0.00 Objects killed by bans for cutoff (lurker)
MAIN.bans_dups                           414859         0.60 Bans superseded by other bans
MAIN.bans_lurker_contention                   0         0.00 Lurker gave way for lookup
MAIN.bans_persisted_bytes              81466732          .   Bytes used by the persisted ban lists
MAIN.bans_persisted_fragmentation      81466716          .   Extra bytes in persisted ban lists due to fragmentation

But the counter keeps working after the big jump: the next second, the value went from ...551579 to ...551580.

Possible Solution

No idea, sorry.

Steps to Reproduce (for bugs)

I don't know how to reproduce it. I've just observed this on a few servers after a while (at least 2 days of running).

Context

I'm monitoring bans to make sure there's no ban congestion, so the alert now fires and never clears (with the wrapped value, MAIN.bans - MAIN.bans_completed is about 1.8e19, far above the threshold). I'm using the following test: MAIN.bans - MAIN.bans_completed > 1000

Your Environment

nigoroll commented 6 years ago

First of all, why are you expecting MAIN.bans to be around 1M? Do you actually have that many bans active? Otherwise I would expect it to be quite low. The actual issue is interesting: it looks like an underflow, but at first sight we only change the bans counter under a lock, so this should not happen. Still checking...

nigoroll commented 6 years ago

just to be on the safe side, can you please post your vmod list?

dugwood commented 6 years ago

@nigoroll you're right, it shouldn't be that high. I thought of it as a counter; in the past I've used some ban variables that were counters. So yes, at most I would be around 100, sometimes around 5000 when I run a unique purge mechanism for many database objects (each one issues its own HTTP page ban rule).

I don't know how to list all vmods, but here's an excerpt of my VCL:

vcl 4.1;
import std;
import directors;
import geoip2;
[...]
include "datadome.vcl";
[...]

In datadome.vcl:

import data_dome_shield;
[...]

So I use (against my will, client is king...) vmods from MaxMind and DataDome.

Thanks.

dugwood commented 6 years ago

I see that another metric is wrong:

VBE.boot.server1.happy             18446744073709551615          .   Happy health probes

I don't recall the issue before using DataDome... but I'm not 100% sure either :-)

dridi commented 6 years ago

MAIN.bans is a gauge, so we're likely looking at an underflow. Since ban evaluation is done in parallel, outside the lock, and as a result possibly multiple times for a given object (against the same bans, as @mbgrydeland explained to me recently), maybe we run into situations where the counter is also decremented more than once per ban.
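
For reference, 18446744073709551579 is exactly 2^64 - 37, i.e. what an unsigned 64-bit gauge shows after being decremented 37 times past zero. A minimal standalone C sketch (not the actual Varnish code) of that wrap-around:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint64_t bans = 5;    /* a gauge holding the current ban count */

    /* Decrement it 42 times, i.e. 37 more times than it was ever
     * incremented: the unsigned value wraps instead of going negative. */
    for (int i = 0; i < 42; i++)
        bans--;
    printf("%" PRIu64 "\n", bans);    /* prints 18446744073709551579 */
    return (0);
}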

nigoroll commented 6 years ago

@Dridi please check the code where this counter is touched.

nigoroll commented 6 years ago

First of all, there is a very obvious hint at the root cause:

MAIN.bans_added                          509494         0.74 Bans added
MAIN.bans_deleted                        509536         0.74 Bans deleted

we cannot possibly delete more bans than we added (here bans_deleted exceeds bans_added by 42), and both counters are only incremented under the ban_mtx, with the exception of a ban_reload (but that only happens for persistent storage during init).

So I am currently pondering whether, for some reason, ban_cleantail could process some bans more than once.

nigoroll commented 6 years ago

BTW, happy is a bitfield, so the value is ok
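
For what it's worth, that value is simply a 64-bit bitmap with every bit set (i.e. the recent probes all succeeded); a quick standalone check:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint64_t happy = UINT64_MAX;    /* all 64 probe bits set */

    printf("%" PRIu64 " = 0x%" PRIX64 "\n", happy, happy);
    /* prints: 18446744073709551615 = 0xFFFFFFFFFFFFFFFF */
    return (0);
}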

dugwood commented 6 years ago

Oh, you're right, I have the same value on another cluster (which doesn't have the current issue BTW, but I don't know why). So it's all f's when viewed in hex, which is why I didn't see it at first.

nigoroll commented 6 years ago

Hypothesis: for bans, we access the VSC_C_main counters directly, while for worker threads we sum them under wstat_mtx. So VSC_main_Summ could race with the ban lurker and undo its changes, because the actual counter operations are not atomic. The fix is probably to use the normal pool stats for the ban lurker as well.
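
A minimal sketch of that kind of race, with made-up names rather than the actual Varnish code: two threads doing plain (non-atomic) read-modify-write on the same shared 64-bit gauge without a common lock, so one write can silently undo the other and the gauge can drift below zero and wrap:

/* compile with: cc -pthread race.c */
#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t main_bans;    /* shared gauge, like MAIN.bans */

static void *
adder(void *priv)    /* stands in for the stats summation side */
{
    (void)priv;
    for (int i = 0; i < 1000000; i++)
        main_bans++;    /* load, add, store: not atomic */
    return (NULL);
}

static void *
lurker(void *priv)    /* stands in for the ban lurker decrementing */
{
    (void)priv;
    for (int i = 0; i < 1000000; i++)
        main_bans--;    /* load, subtract, store: not atomic */
    return (NULL);
}

int
main(void)
{
    pthread_t ta, tl;

    pthread_create(&ta, NULL, adder, NULL);
    pthread_create(&tl, NULL, lurker, NULL);
    pthread_join(ta, NULL);
    pthread_join(tl, NULL);

    /* With equal increments and decrements the gauge should end at 0,
     * but lost updates can leave it slightly positive or, once it goes
     * below zero, wrapped to a value near 2^64 like the one reported. */
    printf("main_bans = %" PRIu64 "\n", main_bans);
    return (0);
}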

nigoroll commented 6 years ago

To be continued in the pull request which I just opened