First of all, why are you expecting MAIN.bans to be around 1M? Do you actually have that many bans active? Otherwise I would expect it to be quite low.
The actual issue is interesting. It looks like an underflow, but at first sight we are only changing the bans counter under a lock, so this should not happen. Still checking...
Just to be on the safe side, can you please post your vmod list?
@nigoroll you're right, it shouldn't be that high. I thought of it as a counter; in the past I've used some ban variables that were counters. So yes, at most I would be around 100, sometimes around 5000 when I run a unique purge mechanism for many database objects (each would run its own HTTP page ban rule).
Don't know how to list all vmods, but here's an excerpt of my VCL:
vcl 4.1;
import std;
import directors;
import geoip2;
[...]
include "datadome.vcl";
[...]
In datadome.vcl:
import data_dome_shield;
[...]
So I use (against my will, client is king...) vmods from MaxMind and DataDome.
Thanks.
I see that another metric is wrong:
VBE.boot.server1.happy 18446744073709551615 . Happy health probes
I don't recall the issue before using DataDome... but I'm not 100% sure either :-)
MAIN.bans is a gauge, so we're likely looking at an underflow. Since ban evaluation is done in parallel, out of lock, and as a result possibly multiple times for the same bans on a given object (as @mbgrydeland explained to me recently), maybe we run into situations where the counter is also decremented more than once per ban.
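As an illustration of that suspected failure mode (a minimal sketch, not the actual Varnish code): if two threads both decide they are responsible for retiring the same ban, an unsigned gauge gets decremented once more than it was incremented and wraps around to 2^64 - 1, which is exactly the kind of value reported above.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustration only, not Varnish code: an unsigned gauge that is
 * decremented once more than it was incremented wraps around. */
int
main(void)
{
	uint64_t bans = 1;	/* one ban left on the list */

	bans--;			/* first thread retires the ban */
	bans--;			/* second thread retires it again */

	/* prints 18446744073709551615, i.e. UINT64_MAX */
	printf("MAIN.bans = %llu\n", (unsigned long long)bans);
	return (0);
}
```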
@Dridi please check the code where this counter is touched.
First of all, there is a very obvious indication of the root cause:
MAIN.bans_added 509494 0.74 Bans added
MAIN.bans_deleted 509536 0.74 Bans deleted
We cannot possibly delete more bans than we added, and both counters are also incremented under the ban_mtx, with the exception of a ban_reload (but that only happens for persistent storage during init).
So I am currently pondering if, for some reason, ban_cleantail could work some bans more than once.
BTW, happy is a bitfield, so the value is OK.
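For context, here is a sketch of the general idea (not the exact Varnish implementation): happy is a 64-bit bitmap where each health probe shifts in one result bit, so 64 consecutive good probes set every bit and the raw value reads as 2^64 - 1.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of a "happy"-style bitmap: one bit per probe result,
 * newest probe in the lowest bit. Not the actual Varnish code. */
int
main(void)
{
	uint64_t happy = 0;
	int i;

	for (i = 0; i < 64; i++)
		happy = (happy << 1) | 1;	/* 64 successful probes */

	/* prints 18446744073709551615 decimal, ffffffffffffffff hex */
	printf("%llu = %llx\n",
	    (unsigned long long)happy, (unsigned long long)happy);
	return (0);
}
```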
Oh, you're right, I have the same value on another cluster (which doesn't have the current issue BTW, but I don't know why). So it's ffffffffff when viewed as top; that's why I didn't see it at first.
Hypothesis: for bans, we access the VSC_C_main counters directly, while for worker threads we sum them under wstat_mtx. So VSC_main_Summ could race with the ban lurker and undo changes, because the actual counter ops are not atomic.
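A minimal sketch of that kind of lost update (the interleaving and the variable name are assumptions for illustration, not the real VSC code): if the summing code does a non-atomic read-modify-write of the global counter while the ban lurker decrements the same field directly, the summer's write-back can overwrite, i.e. undo, the lurker's change.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical sketch of the suspected lost-update race; the name
 * "global_bans" and the interleaving shown in the comments are
 * assumptions, not the actual VSC code.
 */
static uint64_t global_bans = 100;	/* shared gauge, no lock */

int
main(void)
{
	uint64_t summ;

	/* Step 1: the summing thread reads the global counter as
	 * the base for its read-modify-write. */
	summ = global_bans;		/* summ == 100 */

	/* Step 2: meanwhile, the ban lurker retires a ban and
	 * decrements the shared counter directly. */
	global_bans--;			/* global_bans == 99 */

	/* Step 3: the summing thread adds its worker deltas (none
	 * here) and writes back its stale base value, silently
	 * undoing the lurker's decrement. */
	global_bans = summ;		/* back to 100 */

	printf("MAIN.bans = %llu (one decrement lost)\n",
	    (unsigned long long)global_bans);
	return (0);
}
```

An undone decrement inflates the gauge; an undone increment means a later legitimate decrement can take the counter below its true value and, eventually, below zero, where an unsigned gauge wraps to 18446744073709551615.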
The fix is probably to use the normal pool stats for the ban lurker as well.
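A rough sketch of the suggested direction (assumed structure and names, not the actual patch): the lurker would accumulate its deltas in a private stats block, and only the summation code, holding wstat_mtx, would fold those deltas into the global counter, so each global field has a single writer.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical per-lurker stats block, for illustration only. */
struct lurker_stats {
	int64_t bans;			/* delta since last summation */
};

static pthread_mutex_t my_wstat_mtx = PTHREAD_MUTEX_INITIALIZER;
static uint64_t global_bans = 100;

static void
lurker_retire_ban(struct lurker_stats *ls)
{
	ls->bans--;			/* private field, no lock needed */
}

static void
sum_stats(struct lurker_stats *ls)
{
	pthread_mutex_lock(&my_wstat_mtx);
	/* Negative deltas wrap correctly under unsigned addition. */
	global_bans += (uint64_t)ls->bans;
	ls->bans = 0;
	pthread_mutex_unlock(&my_wstat_mtx);
}

int
main(void)
{
	struct lurker_stats ls = { 0 };

	lurker_retire_ban(&ls);
	sum_stats(&ls);			/* global_bans is now 99 */
	return (0);
}
```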
To be continued in the pull request which I just opened.
Expected Behavior
The MAIN.bans counter should be approximately 1M.
Current Behavior
The counter goes up to a very big value. But the counter is still working after the big jump: the next second, the value went from 1579 to 1580.
Possible Solution
No idea, sorry.
Steps to Reproduce (for bugs)
I honestly don't know why it's doing this. I've just observed it on a few servers, after a while (at least 2 days).
Context
I'm monitoring bans to check for congestion; because of this bug, the alert goes off and is never cleared. I'm using the following test:
MAIN.bans - MAIN.bans_completed > 1000
Your Environment