varnish / varnish-modules

Collection of Varnish Cache modules (vmods) by Varnish Software
Other
185 stars 86 forks source link

XKEY.g_ochead_bytes underflows #159

Closed xordspar0 closed 4 years ago

xordspar0 commented 4 years ago

Over time while Varnish is running, the XKEY.g_ochead_bytes counter sometimes underflows.

image

In this chart, the far-right value is 18446744073708882000, which should be -669616 except that the counter is an unsigned 64-bit integer.

Note that g_ochead_bytes is a DEBUG level counter. To see it in varnish stat you need to press v a couple times.

gquintard commented 4 years ago

the counter updates are not mutex'd, so that's expected

you can either lock around the updates, or split them between alloc and free and let the user compute the actual value, guarding against "underflows" (more free than alloc)

benzvan commented 4 years ago

I think I'd prefer locking around the updates. I'm also seeing this happen with MAIN.n_object in both 6.0.4 and 6.4.0, so possibly the same issue?

gquintard commented 4 years ago

same behavior, wouldn't call it an issue has it was designed that way, but yes.

be careful with locking as it will impact performance

dridi commented 4 years ago

Updates do happen under vmod_xkey's mutex for tree operations, so maybe they aren't balanced?

At first glance though, they seem to be.

rezan commented 4 years ago

We had discussed this on IRC a bit. MAIN.n_object is correct, its just that the worker operations hit the sumfunction out of order. So underflowed readings are possible and the counts will return to normal after everything is idle.

dridi commented 4 years ago

I was looking at the wrong gauge, and it looks like g_ochead_bytes is doing something funky with regards to pooling, but it looks otherwise balanced (at first glance).

benzvan commented 4 years ago

I did a little more investigation of the symptoms, but I haven't dug into the code yet. This graph appears to show the g_bytes metric is decreasing more than it is increasing.

Screen Shot 2020-03-19 at 07 35 47
rezan commented 4 years ago

Glancing at the code, you are accounting differently here by skipping the pool allocation:

https://github.com/varnish/varnish-modules/blob/cd8d5c9f9ed2624770490e8f94dce6ee9fea3703/src/vmod_xkey.c#L224-L233

Everywhere else you do accounting including the pools. This would definitely cause an underflow.

benzvan commented 4 years ago

I just took a quick look at our metrics and oc_bytes is definitely the one that's underflowing. We'll try to figure out how to write a test for this and make a PR.