grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

Aggregation problem with SUM #418

Closed loguido closed 4 years ago

loguido commented 4 years ago

I've got a strange issue when I try to use SUM as the aggregation function. Sometimes it works, and sometimes it uses AVERAGE, which is my default. Restarting carbon-c-relay usually fixes the issue for a while. This is the aggregator part of the config:

aggregate ^servers.(.+).(.+)-(canary|stable)-[^.]+.(.+).(.+).([cC]ount) every 60 seconds expire after 90 seconds compute sum write to servers-aggr.\1.aggregated.\4.\5.\6 send to local_carbon ;

aggregate ^servers.(.+|.+).(.+|.+)-(canary|stable)-[^.]+.(.+).(.+).([^.]+) every 60 seconds expire after 90 seconds compute average write to servers-aggr.\1.aggregated.\4.\5.\6 send to local_carbon ;

Is there any known problem with using SUM? Thank you.

grobian commented 4 years ago

I think the problem is that both regexes match the same data and you produce the same output metric for them, so there's a race: whichever aggregate gets emitted last determines the final value written in the storage layer.

% ./relay -t -f issue418
[2020-07-21 09:45:02] starting carbon-c-relay v3.7.1 (56fc93-dirty), pid=4863
configuration:
    relay hostname = nut.cheops.bitzolder.nl
    workers = 8
    send batch size = 2500
    server queue size = 25000
    server max stalls = 4
    listen backlog = 32
    server connection IO timeout = 600ms
    idle connections disconnect timeout = 10m
    configuration = issue418

parsed configuration follows:
listen
    type linemode
        2003 proto tcp
        2003 proto udp
        /tmp/.s.carbon-c-relay.2003 proto unix
    ;

statistics
    submit every 60 seconds
    prefix with carbon.relays.nut_cheops_bitzolder_nl
    ;

cluster local_carbon
    forward
        127.0.0.1:12345
    ;

aggregate ^servers.(.+).(.+)-(canary|stable)-[^.]+.(.+).(.+).([cC]ount)
    every 60 seconds
    expire after 90 seconds
    timestamp at end of bucket
    compute sum write to
        servers-aggr.\1.aggregated.\4.\5.\6 
    send to local_carbon
    ;
aggregate ^servers.(.+|.+).(.+|.+)-(canary|stable)-[^.]+.(.+).(.+).([^.]+)
    every 60 seconds
    expire after 90 seconds
    timestamp at end of bucket
    compute average write to
        servers-aggr.\1.aggregated.\4.\5.\6 
    send to local_carbon
    ;

servers.foo.bar-stable-bla.foo.bar.count 12 12
aggregation
    ^servers.(.+).(.+)-(canary|stable)-[^.]+.(.+).(.+).([cC]ount) (regex) -> servers.foo.bar-stable-bla.foo.bar.count
    sum(servers-aggr.\1.aggregated.\4.\5.\6) -> servers-aggr.foo.b.aggregated.foo.b.r.count 12 12
    forward(local_carbon)
        127.0.0.1:12345
aggregation
    ^servers.(.+|.+).(.+|.+)-(canary|stable)-[^.]+.(.+).(.+).([^.]+) (regex) -> servers.foo.bar-stable-bla.foo.bar.count
    average(servers-aggr.\1.aggregated.\4.\5.\6) -> servers-aggr.foo.b.aggregated.foo.bar.c.u.t 12 12
    forward(local_carbon)
        127.0.0.1:12345

grobian commented 4 years ago

Perhaps you're missing the "stop" keyword in your first aggregation?

aggregate ^servers.(.+).(.+)-(canary|stable)-[^.]+.(.+).(.+).([cC]ount)
    every 60 seconds
    expire after 90 seconds
    timestamp at end of bucket
    compute sum write to
        servers-aggr.\1.aggregated.\4.\5.\6
    send to local_carbon
    stop
    ;
aggregate ^servers.(.+|.+).(.+|.+)-(canary|stable)-[^.]+.(.+).(.+).([^.]+)
    every 60 seconds
    expire after 90 seconds
    timestamp at end of bucket
    compute average write to
        servers-aggr.\1.aggregated.\4.\5.\6
    send to local_carbon
    ;

servers.foo.bar-stable-bla.foo.bar.count 12 12
aggregation
    ^servers.(.+).(.+)-(canary|stable)-[^.]+.(.+).(.+).([cC]ount) (regex) -> servers.foo.bar-stable-bla.foo.bar.count
    sum(servers-aggr.\1.aggregated.\4.\5.\6) -> servers-aggr.foo.b.aggregated.foo.b.r.count 12 12
    forward(local_carbon)
        127.0.0.1:12345
    stop

servers.foo.bar-stable-bla.foo.bar.sum 13 14
aggregation
    ^servers.(.+|.+).(.+|.+)-(canary|stable)-[^.]+.(.+).(.+).([^.]+) (regex) -> servers.foo.bar-stable-bla.foo.bar.sum
    average(servers-aggr.\1.aggregated.\4.\5.\6) -> servers-aggr.foo.b.aggregated.foo.bar.s.m 13 14
    forward(local_carbon)
        127.0.0.1:12345

loguido commented 4 years ago

Oh, you're right, I was just looking at this. I had not used stop because I was also writing the original metric at the end of the config:

match * send to local_carbon stop ;

So I moved this directive to the beginning (without stop) and used stop in the aggregation sections, as you suggested.
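
Put together, the layout described above would look roughly like the following. This is a sketch assembled from the snippets in this thread, not the actual config file: the cluster definition is the placeholder from the test run earlier in the thread, the regexes are copied as posted, and putting stop only on the first aggregation is an assumption.

```
# Placeholder cluster, copied from the test run earlier in this thread.
cluster local_carbon
    forward
        127.0.0.1:12345
    ;

# Forward every original metric first. No "stop" here, so the
# aggregate rules below still get to see each metric.
match *
    send to local_carbon
    ;

# Counters: sum them, then stop so the average rule below never
# processes the same metric and emits a competing value.
aggregate ^servers.(.+).(.+)-(canary|stable)-[^.]+.(.+).(.+).([cC]ount)
    every 60 seconds
    expire after 90 seconds
    compute sum write to
        servers-aggr.\1.aggregated.\4.\5.\6
    send to local_carbon
    stop
    ;

# Everything else that matches the naming scheme: average (the default).
aggregate ^servers.(.+|.+).(.+|.+)-(canary|stable)-[^.]+.(.+).(.+).([^.]+)
    every 60 seconds
    expire after 90 seconds
    compute average write to
        servers-aggr.\1.aggregated.\4.\5.\6
    send to local_carbon
    ;
```

A file like this can be checked the same way as in the test run above (relay -t -f with the config file), feeding it a sample metric line to see which rules fire.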

Thank you!

grobian commented 4 years ago

happy to help :)