splunk / splunk-connect-for-syslog

grouping-by / aggregate feature log loss #2364

Closed: olivierpas closed this issue 3 months ago

olivierpas commented 4 months ago

Was the issue replicated by support? No
What is the SC4S version? 3.22.1
Is there a pcap available? No
Is the issue related to the environment of the customer, or a software-related issue? No
Is it related to data loss? Please explain (protocol? hardware specs?): Yes. I am using grouping-by, and the sum of all "repetition" fields is not equivalent when I change the timeout value in my rewriter.

Last chance index/fallback index? Yes
Is the issue related to local customization? Yes
Do we have all the default indexes created? Yes

Describe the bug:

When I change the timeout from 60 to 180, it divides the _raw count and also the sum of repetition by 3. The repetition field is calculated each time a log matches.

[screenshot: count(_raw) and sum(repetition) before and after the timeout change]

To Reproduce
Steps to reproduce the behavior: change the timeout value.

app-dest-rewrite-pan_panos.conf.txt
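For reference, a minimal sketch of the grouping-by stanza such a rewriter presumably contains, reconstructed from the flow key quoted later in this thread; the block name, field names, and timeout values are assumptions, not the actual contents of the attachment:

    block parser app-dest-rewrite-pan_panos() {
        channel {
            parser {
                grouping-by(
                    # assumed key: one aggregation context per network flow
                    key("${.values.src_ip}/${.values.dest_ip}/${.values.src_port}/${.values.dest_port}")
                    aggregate(
                        # repetition = messages grouped into the context, minus one
                        value(".values.repetition" "$(- $(context-length) 1)")
                        inherit-mode(context)
                        tags("isGrouped")
                    )
                    timeout(60)   # changed to timeout(180) in the reported scenario
                );
            };
        };
    };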

mstopa-splunk commented 3 months ago

Hi @olivierpas, if that's your custom parser and the problem is directly with the syslog-ng DSL, we can help on a best-effort basis. Please provide a minimal reproducible example.

olivierpas commented 3 months ago

Hello @mstopa-splunk, and thank you for your response. What do you need exactly? The logs use the pan:traffic sourcetype (Palo Alto).

mstopa-splunk commented 3 months ago

@olivierpas I wondered if there was a way to recreate this in my lab, but let's start on your side.

So count(_raw) != sum(repetition). But both correctly dropped to ~30% after 10:00, when you changed the timeout. This looks correct, so why is changing the timeout important for this case?

If the problem is that count(_raw) != sum(repetition), please send the same graph split by SC4S container, to make sure that all your SC4S instances include app-dest-rewrite-paloalto_panos-d_fmt_hec_default.

If the problem is about the timeout, please explain.

olivierpas commented 3 months ago

@mstopa-splunk yes, the config is the same on all SC4S instances. The only issue is the difference in volume after changing the timeout value. I was wrong in my first request: the expected result is for sum(repetition) to be greater than the count(_raw) value when changing the timeout from 60 to 180. The goal is to aggregate similar logs, not to lose them.

[screenshot]

mstopa-splunk commented 3 months ago

Take a look at what I might be missing here:

  1. I tried to make a minimal reproducible example:

    block parser app-dest-test-grouping-by() {
        channel {
            rewrite {
                r_set_splunk_dest_default(sourcetype("test:grouping-by"));
                set("t_kv_values", value(".splunk.sc4s_template"));
            };

            parser {
                grouping-by(
                    # one aggregation context per sending host
                    key("${HOST}")
                    aggregate(
                        # repetition = messages in the context, minus one
                        value(".values.repetition" "$(- $(context-length) 1)")
                        inherit-mode(context)
                        tags("isGrouped")
                    )
                    # timeout(2)
                    # timeout(6)
                    timeout(12)
                );
            };
        };
    };

    application app-dest-test-grouping-by[sc4s-lp-dest-format-d_hec_fmt] {
        parser { app-dest-test-grouping-by(); };
    };
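In this example each host gets its own context. When the context closes, a single aggregated message is emitted with `.values.repetition` set to `$(context-length)` minus one, i.e. the number of messages grouped into the context minus one, so an event that was never repeated comes out with repetition=0.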


Then I was sending one event per second, but every 10 seconds I took a 10-second break:

    #!/bin/bash

    for ((hour=0; hour<1; hour++)); do
        for ((minute=0; minute<60; minute++)); do
            for ((second=0; second<60; second++)); do
                # every 10th second, pause for 10 seconds instead of 1
                if ((second > 0 && second % 10 == 0)); then
                    sleep 10
                else
                    sleep 1
                fi
                echo $second
                echo "hello world" > /dev/udp/0.0.0.0/514
            done
        done
    done



timeout=2 works well:
![image](https://github.com/splunk/splunk-connect-for-syslog/assets/139441697/622483ad-13cb-4e99-af1d-48efdf619139)

timeout=6 works well:
![image](https://github.com/splunk/splunk-connect-for-syslog/assets/139441697/c76ad1ab-6e3c-4c15-9923-d2582e637412)

timeout=12 is never reached, because the maximum break is 10 seconds, so syslog-ng keeps accumulating (https://axoflow.com/docs/axosyslog-core/chapter-correlating-log-messages/grouping-by-parser/grouping-by-parser-options/#grouping-by-parser-timeout).

Is it possible that some of your aggregation contexts for `"${.values.src_ip}/${.values.dest_ip}/${.values.src_port}/${.values.dest_port}"` cover flows that never pause for >= 180 seconds, so they are stuck and never reach Splunk because the timeout is reset with every new event?

olivierpas commented 3 months ago

Hello, thank you for your work. So your suggestion is that syslog-ng never releases the aggregated logs because the timeout is too high? OK, I will try to reduce it and let you know.

mstopa-splunk commented 3 months ago

> So your suggestion is that syslog-ng never releases the aggregated logs because the timeout is too high?

Exactly. When you use my conf and script but never take a break, syslog-ng aggregates forever. When you pause the sending process, syslog-ng waits until the timeout expires and only then releases the aggregated message. Every new event resets the timeout, and this is confirmed in their docs.
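A hedged sketch of the fix under discussion, reusing the flow key quoted earlier: keep the timeout below the longest idle gap expected inside a flow that should still flush. The trigger() option is documented for grouping-by as a filter that closes the context when it matches; whether a `$(context-length)` comparison works inside it is an assumption to verify, and the 1000-message threshold is illustrative:

    parser {
        grouping-by(
            key("${.values.src_ip}/${.values.dest_ip}/${.values.src_port}/${.values.dest_port}")
            aggregate(
                value(".values.repetition" "$(- $(context-length) 1)")
                inherit-mode(context)
                tags("isGrouped")
            )
            # contexts flush only after no new event arrives for this long,
            # so this must be shorter than the expected gaps within a flow
            timeout(60)
            # assumption to verify: force a flush once a context accumulates
            # 1000 messages, so busy flows cannot grow forever
            trigger("$(context-length)" >= "1000")
        );
    };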

mstopa-splunk commented 3 months ago

All right, for now I'm closing this issue as solved.