robcowart / elastiflow

Network flow analytics (Netflow, sFlow and IPFIX) with the Elastic Stack

Template length exceeds flowset length, skipping #205

Closed jlixfeld closed 4 years ago

jlixfeld commented 5 years ago

Hi there,

I don't think this is an EF issue, but I'd like to poke the crowd to see if anyone has any ideas before I bug the LS folks.

I started seeing my Elastiflow graphs jump up across all metrics from, for example, tens of Gbps to hundreds of Tbps. Obviously bad data.

I correlated it with these logs starting to appear:

[2018-10-31T11:24:17,297][WARN ][logstash.codecs.netflow  ] Template length exceeds flowset length, skipping {:template_id=>260, :template_length=>65, :record_length=>60}
[2018-10-31T11:24:47,739][INFO ][logstash.filters.translate] refreshing dictionary file
[2018-10-31T11:24:47,740][INFO ][logstash.filters.translate] refreshing dictionary file
[2018-10-31T11:24:52,784][WARN ][logstash.codecs.netflow  ] Template length exceeds flowset length, skipping {:template_id=>266, :template_length=>109, :record_length=>104}
[2018-10-31T11:24:55,245][WARN ][logstash.codecs.netflow  ] Template length exceeds flowset length, skipping {:template_id=>260, :template_length=>65, :record_length=>60}
[2018-10-31T11:25:12,174][WARN ][logstash.codecs.netflow  ] Template length exceeds flowset length, skipping {:template_id=>260, :template_length=>65, :record_length=>60}
[2018-10-31T11:25:18,456][WARN ][logstash.codecs.netflow  ] Template length exceeds flowset length, skipping {:template_id=>266, :template_length=>109, :record_length=>104}
[2018-10-31T11:25:18,732][WARN ][logstash.codecs.netflow  ] Template length exceeds flowset length, skipping {:template_id=>266, :template_length=>109, :record_length=>104}
[2018-10-31T11:25:20,861][WARN ][logstash.codecs.netflow  ] Template length exceeds flowset length, skipping {:template_id=>266, :template_length=>109, :record_length=>104}
[2018-10-31T11:25:22,720][WARN ][logstash.codecs.netflow  ] Template length exceeds flowset length, skipping {:template_id=>266, :template_length=>109, :record_length=>104}
[2018-10-31T11:25:50,674][WARN ][logstash.codecs.netflow  ] Template length exceeds flowset length, skipping {:template_id=>266, :template_length=>109, :record_length=>104}
[2018-10-31T11:25:57,563][WARN ][logstash.codecs.netflow  ] Template length exceeds flowset length, skipping {:template_id=>260, :template_length=>65, :record_length=>60}

I've seen posts in various forums about bugs in the netflow codec, but those bugs are from months ago, all of which seem to have been fixed(?). I just verified that I'm running the latest codec:

root@ls1:/var/log/logstash# /usr/share/logstash/bin/logstash-plugin update logstash-codec-netflow
Updating logstash-codec-netflow
No plugin updated
root@ls1:/var/log/logstash#

Does this look familiar to anyone?

Thanks!

robcowart commented 5 years ago

Most likely you have devices that are sending the same template ID (in this case 260) but the content of the flow records is different. The codec is trying to decode the flows from one device with the template from the other.

This is an issue with the codec. https://github.com/logstash-plugins/logstash-codec-netflow/issues/9

jlixfeld commented 5 years ago

I think the subtext in the issue you cite suggests a workaround could be to use a different dport on each exporter. Am I reading that correctly, or no?

jlixfeld commented 5 years ago

Hi Rob,

So it seems the only workaround is to stand up multiple LS instances. To that end, I've duplicated the elastiflow directories and added them as additional pipelines in pipelines.yml per this link:

https://www.elastic.co/guide/en/logstash/6.2/multiple-pipelines.html

I changed the port in each instance of elastiflow${copy_number}/conf.d/10_input_netflow_ipv4.logstash.conf from port => "${ELASTIFLOW_NETFLOW_IPV4_PORT:2055}" to simply port => 205x (the port is unique across each of the 5 config files).
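For reference, the pipeline definitions end up looking roughly like this in pipelines.yml (the pipeline IDs and paths below are just an illustration of my duplicated directories, not the exact names):

# /etc/logstash/pipelines.yml (IDs and paths are illustrative)
- pipeline.id: elastiflow1
  path.config: "/etc/logstash/elastiflow1/conf.d/*.conf"
- pipeline.id: elastiflow2
  path.config: "/etc/logstash/elastiflow2/conf.d/*.conf"
# ...one entry per duplicated directory, each with its own UDP port in its input config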

The first time I tried to restart, I was getting Java out-of-memory errors, even though I'd increased the memory allocated to the VM from 12GB to 32GB, so I then increased the heap settings to

-Xms20g
-Xmx20g

in /etc/logstash/jvm.options.

Now, logstash seems to be taking an hour to start, and it's still not done.

Last entry in the log was from an hour ago:

root@ls1:/etc/logstash# date
Mon Nov  5 11:33:25 EST 2018
root@ls1:/etc/logstash# tail -1 /var/log/logstash/logstash-plain.log
[2018-11-05T10:29:50,173][INFO ][logstash.runner          ] Starting Logstash {"logstash.version"=>"6.4.2"}
root@ls1:/etc/logstash#

Java is running:

root@ls1:/etc/logstash# ps xa|grep java
  840 ?        SNsl  70:51 /usr/bin/java -Xms20g -Xmx20g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djruby.compile.invokedynamic=true -Djruby.jit.threshold=0 -XX:+HeapDumpOnOutOfMemoryError -Djava.security.egd=file:/dev/urandom -cp /usr/share/logstash/logstash-core/lib/jars/animal-sniffer-annotations-1.14.jar:/usr/share/logstash/logstash-core/lib/jars/commons-codec-1.11.jar:/usr/share/logstash/logstash-core/lib/jars/commons-compiler-3.0.8.jar:/usr/share/logstash/logstash-core/lib/jars/error_prone_annotations-2.0.18.jar:/usr/share/logstash/logstash-core/lib/jars/google-java-format-1.1.jar:/usr/share/logstash/logstash-core/lib/jars/gradle-license-report-0.7.1.jar:/usr/share/logstash/logstash-core/lib/jars/guava-22.0.jar:/usr/share/logstash/logstash-core/lib/jars/j2objc-annotations-1.1.jar:/usr/share/logstash/logstash-core/lib/jars/jackson-annotations-2.9.5.jar:/usr/share/logstash/logstash-core/lib/jars/jackson-core-2.9.5.jar:/usr/share/logstash/logstash-core/lib/jars/jackson-databind-2.9.5.jar:/usr/share/logstash/logstash-core/lib/jars/jackson-dataformat-cbor-2.9.5.jar:/usr/share/logstash/logstash-core/lib/jars/janino-3.0.8.jar:/usr/share/logstash/logstash-core/lib/jars/jruby-complete-9.1.13.0.jar:/usr/share/logstash/logstash-core/lib/jars/jsr305-1.3.9.jar:/usr/share/logstash/logstash-core/lib/jars/log4j-api-2.9.1.jar:/usr/share/logstash/logstash-core/lib/jars/log4j-core-2.9.1.jar:/usr/share/logstash/logstash-core/lib/jars/log4j-slf4j-impl-2.9.1.jar:/usr/share/logstash/logstash-core/lib/jars/logstash-core.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.commands-3.6.0.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.contenttype-3.4.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.expressions-3.4.300.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.filesystem-1.3.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.jobs-3.5.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.resources-3.7.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.runtime-3.7.0.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.equinox.app-1.3.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.equinox.common-3.6.0.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.equinox.preferences-3.4.1.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.equinox.registry-3.5.101.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.jdt.core-3.10.0.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.osgi-3.7.1.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.text-3.5.101.jar:/usr/share/logstash/logstash-core/lib/jars/slf4j-api-1.7.25.jar org.logstash.Logstash --path.settings /etc/logstash
root@ls1:/etc/logstash#
top - 11:34:15 up  1:23,  1 user,  load average: 1.04, 1.01, 1.00
Tasks: 158 total,   1 running, 157 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy, 25.0 ni, 75.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32951964 total, 22114460 free, 10656812 used,   180692 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 21934460 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  840 logstash  39  19 23.585g 0.010t  19716 S 100.3 31.9  70:26.07 java

I'm going to assume you've had to take similar measures in some of your other deployments, for folks who have been running a homogeneous network with similar types of exports, similar SW versions, etc. Have you seen anything like this before, or do you just spin up entirely separate Logstash servers? Any pointers?

robcowart commented 5 years ago

The way I would handle this is to break the pipeline into two parts: a "collector", which is basically the UDP input w/codec and a redis output, and a "processor", which uses a redis input and does all of the post-decoder work and sends to Elasticsearch. You can then have multiple instances of the simple "collector" pipeline, each listening on a different port, and a single instance of the "processor". It would look like this...

collector---\
collector-- redis -- processor -- Elasticsearch
collector---/

Redis is really easy to set up in this use-case and can actually help reduce packet loss when receiving flow data via UDP (that is another article to write). I have often thought about adding Redis as a requirement for ElastiFlow, as it brings a lot of benefits.

You could also try the new pipeline input and output instead of using Redis. It should also work for your use-case, but I haven't tested it fully yet. Redis, on the other hand, is a proven technology for this scenario.

Either way, you will need to do a little restitching of the pipelines. It will be worth it, however, as you will not have to run multiple instances of the heavy processing logic.
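As a rough sketch (the Redis host and the "elastiflow" list key below are placeholders, not part of the stock ElastiFlow configs), the two pipelines would look something like this:

# collector.conf: one copy per UDP port, kept deliberately minimal
input {
  udp {
    port => 2055
    codec => netflow
  }
}
output {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "elastiflow"
  }
}

# processor.conf: a single instance; the existing ElastiFlow filter and
# Elasticsearch output configs would follow this input
input {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "elastiflow"
  }
}

Each collector pipeline is cheap (just the UDP input and codec), so running several of them on different ports avoids sharing a template cache between devices, while the single processor keeps all of the heavy ElastiFlow filter logic in one place.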

geriko2019 commented 5 years ago

(Quoting @robcowart's collector/processor suggestion above.)

Hi Rob, I am getting the same error with a single router configured. How else can Redis be configured? By default, Redis listens for TCP traffic on port 6379. My challenge is that logstash-codec-netflow can't be updated any further.

While using the following simple Logstash input & output configuration, I do get output:

input {
  udp { port => 2055 codec => netflow type => "netflow" queue_size => 10000 workers => 4 tags => [ "cisco_logs" ] }
}
output {
  elasticsearch { hosts => ["localhost:9200"] index => "test-%{+YYYY.MM.dd}" document_type => "netflow_logs" }
  stdout { codec => rubydebug }
}

      "host" => "XX.XX.XX.XX",
   "netflow" => {
        "l4_src_port" => 53,
          "direction" => 1,
            "in_pkts" => 1,
             "src_as" => 0,
       "flow_seq_num" => 4798,
           "in_bytes" => 88,
           "protocol" => 17,
    "flow_sampler_id" => 0,
      "ipv4_src_addr" => "XX.XX.XX.XX",
             "dst_as" => 0,
     "first_switched" => "2019-01-14T09:11:27.999Z",
      "ipv4_next_hop" => "XX.XX.XX.XX",
         "flowset_id" => 256,
        "l4_dst_port" => 54678,
            "version" => 9,
      "last_switched" => "2019-01-14T09:11:27.999Z",
           "src_mask" => 24,
        "output_snmp" => 1,
            "src_tos" => 0,
         "input_snmp" => 5,
          "tcp_flags" => 16,
           "dst_mask" => 26,
      "ipv4_dst_addr" => "XX.XX.XX.XX"
},
"@timestamp" => 2019-01-14T09:11:42.000Z,
      "type" => "netflow",
      "tags" => [
    [0] "cisco_logs"
],
  "@version" => "1"

How do I get my ElastiFlow graphs to work?

Note: I used the following instructions to set up ElastiFlow: (URL: https://pandaways.com/elastiflow-with-mikrotik-and-centos-7/)

robcowart commented 5 years ago

If all that you have is one source of flows, and you are seeing this error, then something else is wrong. It is possible that the template the vendor is sending doesn't match the flow records. I would have to look at a PCAP which contains both the template and a flow record in order to investigate.

geriko2019 commented 5 years ago

I hope the PCAP is sufficient (attached): elastiflow.zip

robcowart commented 5 years ago

You have two different devices sending flows. Both send a template with ID 256, but the templates contain different fields. This will cause issues, as one template will be used to decode both sources, meaning at least one of them will always be decoded incorrectly. The only workaround for this at the moment is to customize ElastiFlow for multiple collectors, with a common processing instance and some form of message queue in between.

geriko2019 commented 5 years ago

thanks rob

geriko2019 commented 5 years ago

Hi Rob, I've re-installed ElastiFlow and configured one device. The following error seems to persist (latest pcap file attached):

[2019-01-16T09:57:02,462][WARN ][logstash.codecs.netflow ] Can't (yet) decode flowset id 256 from source id 0, because no template to decode it with has been received. This message will usually go away after 1 minute.
[2019-01-16T09:57:13,469][WARN ][logstash.codecs.netflow ] Can't (yet) decode flowset id 256 from source id 0, because no template to decode it with has been received. This message will usually go away after 1 minute.
[2019-01-16T09:57:26,944][WARN ][logstash.codecs.netflow ] Can't (yet) decode flowset id 256 from source id 0, because no template to decode it with has been received. This message will usually go away after 1 minute.
[2019-01-16T09:57:38,409][WARN ][logstash.codecs.netflow ] Can't (yet) decode flowset id 256 from source id 0, because no template to decode it with has been received. This message will usually go away after 1 minute.
[2019-01-16T09:57:49,407][WARN ][logstash.codecs.netflow ] Can't (yet) decode flowset id 256 from source id 0, because no template to decode it with has been received. This message will usually go away after 1 minute.
[2019-01-16T09:57:56,394][WARN ][logstash.codecs.netflow ] Can't (yet) decode flowset id 256 from source id 0, because no template to decode it with has been received. This message will usually go away after 1 minute.
[2019-01-16T09:58:07,392][WARN ][logstash.codecs.netflow ] Can't (yet) decode flowset id 256 from source id 0, because no template to decode it with has been received. This message will usually go away after 1 minute.
[2019-01-16T09:58:19,388][WARN ][logstash.codecs.netflow ] Can't (yet) decode flowset id 256 from source id 0, because no template to decode it with has been received. This message will usually go away after 1 minute.
[2019-01-16T09:58:31,389][WARN ][logstash.codecs.netflow ] Can't (yet) decode flowset id 256 from source id 0, because no template to decode it with has been received. This message will usually go away after 1 minute.
[2019-01-16T09:58:43,387][WARN ][logstash.codecs.netflow ] Can't (yet) decode flowset id 256 from source id 0, because no template to decode it with has been received. This message will usually go away after 1 minute.

netflow-20190116.zip

geriko2019 commented 5 years ago

I'm using ElastiFlow v2.1.0

robcowart commented 5 years ago

How long has it been running? Some devices are really slow at sending templates. Fortinet in particular can take 15-30 minutes.

geriko2019 commented 5 years ago

Over an hour.

geriko2019 commented 5 years ago

issue was resolved. Thanks

ThatMorneGuy commented 5 years ago

(Quoting @robcowart's collector/processor suggestion above.)

@robcowart Do you have any guides on how to implement this?

odgeza commented 5 years ago

Out of interest, does the elasticsearch elastiflow netflow plugin also have the same problem with templates overlapping?

robcowart commented 5 years ago

The root cause is related to where input codecs (which decode the raw flow data) are located. So it is a Logstash-specific issue. Any solution based on Logstash will have the same challenge.

sgreszcz commented 4 years ago

(Quoting @robcowart's earlier reply about multiple devices sending the same template ID; see logstash-plugins/logstash-codec-netflow#9.)

@robcowart looks like there is a patch for this: https://github.com/logstash-plugins/logstash-codec-netflow/issues/9#issuecomment-552129326

But with Logstash 7.4.2 being broken with the netflow codec (https://github.com/robcowart/elastiflow/issues/427), I had to roll back to 7.3.2 (for some reason 7.4.2 works with collectd over UDP). I'm wondering if I should start looking at something else for processing netflow....

Filebeat also has its problems, especially around packets seemingly not being processed. I'll need to look at this tomorrow...

I'm assuming that if I get entries like this in my Logstash netflow codec logs, then my netflow data isn't being processed properly?

[2019-11-11T20:57:59,701][WARN ][logstash.codecs.netflow ] Template length exceeds flowset length, skipping {:template_id=>257, :template_length=>91, :record_length=>52}

robcowart commented 4 years ago

This issue will be addressed once the following PRs are merged and released for the...

Logstash UDP Input: https://github.com/logstash-plugins/logstash-input-udp/pull/46
Logstash Netflow Codec: https://github.com/logstash-plugins/logstash-codec-netflow/pull/187

efbertos commented 4 years ago

(Quoting @robcowart's collector/processor suggestion and @ThatMorneGuy's question about implementation guides above.)

I am wondering the same.

robcowart commented 4 years ago

Unfortunately the Elastic team declined to merge the UDP input changes (see logstash-plugins/logstash-input-udp#46). This leaves no option other than to continue recommending the workaround of running multiple instances of the ElastiFlow pipeline.

robcowart commented 3 years ago

This is no longer a problem for the new ElastiFlow Unified Flow Collector. In particular the new collector properly handles templates across devices and observation domains. More details are available... HERE.