netsampler / goflow2

High performance sFlow/IPFIX/NetFlow Collector
BSD 3-Clause "New" or "Revised" License
423 stars 99 forks source link

IPFIX Templates flush from memory quickly #330

Open bmarlow opened 4 weeks ago

bmarlow commented 4 weeks ago

Describe the bug When running the collector, it constantly gives the following error messages for a few minutes: WARN[0954] template error blocking=false count=1 error="receiver: message from [::ffff:10.89.1.16]:38705 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2

WARN[0954] template error blocking=false count=1 error="receiver: message from [::ffff:10.89.1.16]:38705 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2

Which I would expect, as in the beginning it hasn't gotten the template yet. After a few minutes when the template packets get sent I quit getting the error messages and everything decodes fine and the expected data gets sent off to kafka.

After about a minute, all of the sudden I start getting the same errors again. It will take about 5 minutes (the template interval) and the whole process will repeat itself.

To Reproduce Steps to reproduce the behavior:

  1. I run goflow2 as such: podman run --name goflow2 -dt -p 8080:8080/tcp -p 2055:2055/udp --network shared -v ./templates/:/templates/:z localhost/goflow2:latest -transport "kafka" -transport.kafka.topic ipfix-topic -transport.kafka.brokers "x.x.x.x:9092" -err.cnt 10000 -err.int "1s" -format "json" -transport.kafka.flushbytes 10000 -templates.path "/templates" -transport.kafka.version "3.7.0"
  2. Wait for templates to get sent
  3. Wait for goflow2 to mysteriously lose the templates

Expected behavior Once the templates are learned they shoudn't be forgotten.

Captures Attached is wirehark of the behavior. It is worth noting that wireshark properly decodes, so after it has received the template packets it goes and applies the template to all of the captured packets.

Sampler device:

GoFlow2:

Additional context Not sure what else to say other than after aquiring the template it randomly loses it. The IPFIX source reliably sends the template packets every 5 minutes, then shortly thereafter goflow2 loses it. Screenshot from 2024-06-07 18-37-46

lspgn commented 4 weeks ago

Hi @bmarlow,

Thank you for raising this issue.

A few things:

bmarlow commented 4 weeks ago

Hi @lspgn Thanks for getting back to me so quickly!

First, a point of correction. I previously stated the templates were being sent every 5 minutes, that is not correct, they are being sent every 10 minutes.

Now, in response to your questions:

~ $ ps -o pid -o etime -o comm -o args
PID   ELAPSED COMMAND          COMMAND
    1 15h14   goflow2          ./goflow2 -transport kafka -transport.kafka.topic ipfix-topic -transport.kafka.brokers 10.0.101.33:9092 -err.cnt 10000 -err.int 1s -format json -transport.kafka.flushbytes 10000 -templ
   10 15h14   sh               /bin/sh

Here is a zipped packet capture that captures all netflow data sent to the goflow2 collector over the span of about an hour. The most commonly used templates are 280 and 281 (which aligns with the error messages we see). You can also see that the template packets are sent like clockwork, every 10 minutes starting at packet 11929, and the templates data for 280 and 281 (and others) is setn in packet 11939 (and then every 10 minutes thereafter). ipfix.zip

lspgn commented 4 weeks ago

Really strange. Thank you for providing the packet capture.

I dived into the code:

https://github.com/netsampler/goflow2/blob/7265159473ecb80dae984d375682a132f963f270/decoders/netflow/templates.go#L60C32-L60C46

^ this function is not called, so I'm leaning towards a possible race condition

On small thing:

-templates.path "/templates"

This is an artifact of a previous CLI argument, it does not work.

It could be an issue with the "error aggregator" (controlled with err.int and err.cnt). Do you see missing samples in Kafka/database? Since there's one flow per set per packet, can you check with the Sequence Number?

Could you try without error logging?

goflow2 -transport "kafka" -transport.kafka.topic ipfix-topic -transport.kafka.brokers "x.x.x.x:9092" -format "json" -transport.kafka.flushbytes 10000 -templates.path "/templates" -transport.kafka.version "3.7.0"

Could you try without Kafka?

goflow2 -err.cnt 10000 -err.int "1s" -format "json" > /dev/null
bmarlow commented 3 weeks ago

Once again, thank you for the speedy response, it is greatly appreciated.

Ahh, yes, apologies the template flag was an artifact of my flailing around trying to understand what was going on.

I had previously tried this without the err.cnt and err.int flags, but was constantly getting muted messages in the logs, and It was not clear to me if the muting would cause it to potentially miss template packets, or if that was muting just for log suppression.

RE: DeleteTemplate function- good to know, I saw that function but wasn't sure if it was being called from somewhere I couldn't find (admittedly I'm not much of a Go person).

Test 1

[bmarlow@podman-host goflow2]$ podman run --name goflow2 -dt -p 8080:8080/tcp -p 2055:2055/udp --network shared localhost/goflow2:latest -transport "kafka" -transport.kafka.topic ipfix-topic -transport.kafka.brokers "10.0.101.33:9092" -format "json" -transport.kafka.flushbytes 10000 -templates.path "/templates" -transport.kafka.version "3.7.0"

And the behavior was the same. I verified that my kafka consumer was properly receiving messages by using another machine to connect to Kafka as a producer and send a message on that topic using the standard Kafka tools:

[bmarlow@aworkstation bin]$ ./kafka-console-producer.sh --bootstrap-server 10.0.101.33:9092 --topic ipfix-topic
>test
[bmarlow@bmarlow-anotherworkstation]$ python connector.py 
Streams initiated...
test

I have also attached a pcap during this time from the host running the goflow2 container. This PCAP captures both the IPFIX as well as the Kafka traffic. I'm not sure if the program is set to batch messages to kafka, but I see the following behavior:

  1. Template packets received
  2. Several messages sent to Kafka
  3. Suddenly messages to Kafka stop, within about a minute or so (hard to correlate this with exact log messages since the default log output suppresses)
  4. Goflow2 still receiving lots of IPFIX, but not sending anything to Kafka (at this point in time I would expect that it would be sending about 1 message a second to kafka, unless their is batching that I'm unaware of)
  5. Repeat 1-4 every 10 minutes, as template packets are received.

One interesting difference I see (maybe just because the logs are scrolling as fast) is that I now see a template error every once in a while in the logs:

WARN[1292] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.89.1.18]:47961 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[1293] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.89.1.18]:47961 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[1293] too many receiver messages, muting           
WARN[1302] skipped 191 receiver messages                
WARN[1302] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.89.1.18]:47961 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2

Looking at the PCAP:

ipfix_and_kafka.zip

Test 2

I started up goflow2 using the following command:

[bmarlow@podman-host]$ podman run --name goflow2 -dt -p 8080:8080/tcp -p 2055:2055/udp localhost/goflow2:latest -err.cnt 10000 -err.int "1s" -format "json" > /dev/null

I immediately started getting errors (as I would expect since I hadn't received the template yet):


WARN[0136] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0137] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2

After a while I received the template packets and things started properly recording:

{"type":"IPFIX","time_received_ns":1717953981897508819,"sequence_num":5383103,"sampling_rate":0,"sampler_address":"10.0.2.100","time_flow_start_ns":1717953981899000000,"time_flow_end_ns":1717953981899000000,"bytes":49200,"packets":400,"src_addr":"10.99.0.51","dst_addr":"10.128.1.140","etype":"IPv4","proto":"TCP","src_port":6443,"dst_port":56196,"in_if":7,"out_if":0,"src_mac":"aa:ae:30:ca:10:32","dst_mac":"0a:58:0a:80:01:8c","src_vlan":0,"dst_vlan":0,"vlan_id":0,"ip_tos":0,"forwarding_status":0,"ip_ttl":0,"ip_flags":0,"tcp_flags":0,"icmp_type":0,"icmp_code":0,"ipv6_flow_label":0,"fragment_id":0,"fragment_offset":0,"src_as":0,"dst_as":0,"next_hop":"","next_hop_as":0,"src_net":"0.0.0.0/0","dst_net":"0.0.0.0/0","bgp_next_hop":"","bgp_communities":[],"as_path":[],"mpls_ttl":[],"mpls_label":[],"mpls_ip":[],"observation_domain_id":0,"observation_point_id":0}
{"type":"IPFIX","time_received_ns":1717953982062822738,"sequence_num":5383104,"sampling_rate":0,"sampler_address":"10.0.2.100","time_flow_start_ns":1717953982064000000,"time_flow_end_ns":1717953982064000000,"bytes":33200,"packets":400,"src_addr":"10.128.0.52","dst_addr":"10.128.0.77","etype":"IPv4","proto":"TCP","src_port":6443,"dst_port":49144,"in_if":472,"out_if":0,"src_mac":"0a:58:0a:80:00:34","dst_mac":"0a:58:0a:80:00:4d","src_vlan":0,"dst_vlan":0,"vlan_id":0,"ip_tos":0,"forwarding_status":0,"ip_ttl":0,"ip_flags":0,"tcp_flags":0,"icmp_type":0,"icmp_code":0,"ipv6_flow_label":0,"fragment_id":0,"fragment_offset":0,"src_as":0,"dst_as":0,"next_hop":"","next_hop_as":0,"src_net":"0.0.0.0/0","dst_net":"0.0.0.0/0","bgp_next_hop":"","bgp_communities":[],"as_path":[],"mpls_ttl":[],"mpls_label":[],"mpls_ip":[],"observation_domain_id":0,"observation_point_id":0}

Then after ~<1 minute later, I start getting the error messages again, and no IPFIX data is logged (just errors)

WARN[0325] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:60372 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0325] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:60372 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0325] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:60372 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2

Then the pattern repeats:

Notes