Closed bmarlow closed 3 months ago
Hi @bmarlow,
Thank you for raising this issue.
A few things:
main
?Hi @lspgn Thanks for getting back to me so quickly!
First, a point of correction. I previously stated the templates were being sent every 5 minutes, that is not correct, they are being sent every 10 minutes.
Now, in response to your questions:
~ $ ps -o pid -o etime -o comm -o args
PID ELAPSED COMMAND COMMAND
1 15h14 goflow2 ./goflow2 -transport kafka -transport.kafka.topic ipfix-topic -transport.kafka.brokers 10.0.101.33:9092 -err.cnt 10000 -err.int 1s -format json -transport.kafka.flushbytes 10000 -templ
10 15h14 sh /bin/sh
I only have one instance of goflow2 running (it is running in podman)
[bmarlow@podman-host ~]$ podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c93109b6ed82 docker.io/apache/kafka:3.7.0 /etc/kafka/docker... 16 hours ago Up 16 hours 0.0.0.0:9092->9092/tcp kafka
6135e40e0cca localhost/goflow2:latest -transport kafka ... 15 hours ago Up 15 hours 0.0.0.0:8080->8080/tcp, 0.0.0.0:2055->2055/udp goflow2
Unfortunately the system sending the flow data isn't configurable with regard to the template interval. That being said, the templates are sent every 10 minutes, as the RFC suggests (RFC5153-6.2)
The domain ID and the sourceIP/port are always the same, the template numbers 280,281,284, 285 (there may be a couple of others, but of those 280 and 281 are 99% of the errors)
This container was built of the current main branch.
[bmarlow@podman-host goflow2]$ git rev-parse --verify HEAD
7265159473ecb80dae984d375682a132f963f270
[bmarlow@podman-host goflow2]$ git rev-parse --short HEAD
7265159
It was built using the following command from the top level inside the goflow2 directory
[bmarlow@podman-host goflow2]$ buildah bud -t goflow:latest .
And was run with this command:
[bmarlow@podman-host goflow2]$ podman run --name goflow2 -dt -p 8080:8080/tcp -p 2055:2055/udp --network shared -v ./templates/:/templates/:z localhost/goflow2:latest -transport "kafka" -transport.kafka.topic ipfix-topic -transport.kafka.brokers "10.0.101.33:9092" -err.cnt 10000 -err.int "1s" -format "json" -transport.kafka.flushbytes 10000 -templates.path "/templates" -transport.kafka.version "3.7.0"
Here is a zipped packet capture that captures all netflow data sent to the goflow2 collector over the span of about an hour. The most commonly used templates are 280 and 281 (which aligns with the error messages we see). You can also see that the template packets are sent like clockwork, every 10 minutes starting at packet 11929, and the templates data for 280 and 281 (and others) is setn in packet 11939 (and then every 10 minutes thereafter). ipfix.zip
Really strange. Thank you for providing the packet capture.
I dived into the code:
^ this function is not called, so I'm leaning towards a possible race condition
On small thing:
-templates.path "/templates"
This is an artifact of a previous CLI argument, it does not work.
It could be an issue with the "error aggregator" (controlled with err.int
and err.cnt
).
Do you see missing samples in Kafka/database? Since there's one flow per set per packet, can you check with the Sequence Number?
Could you try without error logging?
goflow2 -transport "kafka" -transport.kafka.topic ipfix-topic -transport.kafka.brokers "x.x.x.x:9092" -format "json" -transport.kafka.flushbytes 10000 -templates.path "/templates" -transport.kafka.version "3.7.0"
Could you try without Kafka?
goflow2 -err.cnt 10000 -err.int "1s" -format "json" > /dev/null
Once again, thank you for the speedy response, it is greatly appreciated.
Ahh, yes, apologies the template flag was an artifact of my flailing around trying to understand what was going on.
I had previously tried this without the err.cnt and err.int flags, but was constantly getting muted messages in the logs, and It was not clear to me if the muting would cause it to potentially miss template packets, or if that was muting just for log suppression.
RE: DeleteTemplate function- good to know, I saw that function but wasn't sure if it was being called from somewhere I couldn't find (admittedly I'm not much of a Go person).
[bmarlow@podman-host goflow2]$ podman run --name goflow2 -dt -p 8080:8080/tcp -p 2055:2055/udp --network shared localhost/goflow2:latest -transport "kafka" -transport.kafka.topic ipfix-topic -transport.kafka.brokers "10.0.101.33:9092" -format "json" -transport.kafka.flushbytes 10000 -templates.path "/templates" -transport.kafka.version "3.7.0"
And the behavior was the same. I verified that my kafka consumer was properly receiving messages by using another machine to connect to Kafka as a producer and send a message on that topic using the standard Kafka tools:
[bmarlow@aworkstation bin]$ ./kafka-console-producer.sh --bootstrap-server 10.0.101.33:9092 --topic ipfix-topic
>test
[bmarlow@bmarlow-anotherworkstation]$ python connector.py
Streams initiated...
test
I have also attached a pcap during this time from the host running the goflow2 container. This PCAP captures both the IPFIX as well as the Kafka traffic. I'm not sure if the program is set to batch messages to kafka, but I see the following behavior:
One interesting difference I see (maybe just because the logs are scrolling as fast) is that I now see a template error every once in a while in the logs:
WARN[1292] template error blocking=false count=1 error="receiver: message from [::ffff:10.89.1.18]:47961 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[1293] template error blocking=false count=1 error="receiver: message from [::ffff:10.89.1.18]:47961 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[1293] too many receiver messages, muting
WARN[1302] skipped 191 receiver messages
WARN[1302] template error blocking=false count=1 error="receiver: message from [::ffff:10.89.1.18]:47961 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
Looking at the PCAP:
I started up goflow2 using the following command:
[bmarlow@podman-host]$ podman run --name goflow2 -dt -p 8080:8080/tcp -p 2055:2055/udp localhost/goflow2:latest -err.cnt 10000 -err.int "1s" -format "json" > /dev/null
I immediately started getting errors (as I would expect since I hadn't received the template yet):
WARN[0136] template error blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0137] template error blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
After a while I received the template packets and things started properly recording:
{"type":"IPFIX","time_received_ns":1717953981897508819,"sequence_num":5383103,"sampling_rate":0,"sampler_address":"10.0.2.100","time_flow_start_ns":1717953981899000000,"time_flow_end_ns":1717953981899000000,"bytes":49200,"packets":400,"src_addr":"10.99.0.51","dst_addr":"10.128.1.140","etype":"IPv4","proto":"TCP","src_port":6443,"dst_port":56196,"in_if":7,"out_if":0,"src_mac":"aa:ae:30:ca:10:32","dst_mac":"0a:58:0a:80:01:8c","src_vlan":0,"dst_vlan":0,"vlan_id":0,"ip_tos":0,"forwarding_status":0,"ip_ttl":0,"ip_flags":0,"tcp_flags":0,"icmp_type":0,"icmp_code":0,"ipv6_flow_label":0,"fragment_id":0,"fragment_offset":0,"src_as":0,"dst_as":0,"next_hop":"","next_hop_as":0,"src_net":"0.0.0.0/0","dst_net":"0.0.0.0/0","bgp_next_hop":"","bgp_communities":[],"as_path":[],"mpls_ttl":[],"mpls_label":[],"mpls_ip":[],"observation_domain_id":0,"observation_point_id":0}
{"type":"IPFIX","time_received_ns":1717953982062822738,"sequence_num":5383104,"sampling_rate":0,"sampler_address":"10.0.2.100","time_flow_start_ns":1717953982064000000,"time_flow_end_ns":1717953982064000000,"bytes":33200,"packets":400,"src_addr":"10.128.0.52","dst_addr":"10.128.0.77","etype":"IPv4","proto":"TCP","src_port":6443,"dst_port":49144,"in_if":472,"out_if":0,"src_mac":"0a:58:0a:80:00:34","dst_mac":"0a:58:0a:80:00:4d","src_vlan":0,"dst_vlan":0,"vlan_id":0,"ip_tos":0,"forwarding_status":0,"ip_ttl":0,"ip_flags":0,"tcp_flags":0,"icmp_type":0,"icmp_code":0,"ipv6_flow_label":0,"fragment_id":0,"fragment_offset":0,"src_as":0,"dst_as":0,"next_hop":"","next_hop_as":0,"src_net":"0.0.0.0/0","dst_net":"0.0.0.0/0","bgp_next_hop":"","bgp_communities":[],"as_path":[],"mpls_ttl":[],"mpls_label":[],"mpls_ip":[],"observation_domain_id":0,"observation_point_id":0}
Then after ~<1 minute later, I start getting the error messages again, and no IPFIX data is logged (just errors)
WARN[0325] template error blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:60372 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0325] template error blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:60372 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0325] template error blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:60372 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
Then the pattern repeats:
@bmarlow my apologies for the delay this time :(
Thank you for all the detailed tests. I'll have a better look soon and try to reproduce and replay the pcap.
Given that we are seeing the same behavior for both transport mechanisms (stdout and kafka) this seems to point to something central in how goflow2 is storing/managing the templates after they are received.
AFAIK there are no expiration. Maybe a packet is corrupting the template
Did you also collect the pcap for Test 2?
There's one more test that could help:
-produce=raw
. This could potentially help isolating a bad packet.One thing I just noticed: in the logs, the source port is different... 60372
vs 37907
Couldn't reproduce with the pcap.
The code that computes the key to find the template set uses IP+Port:
https://github.com/netsampler/goflow2/blob/f363efffcd870f132ce157e72ec758f06b03739a/utils/udp.go#L29
If you have a way to test manually:
key := pkt.Src.Addr().String()
should make the key ignore the port.
Let me know if that works for you.
@bmarlow let me know if you're still experiencing the issue :)
@lspgn
Sorry for the delay in response, I've been wrapped up in other projects.
Unfortunately I haven't been able to recreate this.
I've tried replaying those packet captures, even against previous branches, and I can't recreate it.
I've since set up another cluster and had it send flow data against previous configs that were failing and it seems to work just fine now as well.
Possibly it was the previous OpenShift cluster version sending a bad packet as previously hypothesized.
I'm confused why I can't recreate it, but I guess I'm glad its working correctly now.
I'll re-open with appropriate data if it re-occurs. I appreciate the time spent on this.
Describe the bug When running the collector, it constantly gives the following error messages for a few minutes:
WARN[0954] template error blocking=false count=1 error="receiver: message from [::ffff:10.89.1.16]:38705 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0954] template error blocking=false count=1 error="receiver: message from [::ffff:10.89.1.16]:38705 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
Which I would expect, as in the beginning it hasn't gotten the template yet. After a few minutes when the template packets get sent I quit getting the error messages and everything decodes fine and the expected data gets sent off to kafka.
After about a minute, all of the sudden I start getting the same errors again. It will take about 5 minutes (the template interval) and the whole process will repeat itself.
To Reproduce Steps to reproduce the behavior:
podman run --name goflow2 -dt -p 8080:8080/tcp -p 2055:2055/udp --network shared -v ./templates/:/templates/:z localhost/goflow2:latest -transport "kafka" -transport.kafka.topic ipfix-topic -transport.kafka.brokers "x.x.x.x:9092" -err.cnt 10000 -err.int "1s" -format "json" -transport.kafka.flushbytes 10000 -templates.path "/templates" -transport.kafka.version "3.7.0"
Expected behavior Once the templates are learned they shoudn't be forgotten.
Captures Attached is wirehark of the behavior. It is worth noting that wireshark properly decodes, so after it has received the template packets it goes and applies the template to all of the captured packets.
Sampler device:
GoFlow2:
Additional context Not sure what else to say other than after aquiring the template it randomly loses it. The IPFIX source reliably sends the template packets every 5 minutes, then shortly thereafter goflow2 loses it.