influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

[inputs.snmp] does not respect new network interfaces #15771

Open llamafilm opened 2 months ago

llamafilm commented 2 months ago

Relevant telegraf.conf

[agent]
  debug = false
  snmp_translator = "gosmi"
  interval = "5s"
  flush_interval = "5s"
  flush_jitter = "0s"

[[outputs.file]]
  files = ["stdout"]
  data_format = "influx"

[[inputs.snmp]]
  agents = ["10.37.155.80"]
  timeout = "1s"
  retries = 0
  [[inputs.snmp.field]]
    oid = "1.3.6.1.2.1.1.3.0"
    name = "sysUpTime"

[[inputs.ping]]
  urls = ["10.37.155.80"]
  method = "exec"
  count = 1
  timeout = 1.0
  fieldinclude = ["result_code", "maximum_response_ms"]

System info

Telegraf 1.30.3, Ubuntu 22.04

Steps to reproduce

  1. Launch a new EC2 instance with limited network access
  2. Run telegraf. Observe that both inputs (ping and snmp) are failing
  3. Attach a privileged ENI to the instance and make it the default route
  4. Observe that ping is now working, but SNMP is not
  5. Restart telegraf process and observe that both inputs work

Expected behavior

All input plugins should behave in the same way, using the new default source IP according to the system route table.

Actual behavior

SNMP input plugin continues using the old source IP until the telegraf process is restarted.

Additional info

Context

I run Telegraf in Amazon EC2. The instance launches with a default network interface that is not permitted to reach most targets through the firewall. Soon after boot, a system daemon attaches an ENI (elastic network interface) in the same subnet; the ENI has a static IP that is allowed through the firewall.

The Problem

After adding a new ENI and making it the default route, Telegraf's SNMP input plugin continues sending packets from the old source IP.
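
One way to check which source address the kernel would pick for a new connection at this moment is to connect a throwaway UDP socket and read back its local address. The following is a minimal Python sketch, not Telegraf code (Telegraf is written in Go); `source_ip_for` is a hypothetical helper name and the default port 161 is only there because SNMP uses it.

```python
import socket


def source_ip_for(destination, port=161):
    """Return the local IP the kernel would currently use to reach destination."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
        # connect() on a UDP socket sends no packet. It only asks the kernel
        # to look up a route and fix a matching local address on the socket.
        probe.connect((destination, port))
        return probe.getsockname()[0]
```

Calling `source_ip_for("10.37.155.80")` on the instance would be expected to return the eth0 address before the ENI is attached and the eth1 address afterwards. A long-lived connected socket, by contrast, keeps the local address it was given at connect time, which is consistent with the tcpdump output further down.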

Details

Route table before adding ENI:

$ ip route
default via 192.168.24.1 dev eth0 proto dhcp src 192.168.24.49 metric 100 
default via 192.168.24.1 dev eth0 metric 999 
192.168.24.0/26 dev eth0 proto kernel scope link src 192.168.24.49 metric 100 
192.168.24.1 dev eth0 proto dhcp scope link src 192.168.24.49 metric 100 
192.168.24.2 dev eth0 proto dhcp scope link src 192.168.24.49 metric 100 

Route table after adding ENI:

$ ip route
default via 192.168.24.1 dev eth1 metric 1 
default via 192.168.24.1 dev eth0 proto dhcp src 192.168.24.49 metric 100 
default via 192.168.24.1 dev eth1 proto dhcp src 192.168.24.51 metric 200 
default via 192.168.24.1 dev eth0 metric 999 
192.168.24.0/26 dev eth0 proto kernel scope link src 192.168.24.49 metric 100 
192.168.24.0/26 dev eth1 proto kernel scope link src 192.168.24.51 metric 200 
192.168.24.1 dev eth0 proto dhcp scope link src 192.168.24.49 metric 100 
192.168.24.1 dev eth1 proto dhcp scope link src 192.168.24.51 metric 200 
192.168.24.2 dev eth0 proto dhcp scope link src 192.168.24.49 metric 100 
192.168.24.2 dev eth1 proto dhcp scope link src 192.168.24.51 metric 200 
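
Among routes to the same prefix, Linux prefers the one with the lowest metric, so in the table above the `metric 1` default route via eth1 takes over. A small Python sketch of that selection follows; `preferred_default_route` is a hypothetical helper, and it assumes every token after the prefix comes as a key/value pair, which holds for the output shown here but not for every possible `ip route` line.

```python
ROUTES_AFTER = """\
default via 192.168.24.1 dev eth1 metric 1
default via 192.168.24.1 dev eth0 proto dhcp src 192.168.24.49 metric 100
default via 192.168.24.1 dev eth1 proto dhcp src 192.168.24.51 metric 200
default via 192.168.24.1 dev eth0 metric 999
"""


def preferred_default_route(ip_route_output):
    """Return (device, metric) of the default route with the lowest metric."""
    best = None
    for line in ip_route_output.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] != "default":
            continue
        # Assumption: the remaining tokens are key/value pairs (via X, dev Y, ...).
        fields = dict(zip(tokens[1::2], tokens[2::2]))
        metric = int(fields.get("metric", 0))
        if best is None or metric < best[1]:
            best = (fields.get("dev"), metric)
    return best


print(preferred_default_route(ROUTES_AFTER))  # ('eth1', 1)
```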

Telegraf log before adding the ENI:

2024-08-25T19:39:46Z E! [inputs.snmp] Error in plugin: agent 10.37.155.80: performing get on field sysUpTime: request timeout (after 0 retries)
2024-08-25T19:39:50Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 5s

Telegraf log after adding the ENI:

ping,host=aragorn-debug-i-0a885b5e0d5edd0a3,url=10.37.155.80 maximum_response_ms=33.831,result_code=0i 1724615055000000000
2024-08-25T19:44:21Z E! [inputs.snmp] Error in plugin: agent 10.37.155.80: performing get on field sysUpTime: request timeout (after 0 retries)

Telegraf log after restarting the process:

ping,host=aragorn-debug-i-0a885b5e0d5edd0a3,url=10.37.155.80 maximum_response_ms=33.721,result_code=0i 1724615265000000000
snmp,agent_host=10.37.155.80,host=aragorn-debug-i-0a885b5e0d5edd0a3 sysUpTime=682354i 1724615265000000000

tcpdump output:

# before adding ENI
19:54:50.001367 eth0  Out IP 192.168.24.49.33705 > 10.37.155.80.161:  GetRequest(28)  .1.3.6.1.2.1.1.3.0
19:54:50.103908 eth0  Out IP 192.168.24.49 > 10.37.155.80: ICMP echo request, id 291, seq 6, length 24
# after adding ENI
19:55:40.003930 eth1  Out IP 192.168.24.49.33705 > 10.37.155.80.161:  GetRequest(28)  .1.3.6.1.2.1.1.3.0
19:55:40.005212 eth1  Out IP 192.168.24.51 > 10.37.155.80: ICMP echo request, id 300, seq 1, length 24
19:55:40.038922 eth1  In  IP 10.37.155.80 > 192.168.24.51: ICMP echo reply, id 300, seq 1, length 24
# after restarting telegraf
19:55:45.000496 eth1  Out IP 192.168.24.51.63628 > 10.37.155.80.161:  GetRequest(28)  .1.3.6.1.2.1.1.3.0
19:55:45.004776 eth1  Out IP 192.168.24.51 > 10.37.155.80: ICMP echo request, id 301, seq 1, length 24
19:55:45.038505 eth1  In  IP 10.37.155.80 > 192.168.24.51: ICMP echo reply, id 301, seq 1, length 24
19:55:45.069883 eth1  In  IP 10.37.155.80.161 > 192.168.24.51.63628:  GetResponse(31)  .1.3.6.1.2.1.1.3.0=730354

Example commands to detach and re-attach ENI for debugging:

INSTANCE_ID=xxx
ENI_ID=xxx
ATTACHMENT_ID=$(aws --region us-west-2 --output text ec2 describe-network-interfaces --network-interface-ids $ENI_ID --query 'NetworkInterfaces[0].Attachment.AttachmentId')
aws --region us-west-2 ec2 detach-network-interface --attachment-id $ATTACHMENT_ID
aws --region us-west-2 ec2 attach-network-interface --device-index 1 --network-interface-id $ENI_ID --instance-id $INSTANCE_ID

srebhan commented 2 months ago

Next steps: check for errors in the SNMP plugin and reconnect on timeout (or on network errors in general).
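
The reconnect-on-error idea can be sketched roughly as follows. This is not Telegraf or gosnmp code, only a Python illustration; the `ReconnectingUDPClient` class and its attributes are hypothetical. It assumes the client uses a connected UDP socket, which would explain why the source address stays fixed until the socket is recreated.

```python
import socket


class ReconnectingUDPClient:
    """Connected UDP client that discards its socket after any network error."""

    def __init__(self, host, port, timeout=1.0):
        self.addr = (host, port)
        self.timeout = timeout
        self.sock = None
        self.reconnects = 0

    def _connect(self):
        # connect() fixes the local (source) address from the route table as
        # it is right now; later route changes do not affect this socket.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(self.timeout)
        sock.connect(self.addr)
        self.sock = sock

    def local_address(self):
        return self.sock.getsockname() if self.sock else None

    def request(self, payload):
        if self.sock is None:
            self._connect()
        try:
            self.sock.send(payload)
            return self.sock.recv(65535)
        except OSError:
            # Covers timeouts and other network errors. Dropping the socket
            # makes the next request pick a fresh source address.
            self.sock.close()
            self.sock = None
            self.reconnects += 1
            raise
```

The error is still raised so the failed collection is reported; only the following request pays the cost of a new socket. Reconnecting on every timeout may be more aggressive than desired for agents that are simply slow, so a real fix might limit it to repeated failures.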

llamafilm commented 2 months ago

@srebhan are you asking me to do something? Sorry, I didn't understand your comment.

srebhan commented 2 months ago

@llamafilm happy to see a PR from your side, but this was more a note to myself as I couldn't work on it immediately. ;-)

llamafilm commented 2 weeks ago

FWIW, this issue is still the same in 1.32.1.