kentik / ktranslate

System for pulling and pushing network data.
Apache License 2.0

Reopening packet loss on ping_only tests from issues 732 and 736 #738

briansgill closed this issue 2 months ago

briansgill commented 3 months ago

This is a follow-up to issues 732 and 736 - https://github.com/kentik/ktranslate/issues/736

There was still packet loss recorded on the initial polling, but the bigger issue in the latest release was that some devices did not get polled for a lengthy period, as in the example below. The poller was started at 3pm, but this specific host seems to have been polled at the beginning and then not again until around 5:15pm, a gap of more than 2 hours. Other hosts showed the same behavior.

image

i3149 commented 3 months ago

Uh oh! Let me try with a new ping engine and see if that's any better.

i3149 commented 2 months ago

OK, if you're up for it, try one more time and let me know if this one is better. Latest fix building now.

briansgill commented 2 months ago

Thanks. Checking it out now via version kt-2024-08-27-10571748813

briansgill commented 2 months ago

@i3149 - Looks like the new version is working as expected and is not recording any false packet loss. The old version gave nearly static results all the time. This version seems to record variable response times on the pings, which is probably more reflective of real-world results. What do you think of the results?

image

i3149 commented 2 months ago

Nice! This looks much more like real life to me. Behind the scenes, we open-sourced Kentik's own ICMP tool. It works by sending Y packets for X seconds in its own thread and then reporting the result.

Before, we were trying to use https://github.com/prometheus-community/pro-bing, which keeps state across time ticks, and polling its stats every X seconds. The problems you discovered come mostly from that state being kept across time intervals, which can produce some weird results.
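
To make the difference concrete, here is a toy model in Python. It is only a sketch: it sends no real ICMP packets, it is not the API of pro-bing or of Kentik's tool, and every name in it (`summarize`, `poll_fresh`, `CumulativePinger`) is hypothetical. It assumes one plausible failure mode, namely that loss seen during startup stays in long-lived counters and keeps showing up in later polls, whereas a per-interval window reports only what happened in that interval.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# One probe result: round-trip time in ms, or None if no reply arrived.
Probe = Optional[float]


@dataclass
class WindowResult:
    sent: int
    received: int
    loss_pct: float
    avg_rtt_ms: Optional[float]


def summarize(probes: List[Probe]) -> WindowResult:
    """Summarize one self-contained window of probes."""
    sent = len(probes)
    rtts = [p for p in probes if p is not None]
    received = len(rtts)
    loss = 100.0 * (sent - received) / sent if sent else 0.0
    avg = sum(rtts) / received if received else None
    return WindowResult(sent, received, loss, avg)


def poll_fresh(send_probe: Callable[[], Probe], count: int) -> WindowResult:
    """Per-interval style: send `count` probes, report, keep no state."""
    return summarize([send_probe() for _ in range(count)])


class CumulativePinger:
    """Long-lived style (assumed model): counters persist across polls."""

    def __init__(self) -> None:
        self.sent = 0
        self.received = 0

    def record(self, probe: Probe) -> None:
        self.sent += 1
        if probe is not None:
            self.received += 1

    def loss_pct(self) -> float:
        if self.sent == 0:
            return 0.0
        return 100.0 * (self.sent - self.received) / self.sent


if __name__ == "__main__":
    # Interval 1 loses two probes at startup; interval 2 is clean.
    intervals = [[None, None, 10.0, 10.0], [10.0, 10.0, 10.0, 10.0]]

    cumulative = CumulativePinger()
    for probes in intervals:
        replies = iter(probes)
        fresh = poll_fresh(lambda: next(replies), len(probes))
        for p in probes:
            cumulative.record(p)
        print(f"fresh={fresh.loss_pct:.1f}% cumulative={cumulative.loss_pct():.1f}%")
```

In this model the fresh window reports 50% loss for the first interval and 0% for the second, while the cumulative counters still report 25% after the second interval even though every probe in it was answered.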