netenglabs / suzieq

Using network observability to operate and design healthier networks
https://www.stardustsystems.net/
Apache License 2.0

[Bug]: Cumulus: after the first round the device service doesn't write the data for three consecutive rounds #844

Open claudiolor opened 1 year ago

claudiolor commented 1 year ago

Suzieq version

0.20.0rc2

Install Type

container

Python version

3.8

Impacted component

sq-poller

Steps to Reproduce

Run the poller against a Cumulus device on which the net show system command doesn't work (e.g. 4.1.1).

Expected Behavior

The device is polled correctly and the retrieved data is written to the dataset.

Observed Behavior

With PR #740 we introduced support for Cumulus devices where the net show system command doesn't work, by adding 2 additional commands to the service definition. After the first round of polling, the following message is written to the log during each of the next three rounds, and the data retrieved by the device service is not written, even when there are changes:

2023-01-19 10:39:32,730 - suzieq.poller.worker.service - INFO - device ns.server101 node failure hysteresis skipping data commit

After that, the data is written correctly. This happens because one command in the group always fails, so the service considers the whole group of commands as failed. Once the hysteresis has elapsed, everything works correctly.
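
To make the symptom easier to reason about, here is a minimal sketch of the behaviour described above. It is a toy model, not SuzieQ's actual service code: the class `DeviceServiceModel`, the function `simulate`, the constant `HYSTERESIS_ROUNDS`, and the rule that the first round is always committed are all assumptions chosen only to reproduce the observed pattern (one written round, three skipped rounds, then normal writes).

```python
from dataclasses import dataclass

# Assumption: three rounds, matching the three skipped rounds reported above.
HYSTERESIS_ROUNDS = 3


@dataclass
class DeviceServiceModel:
    """Toy model of the reported behaviour, not SuzieQ's real service class."""

    rounds_seen: int = 0
    skipped: int = 0

    def poll_round(self, command_results):
        """Return True if this round's data would be committed."""
        self.rounds_seen += 1
        # A single failing command marks the whole command group as failed.
        group_failed = not all(command_results)
        if self.rounds_seen == 1:
            return True  # the first round is written, as observed
        if group_failed and self.skipped < HYSTERESIS_ROUNDS:
            self.skipped += 1
            return False  # "node failure hysteresis skipping data commit"
        return True


def simulate(rounds, command_results):
    """Run the model for a number of rounds with the same command outcomes."""
    model = DeviceServiceModel()
    return [model.poll_round(command_results) for _ in range(rounds)]


# Hypothetical device: the first command always fails, the two additional
# commands succeed.
history = simulate(6, [False, True, True])
print(history)  # [True, False, False, False, True, True]
```

In this model a device whose commands all succeed is committed on every round, so the skipped rounds come only from the command that always fails.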