kytos-ng / pathfinder

Kytos main path finder Network Application (NApp)
https://kytos-ng.github.io/api/pathfinder.html
MIT License

check for inconsistent topology graph in pathfinder #62

Open italovalcy opened 9 months ago

italovalcy commented 9 months ago

Hi,

We are facing a situation where pathfinder is (again) inconsistent with the topology: links and interfaces that should be in pathfinder's graph are missing.

I wonder if there is any mechanism to validate the consistency of pathfinder's graph against the topology, instead of relying solely on the topology updated events (and dealing with race conditions). It would be nice if we could query pathfinder's graph through the API, so that we could use external tools to compare it with the topology's graph.

It would also be nice if we could: 1) find and fix other possible race conditions; 2) implement a consistency check in pathfinder that queries the full topology and checks for differences between the two graphs.
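To make the proposed comparison concrete, here is a minimal sketch of what an external tool could do once both link lists are available. It assumes each graph has already been exported as a list of (endpoint, endpoint) interface ID pairs; the function names, the `GraphDiff` type, and the sample interface IDs are hypothetical and are not part of pathfinder's or topology's API.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class GraphDiff(NamedTuple):
    missing: Set[frozenset]  # links in topology but absent from pathfinder
    stale: Set[frozenset]    # links in pathfinder but absent from topology


def normalize(links: Iterable[Tuple[str, str]]) -> Set[frozenset]:
    """Store each link as an unordered pair so (a, b) equals (b, a)."""
    return {frozenset(pair) for pair in links}


def diff_graphs(topology_links, pathfinder_links) -> GraphDiff:
    """Compare two link lists and report what each side lacks."""
    expected = normalize(topology_links)
    actual = normalize(pathfinder_links)
    return GraphDiff(missing=expected - actual, stale=actual - expected)


# Hypothetical sample data: topology knows two links, pathfinder only one,
# and pathfinder lists its endpoints in the opposite order.
topology = [("00:01:1", "00:02:1"), ("00:02:2", "00:03:1")]
pathfinder = [("00:02:1", "00:01:1")]
diff = diff_graphs(topology, pathfinder)
```

With this sample data, `diff.missing` holds the one link pathfinder lacks and `diff.stale` is empty. Treating links as unordered pairs avoids false positives when the two sources report endpoints in a different order.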

Unfortunately, I don't have a method to reproduce it right now. Still investigating what happened.

italovalcy commented 9 months ago

It looks like pathfinder became inconsistent not due to race conditions, but due to a disconnected switch that shouldn't be down. Anyway, having an API to get pathfinder's graph (to be able to run external tools to compare it with the topology graph) would be nice for detecting situations like this (although other strategies could also point to the root cause of this issue).

viniarck commented 9 months ago

Hi @italovalcy. Thanks for reporting a potential issue and looking into it.

> but due to a disconnected switch that shouldn't be down

A disconnected dpid won't have its status UP and consequently won't be part of the graph. We've been using this approach for some time, so only switches considered UP (and ready to process OpenFlow messages) are found in the graph. But if you found an inconsistent state in the topology, where a switch shouldn't be down, then that's probably the root cause.
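The effect of that rule can be illustrated with a small sketch. It is not pathfinder's actual code: the `build_graph` function and the dict/list data shapes are assumptions made for illustration only. It shows how a switch wrongly marked DOWN removes itself and all of its links from the graph.

```python
def build_graph(switches, links):
    """Build an adjacency map containing only switches whose status is UP.

    switches: hypothetical {dpid: status} mapping
    links: hypothetical list of (dpid_a, dpid_b) pairs
    """
    up = {dpid for dpid, status in switches.items() if status == "UP"}
    graph = {dpid: set() for dpid in up}
    for a, b in links:
        # A link is kept only if both of its endpoints are UP.
        if a in up and b in up:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# s2 is reported DOWN, so both of its links disappear from the graph.
switches = {"s1": "UP", "s2": "DOWN", "s3": "UP"}
links = [("s1", "s2"), ("s2", "s3"), ("s1", "s3")]
graph = build_graph(switches, links)
```

Here only the s1 to s3 link survives, which matches the symptom described in the report: links and interfaces missing from the graph because one switch is in an unexpected state.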

Regarding the consistency check, it might be an idea. Let's try to narrow down the root issue first, though. Let's try to add consistency checks only where the underlying data sources aren't the same (but should be equivalent), like switches' flow stats vs. stored flows, or mef_eline with sdntrace_cp tracing flows at a higher layer. Here, pathfinder shares the same underlying resource contents from topology and shouldn't have an inconsistency in the first place, unless a race condition were also happening (which doesn't seem to be the case). So a consistency check would just be a temporary fix for an existing issue, and the issue could arguably keep happening at non-deterministic intervals, although I empathize with the fact that it could temporarily remedy it too.