sonic-net / SONiC

Landing page for Software for Open Networking in the Cloud (SONiC) - https://sonic-net.github.io/SONiC/

Fault Management (Analysis and Handling) #1520

Open shyam77git opened 11 months ago

shyam77git commented 11 months ago

Basic Information (context): Any failure or error that impacts a system/chassis or one of its sub-systems is regarded as a fault. Faults are broadly classified into software (SW) and hardware (HW) faults.

Present State: In SONiC, a fault is represented as an Event or an Alarm. SONiC has the Event Framework HLD, which lets an event detector publish its events to the eventD redisDB. However, there is no Fault Manager/Handler that takes the needed or platform-specified action(s) to recover the system from a generated fault.

Need for this feature: This feature aims to add a generic Fault Management (FM) infrastructure that can do the following:

Benefits: The platform-supplied 'Fault-Action Policy table' has a holistic, system-level view of the platform (chassis/board/HWSKU) and can gauge the right action required to recover from a fault. It can either adopt the action recommended by the fault source/detector or override it with the system-level one.
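To make the override behaviour concrete, here is a minimal sketch of how an action could be resolved. It is not taken from the HLD: the `Fault` fields, the table keys, the action names, and `resolve_action` are all hypothetical, and it assumes that a matching platform policy entry wins over the detector's recommendation.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Fault:
    """Hypothetical fault record; field names are illustrative only."""
    source: str                               # e.g. "ASIC", "FAN"
    fault_type: str                           # e.g. "ecc_uncorrectable"
    recommended_action: Optional[str] = None  # suggested by the detector


# Hypothetical platform-supplied Fault-Action Policy table,
# keyed by (source, fault_type). Real tables would come from the platform.
POLICY_TABLE: Dict[Tuple[str, str], str] = {
    ("ASIC", "ecc_uncorrectable"): "reload_linecard",
    ("FAN", "rpm_low"): "raise_alarm",
}

DEFAULT_ACTION = "log_only"  # assumed fallback when nothing else applies


def resolve_action(fault: Fault,
                   policy_table: Dict[Tuple[str, str], str] = POLICY_TABLE) -> str:
    """Pick the action for a fault.

    Assumed precedence: platform policy entry, then the detector's
    recommended action, then the default.
    """
    platform_action = policy_table.get((fault.source, fault.fault_type))
    if platform_action is not None:
        return platform_action  # system-level override
    if fault.recommended_action is not None:
        return fault.recommended_action
    return DEFAULT_ACTION


if __name__ == "__main__":
    overridden = Fault("ASIC", "ecc_uncorrectable", recommended_action="restart_syncd")
    passthrough = Fault("PSU", "voltage_low", recommended_action="raise_alarm")
    print(resolve_action(overridden))
    print(resolve_action(passthrough))
```

In this sketch the ASIC fault takes the platform's action even though the detector recommended something else, while the PSU fault has no policy entry and so keeps the detector's recommendation.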

shyam77git commented 10 months ago

HLD (md) PR: https://github.com/sonic-net/SONiC/pull/1527

zhangyanzhao commented 8 months ago

Dell registered as the reviewer.

zhangyanzhao commented 8 months ago

Community review recording: https://zoom.us/rec/share/G4YPod_DoyMGGc8RG-A6jakAEOwR4INXe8pfG5IXrDKZS5ozbyghJyXASgEthkZq.8EmVrFbqX2t3qan3

zhangyanzhao commented 8 months ago

@venkatmahalingam can you please let me know the GitHub IDs of the other reviewers from Dell? Thanks.

venkatmahalingam commented 8 months ago

@bhaveshdell @rathnasabapathyv @prvattem