Open shyam77git opened 11 months ago
HLD (md) PR: https://github.com/sonic-net/SONiC/pull/1527
Dell registered as the reviewer.
Community review recording: https://zoom.us/rec/share/G4YPod_DoyMGGc8RG-A6jakAEOwR4INXe8pfG5IXrDKZS5ozbyghJyXASgEthkZq.8EmVrFbqX2t3qan3
@venkatmahalingam could you please share the GitHub IDs of the other reviewers from Dell? Thanks.
@bhaveshdell @rathnasabapathyv @prvattem
Basic Information (context)
Any failure or error impacting a system/chassis or a subsystem is regarded as a fault. Faults are broadly classified into SW (Software) and HW (Hardware) faults:

Present State
In SONiC, a fault is represented as an Event or an Alarm. SONiC has an Event Framework HLD, which helps an event detector publish its events to the eventD redisDB. However, there is no Fault Manager/Handler that can take the needed, platform-specified action(s) to recover the system from the generated fault.

Need for this feature
This feature aims to add a generic FM (Fault Management) infrastructure that can do the following:

Benefits
The platform-supplied 'Fault-Action Policy table' has a holistic, system-level view of the platform (chassis/board/HWSKU) and can determine the right action required to recover from a fault. It can either go with the recommended action (provided by the fault source/detector) or override it with a system-level one.
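
The override behaviour described under Benefits could be sketched roughly as follows. This is only an illustration, not the design from the HLD: the `Fault` class, the `POLICY_TABLE` layout, the fault IDs, the action names, and `resolve_action` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Fault:
    """Hypothetical fault record as published by a fault source/detector."""
    source: str               # detector that raised the fault (assumed field)
    fault_id: str             # identifier of the fault type (assumed field)
    recommended_action: str   # action suggested by the detector


# Hypothetical platform-supplied Fault-Action Policy table.
# A string value is a system-level override; None means "no override,
# accept the detector's recommendation".
POLICY_TABLE: Dict[str, Optional[str]] = {
    "FAN_FAILURE": "shutdown",
    "SENSOR_READ_ERROR": None,
}


def resolve_action(fault: Fault,
                   policy: Dict[str, Optional[str]] = POLICY_TABLE) -> str:
    """Return the platform override if one exists, else the recommended action."""
    override = policy.get(fault.fault_id)
    return override if override is not None else fault.recommended_action


if __name__ == "__main__":
    fan = Fault("thermal-detector", "FAN_FAILURE", "log")
    sensor = Fault("thermal-detector", "SENSOR_READ_ERROR", "syslog")
    print(resolve_action(fan))     # platform override wins
    print(resolve_action(sensor))  # detector recommendation is kept
```

Faults that are missing from the table fall back to the detector's recommendation in this sketch; a real policy might instead define an explicit default action.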