jenningsloy318 / redfish_exporter

Exporter to get metrics from Redfish-based hardware such as Lenovo/Dell/Supermicro servers
Apache License 2.0

Feature request: Log count and intrusion detection #16

Closed NosIreland closed 4 years ago

NosIreland commented 4 years ago

Would it be possible to add metrics for log entries and intrusion detection? Both of these change the system/chassis health to warning or critical, but at the moment there is no way to see what is causing the warning/critical state when an intrusion is detected or there are entries in the system logs.

Log entries: https://hostname/redfish/v1/Systems/1/LogServices/Log1/Entries

{
    "@odata.context": "/redfish/v1/$metadata#LogEntryCollection.LogEntryCollection",
    "@odata.type": "#LogEntryCollection.LogEntryCollection",
    "@odata.id": "/redfish/v1/Systems/1/LogServices/Log1/Entries",
    "Name": "Health Event Log Service Collection",
    "Description": "Collection of Health Event Logs",
    "Members@odata.count": 2,
    "Members": [
        {
            "@odata.id": "/redfish/v1/Systems/1/LogServices/Log1/Entries/1",
            "@odata.type": "#LogEntry.v1_3_0.LogEntry",
            "Id": "1",
            "Name": "Health Event Log Entry 1",
            "EntryType": "Event",
            "Severity": "Warning",
            "Created": "2020-07-07T10:21:02+00:00",
            "EntryCode": "Deassert",
            "SensorType": "Battery",
            "SensorNumber": 93,
            "Message": "BBU presence (StorageController0)",
            "MessageArgs": [
                "ArrayOfMessageArgs"
            ],
            "Links": {
                "Oem": {}
            },
            "Oem": {
                "Supermicro": {
                    "MarkAsAcknowledged": false,
                    "@odata.type": "#SmcLogEntryExtensions.v1_0_0.LogEntry",
                    "RawEventData": {
                        "EventDirAndType": "0xF0",
                        "SensorType": "0x29",
                        "EventData1": "0x02",
                        "EventData2": "0x00",
                        "EventData3": "0x00"
                    }
                }
            }
        },
        {
            "@odata.id": "/redfish/v1/Systems/1/LogServices/Log1/Entries/2",
            "@odata.type": "#LogEntry.v1_3_0.LogEntry",
            "Id": "2",
            "Name": "Health Event Log Entry 2",
            "EntryType": "Event",
            "Severity": "OK",
            "Created": "2020-07-07T10:21:29+00:00",
            "EntryCode": "Assert",
            "SensorType": "Battery",
            "SensorNumber": 93,
            "Message": "BBU presence (StorageController0)",
            "MessageArgs": [
                "ArrayOfMessageArgs"
            ],
            "Links": {
                "Oem": {}
            },
            "Oem": {
                "Supermicro": {
                    "@odata.type": "#SmcLogEntryExtensions.v1_0_0.LogEntry",
                    "RawEventData": {
                        "EventDirAndType": "0x70",
                        "SensorType": "0x29",
                        "EventData1": "0x02",
                        "EventData2": "0x00",
                        "EventData3": "0x00"
                    }
                }
            }
        }
    ]
}

Intrusion: https://hostname/redfish/v1/Chassis/1

{
    "@odata.context": "/redfish/v1/$metadata#Chassis.Chassis",
    "@odata.type": "#Chassis.v1_4_0.Chassis",
    "@odata.id": "/redfish/v1/Chassis/1",
    "Id": "1",
    "Name": "Computer System Chassis",
    "ChassisType": "RackMount",
    "Manufacturer": "Supermicro",
    "Model": "X11SPW-TF",
    "SKU": "",
    "SerialNumber": "XXXXXXXX",
    "PartNumber": "CSE-116TS-R504WBP",
    "AssetTag": "",
    "IndicatorLED": "Off",
    "Status": {
        "State": "Enabled",
        "Health": "Critical",
        "HealthRollup": "Critical"
    },
    "PhysicalSecurity": {
        "IntrusionSensorNumber": 170,
        "IntrusionSensor": "HardwareIntrusion",
        "IntrusionSensorReArm": "Manual"
    },
    "Power": {
        "@odata.id": "/redfish/v1/Chassis/1/Power"
    },
    "Thermal": {
        "@odata.id": "/redfish/v1/Chassis/1/Thermal"
    },
    "Links": {
        "ComputerSystems": [
            {
                "@odata.id": "/redfish/v1/Systems/1"
            }
        ],
        "PCIeDevices": [
            {
                "@odata.id": "/redfish/v1/Systems/1/PCIeDevices/NIC1"
            }
        ],
        "ManagedBy": [
            {
                "@odata.id": "/redfish/v1/Managers/1"
            }
        ]
    },
    "Oem": {
        "Supermicro": {
            "@odata.type": "#SmcChassisExtensions.v1_0_0.Chassis",
            "BoardSerialNumber": "XXXXXX",
            "GUID": "34313031-4D53-3CEC-EF06-B1D500000000",
            "BoardID": "0x953"
        }
    }
}

jenningsloy318 commented 4 years ago

At the very beginning, I also wondered whether this needed to be implemented, but finally decided against it, for two reasons:

  1. The plug/unplug action triggers this, but it is actually an event, not a metric.
  2. The log contains too many arbitrary attributes; it is not easy to filter them into a common pattern, which is essential for a monitoring metric set.

NosIreland commented 4 years ago

Thanks for the info, here is my take:

  1. The intrusion alert is a sensor, and it goes red when triggered, the same way as a dead DIMM, fan, or PSU. So I assume there would be 2 states.
  2. For the log, it would be enough just to have a total count, no need to filter: "Members@odata.count": 2
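
The total-count idea in point 2 can be sketched as follows. This is only an illustration in Python, not the exporter's code: the metric name `redfish_system_log_entry_count`, the label names, and the helper functions are assumptions made up for this example, and the sample payload is a trimmed copy of the collection quoted above.

```python
import json

def log_entry_count(collection: dict) -> int:
    """Return the number of entries in a Redfish LogEntryCollection payload."""
    # Prefer the count reported by the service; fall back to counting Members.
    count = collection.get("Members@odata.count")
    if count is None:
        count = len(collection.get("Members", []))
    return int(count)

def gauge_line(name: str, labels: dict, value: float) -> str:
    """Render one sample in the Prometheus text exposition format."""
    label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_text}}} {value}"

# Trimmed copy of the collection quoted earlier in this issue.
sample_logs = json.loads("""
{
    "@odata.id": "/redfish/v1/Systems/1/LogServices/Log1/Entries",
    "Members@odata.count": 2,
    "Members": [
        {"Id": "1", "Severity": "Warning"},
        {"Id": "2", "Severity": "OK"}
    ]
}
""")

line = gauge_line(
    "redfish_system_log_entry_count",  # hypothetical metric name
    {"log_service": "Log1", "system_id": "1"},  # hypothetical label names
    log_entry_count(sample_logs),
)
print(line)
```

Reading `Members@odata.count` avoids fetching or parsing individual entries, which is why no filtering is needed for a plain count.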

jenningsloy318 commented 4 years ago

I checked the gofish code, and indeed there is a struct that holds the PhysicalSecurity data, so at this point I can add this metric. Meanwhile, only one metric is possible: check whether IntrusionSensorReArm is Manual or Automatic, and treat IntrusionSensor and IntrusionSensorNumber as the labels.
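
A rough sketch of the metric shape described above, written in Python for brevity rather than in the exporter's Go. The numeric encoding of IntrusionSensorReArm (Manual = 1, Automatic = 2, anything else = 0), the label names, and the function name are assumptions for illustration only; the code the exporter actually ships may differ.

```python
from typing import Optional, Tuple

# Hypothetical encoding of the re-arm mode as a metric value.
REARM_VALUES = {"Manual": 1, "Automatic": 2}

def physical_security_sample(chassis: dict) -> Optional[Tuple[dict, int]]:
    """Build (labels, value) from the PhysicalSecurity block of a Chassis payload."""
    security = chassis.get("PhysicalSecurity")
    if not security:
        # Not every chassis reports an intrusion sensor.
        return None
    # Value: the re-arm mode; 0 marks a missing or unrecognized mode.
    value = REARM_VALUES.get(security.get("IntrusionSensorReArm"), 0)
    # Labels: the sensor state and sensor number, as described in the comment.
    labels = {
        "chassis_id": str(chassis.get("Id", "")),
        "intrusion_sensor": str(security.get("IntrusionSensor", "")),
        "intrusion_sensor_number": str(security.get("IntrusionSensorNumber", "")),
    }
    return labels, value

# Trimmed copy of the Chassis payload quoted earlier in this issue.
sample_chassis = {
    "Id": "1",
    "PhysicalSecurity": {
        "IntrusionSensorNumber": 170,
        "IntrusionSensor": "HardwareIntrusion",
        "IntrusionSensorReArm": "Manual",
    },
}

labels, value = physical_security_sample(sample_chassis)
print(labels, value)
```

Because IntrusionSensor is carried as a label here, an alert rule would match on the label value (for example `HardwareIntrusion`) rather than on the sample value.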

For log metrics, I need to give this more consideration. At a minimum, the metrics would collect the current entry counts, grouped by severity (warning or critical). But there is a tricky point here: log entries are not cleared automatically, so this metric will always have some value. Also, the log entry timestamp is unrelated to the metric timestamp, so it is not easy to define rules that determine the health state. So I think it is not practical here.
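
For reference, grouping the current entries by severity could look like the sketch below (Python, illustrative only). It assumes the collection returns expanded members that include a `Severity` field, as in the payload quoted in this issue; some services may return only links, in which case each entry would have to be fetched first. The function name is hypothetical.

```python
from collections import Counter

def count_by_severity(collection: dict) -> dict:
    """Count log entries per Severity value in a LogEntryCollection payload."""
    members = collection.get("Members", [])
    # Entries without a Severity field are grouped under "Unknown".
    counts = Counter(entry.get("Severity", "Unknown") for entry in members)
    return dict(counts)

# Trimmed copy of the two entries quoted earlier in this issue.
sample_entries = {
    "Members@odata.count": 2,
    "Members": [
        {"Id": "1", "Severity": "Warning", "Created": "2020-07-07T10:21:02+00:00"},
        {"Id": "2", "Severity": "OK", "Created": "2020-07-07T10:21:29+00:00"},
    ],
}

severity_counts = count_by_severity(sample_entries)
print(severity_counts)
```

The sketch also shows the limitation raised above: the counts stay at the same values until someone clears the log, regardless of when the entries were created.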

jenningsloy318 commented 4 years ago

@NosIreland I updated this exporter and implemented the physical security part; you can grab the source code and test it now.

jenningsloy318 commented 4 years ago

No update on this issue, so I am closing it.