This issue tracks a better solution than simply triggering a BMC panic when an OOM condition is detected.
Some ideas:
Use cgroups to classify applications into "kill this when you're low on memory" and "do not kill these" groups. That way, well-tested, restartable services such as fan control or the webUI can go first.
Add telemetry/monitoring for per-application memory usage, so we have data showing when an application is leaking memory.
Use the PSI (memory pressure) information from the kernel to drive the decisions above.
Use (and customize/hack) systemd-oomd to detect the low-memory condition before the kernel OOM killer fires, and handle everything in userspace.
Use phosphor-health-monitor to watch for OOM conditions and have it take a dump, log an error, and restart the offending service.
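As a rough illustration of the PSI idea, here is a minimal Python sketch that parses the kernel's memory pressure file and checks it against a threshold. It assumes the kernel exposes `/proc/pressure/memory` (PSI enabled); the function names and the 10% threshold are hypothetical and chosen only for this example, not taken from any existing OpenBMC component.

```python
from typing import Dict

# Kernel PSI file for memory; only present when PSI is enabled in the kernel.
PSI_MEMORY_PATH = "/proc/pressure/memory"


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI text such as
    'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'
    into {'some': {'avg10': 0.0, ...}, 'full': {...}}.
    """
    result: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        kind, pairs = fields[0], fields[1:]
        result[kind] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return result


def under_pressure(psi: Dict[str, Dict[str, float]],
                   threshold_pct: float = 10.0) -> bool:
    """Return True if the 10-second 'some' average meets the threshold.

    The threshold is a hypothetical value and would need tuning per platform.
    """
    return psi.get("some", {}).get("avg10", 0.0) >= threshold_pct


def read_memory_pressure(path: str = PSI_MEMORY_PATH) -> Dict[str, Dict[str, float]]:
    """Read and parse the current memory pressure from the kernel."""
    with open(path) as f:
        return parse_psi(f.read())
```

A monitoring service could call `under_pressure(read_memory_pressure())` periodically and, when it returns true, start with the services in the "kill this first" group rather than waiting for the kernel OOM killer.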