microsoft / SCXcore

System Center Cross Platform Provider for Operations Manager
Microsoft Public License

omiagent used 100% cpu with Azure Diagnostic Extension #92

Closed Lickkylee closed 9 months ago

Lickkylee commented 6 years ago

OS Version: CentOS 7.3 (3.10.0-514.26.2.el7.x86_64)
OMI: OMI-1.0.8-6
scx: scx-1.6.2-337

We enabled the diagnostic extension on 08/02 and noticed that, since 10/23, omiagent in our Azure VM has been using almost 100% of one CPU core. This started suddenly, without any change on our end. The issue has persisted for a long time and still affects us. Sometimes it goes away after we restart the waagent service, which restarts the omi service as well.

We consulted an omiagent engineer, who suggested opening an issue with the providers team, since high CPU usage is most often caused by the providers themselves. The diagnostic extension only calls SCX providers, so we are asking for help here.

Our troubleshooting:

  1. There are no logs in /var/opt/omi/log from the time the CPU usage rose.
  2. In the diagnostics logs, when the issue first occurred we saw errors like "Error: OMI EnumerateInstances failed". This error has continued for a long time and still occurs.
  3. The VM has Docker installed, but not the Docker provider.
  4. Screenshot of the high CPU usage: omihighcpu

Does anyone have a clue what is happening, or how to troubleshoot this?
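One way to narrow this down is to find out which thread inside omiagent is burning the CPU. The sketch below is not part of OMI or SCX; it only reads the standard Linux /proc/<pid>/task/<tid>/stat files twice and compares the CPU ticks. The function names are made up for illustration, and the PID has to be looked up separately (for example with pgrep omiagent).

```python
import os
import time


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/.../stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')' first.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the state (field 3 in proc(5)); utime and stime are
    # fields 14 and 15, i.e. indexes 11 and 12 here.
    return int(fields[11]) + int(fields[12])


def thread_ticks(pid):
    """Map thread ID -> CPU ticks used so far, for every thread of pid."""
    ticks = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                ticks[tid] = parse_cpu_ticks(f.read())
        except OSError:
            continue  # the thread exited while we were reading
    return ticks


def busiest_threads(pid, interval=5.0):
    """Return (tid, percent of one core) pairs, busiest thread first."""
    before = thread_ticks(pid)
    time.sleep(interval)
    after = thread_ticks(pid)
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    usage = {
        tid: (after[tid] - before[tid]) / hz / interval * 100
        for tid in after
        if tid in before  # skip threads created during the interval
    }
    return sorted(usage.items(), key=lambda kv: kv[1], reverse=True)
```

Calling busiest_threads(pid) with the omiagent PID should show whether a single thread accounts for the load. That thread ID can then be matched against the thread list in a debugger stack dump to see which provider call it is stuck in.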

svrnwnsch commented 5 years ago

We had the same incident. The log in /var/log/azure/Microsoft.Azure.Diagnostics.LinuxDiagnostic/extension.log was full of repeating entries:

2019/03/31 03:07:23 [Microsoft.Azure.Diagnostics.LinuxDiagnostic-3.0.119] Error in MDSD:teInstances failed
2019/03/31 03:07:23 2019-03-31T03:04:06.5832980Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:04:06.6914160Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:04:06.6916010Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:04:06.7089990Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:04:06.7090760Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:05:36.5924470Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:05:36.7084390Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:05:36.7086500Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:05:36.7174610Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:05:36.7176520Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:07:06.5961030Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:07:06.7192690Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:07:06.7195010Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:07:06.7278580Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:07:06.7279850Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 
2019/03/31 03:07:23 [Microsoft.Azure.Diagnostics.LinuxDiagnostic-3.0.119] Daemon,success,1,message in mdsd.err:2019-03-31 03:07:06:teInstances failed
2019/03/31 03:07:23 2019-03-31T03:04:06.5832980Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:04:06.6914160Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:04:06.6916010Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:04:06.7089990Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:04:06.7090760Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:05:36.5924470Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:05:36.7084390Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:05:36.7086500Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:05:36.7174610Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:05:36.7176520Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:07:06.5961030Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:07:06.7192690Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:07:06.7195010Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:07:06.7278580Z: Error: OMI EnumerateInstances failed
2019/03/31 03:07:23 2019-03-31T03:07:06.7279850Z: Error: OMI EnumerateInstances failed

Restarting the server fixed the errors and the high CPU usage.
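To see when the failures started and how often they recur, the extension log can be summarized per minute. The following is a rough sketch that assumes the line format shown in the excerpt above; the function name is made up. Because the extension prints the same mdsd.err lines more than once, entries are de-duplicated by their inner timestamp.

```python
import re
from collections import Counter

# Matches the inner mdsd timestamp, e.g.
# "2019-03-31T03:04:06.5832980Z: Error: OMI EnumerateInstances failed"
ERROR_RE = re.compile(
    r"(?P<stamp>(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\.\d+)Z: "
    r"Error: OMI EnumerateInstances failed"
)


def failures_per_minute(lines):
    """Count distinct EnumerateInstances failures per minute (UTC)."""
    seen = set()
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match is None:
            continue  # header lines and unrelated entries
        stamp = match.group("stamp")
        if stamp in seen:
            continue  # same error repeated in a later log block
        seen.add(stamp)
        counts[match.group("minute")] += 1
    return dict(counts)
```

It can be run over the file directly, for example failures_per_minute(open("/var/log/azure/Microsoft.Azure.Diagnostics.LinuxDiagnostic/extension.log")). In the excerpt above the failures arrive in bursts of five, 90 seconds apart, which looks like one failure per query at each polling interval.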

JumpingYang001 commented 9 months ago

The high CPU issue has been fixed in https://github.com/microsoft/pal/pull/117 and https://github.com/microsoft/pal/commit/6c0c108570ed3bb3850916677185f3f4134ca285.