microsoft / omi

Open Management Infrastructure

omiagent segfault libSCXCoreProviderModule.so and eats all disk with coredumps #668

Closed pentiumoverdrive closed 4 years ago

pentiumoverdrive commented 4 years ago

I recently installed omsagent-1.12.15-0.universal.x64.sh on many servers. On one important production system, however, the service keeps segfaulting and filling the disk with core dumps.

grep libSCXCoreProviderModule.so /var/log/messages | grep segfault | wc -l
3515

The error messages read like this:

omiagent[23465]: segfault at 0 ip 00007fd51c85e0ff sp 00007ffe051a14b0 error 4 in libSCXCoreProviderModule.so[7fd51c6a4000+342000]
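
For anyone reading a line like the one above, here is a minimal Python sketch that pulls the fields apart. It assumes the usual x86 Linux kernel format for segfault messages; the function name `decode_segfault` and the returned keys are hypothetical, not part of OMI or SCX.

```python
import re

# Kernel "segfault at ..." line as it appears in /var/log/messages on x86.
# The field layout is assumed from the message quoted in this issue.
SEGFAULT_RE = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)  # the kernel prints the page-fault error code in hex
    return {
        "process": m["proc"],
        "pid": int(m["pid"]),
        "fault_address": int(m["addr"], 16),
        "library": m["lib"],
        # Offset of the faulting instruction inside the mapped library.
        "offset": int(m["ip"], 16) - int(m["base"], 16),
        "page_present": bool(err & 1),               # bit 0: protection fault vs. missing page
        "access": "write" if err & 2 else "read",    # bit 1: write vs. read
        "user_mode": bool(err & 4),                  # bit 2: fault raised in user mode
    }
```

For the message in this issue, the sketch reports a user-mode read of address 0 (a likely null pointer dereference) at offset 0x1ba0ff inside libSCXCoreProviderModule.so. With a build that has debugging symbols, that offset could be passed to a tool such as addr2line to locate the failing code.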

Unfortunately I had no time to save any more logs before recovering the system.

The agent managed to report in to my update management but was never assessed, and eventually it stopped sending heartbeats.

Any clues? The system is RHEL 7.4:

3.10.0-1062.12.1.el7.x86_64 #1 SMP Thu Dec 12 06:44:49 EST 2019 x86_64 x86_64 x86_64 GNU/Linux

pentiumoverdrive commented 4 years ago

I checked our monitoring and the CPU was REALLY high a few hours after I installed the OMS agent.

pentiumoverdrive commented 4 years ago

Solution: Make sure the FQDN is set properly; otherwise scx fails to generate certificates and components start segfaulting.

JumpingYang001 commented 4 years ago

@pentiumoverdrive thanks for sharing the solution! Can you tell me where to set the FQDN for the OMS agent?

pentiumoverdrive commented 4 years ago

On RHEL I ran hostnamectl set-hostname my.whole.fqdn and then made sure that my.whole.fqdn is also listed in /etc/hosts, pointing to the machine itself.

You can experiment with hostnamectl --pretty set-hostname my.whole.fqdn as well, but I'm not sure if all of these are needed.
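
To confirm the host actually reports a fully qualified name before installing the agent, a quick check along these lines may help. This is only a sketch: the helper names `looks_fully_qualified` and `check_local_fqdn` are made up for illustration, and the dot-and-not-localhost test is a heuristic, not the check that scx itself performs.

```python
import socket


def looks_fully_qualified(name):
    """Heuristic: two or more non-empty labels, and not a localhost placeholder."""
    labels = name.rstrip(".").split(".")
    if len(labels) < 2 or any(not label for label in labels):
        return False
    return labels[0].lower() != "localhost"


def check_local_fqdn():
    """Return (short hostname, resolved FQDN, whether the FQDN looks usable)."""
    hostname = socket.gethostname()
    fqdn = socket.getfqdn()  # resolved via /etc/hosts or DNS, depending on the system
    return hostname, fqdn, looks_fully_qualified(fqdn)
```

Calling `check_local_fqdn()` on the affected machine should return True as its third element once hostnamectl and /etc/hosts are set up as described; a result such as localhost or a bare short name suggests the FQDN is still not resolvable.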

The OMS agent is built on many components, and they seem to be only loosely coupled. If one crucial part fails during installation, the next component just carries on.

JumpingYang001 commented 4 years ago

@pentiumoverdrive got you, thanks for sharing the solution detail!

JumpingYang001 commented 4 years ago

@pentiumoverdrive for an omiagent crash that happens on Azure, you can create a ticket there and ask them to file an ICM, attaching the core file and your OMS version to the ticket; we will then analyze the issue that way. Since you have fixed the issue by setting the FQDN, I will close this issue. If you need us to investigate further, you can create a ticket on Azure.

JumpingYang001 commented 3 years ago

I found that a long FQDN domain can also cause certificate generation to fail. After I commented out the 'search mydomainname' line in /etc/resolv.conf, the install generated the certificate successfully. Certificate generation failed:

Setting up scx (1.6.6.0) ...
Generating certificate with hostname="test-ubuntu-1", domainname="f5olhd15lwve1imbirbyz01abc.bx.internal.cloudapp.net"
Error generating SSL certificate.  Use scxsslconfig to generate a new certificate, specifying host and domain names if necessary.  The error was: 'Unable to add the domain name to the subject.'
Hostname or domain likely not RFC compliant, trying fallback: "localhost.local"
Generating certificate with hostname="localhost", domainname="local"

Certificate generation passed:

Setting up scx (1.6.6.0) ...
Generating certificate with hostname="test-ubuntu-1"
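
One possible explanation for the failure, not confirmed anywhere in this thread, is a length limit: RFC 5280 caps an X.509 common name at 64 characters, and a DNS label at most 63 characters long is allowed by RFC 1035. The sketch below checks a hostname and domain against those two limits only. The function name `certificate_name_problems` is hypothetical, and whether scxsslconfig fails for exactly this reason is an assumption.

```python
UB_COMMON_NAME = 64  # upper bound for an X.509 common name (RFC 5280, ub-common-name)
MAX_DNS_LABEL = 63   # maximum length of a single DNS label (RFC 1035)


def certificate_name_problems(hostname, domainname=""):
    """List length problems that might prevent using the FQDN in a certificate subject."""
    fqdn = f"{hostname}.{domainname}" if domainname else hostname
    problems = []
    if len(fqdn) > UB_COMMON_NAME:
        problems.append(
            f"FQDN is {len(fqdn)} characters, over the {UB_COMMON_NAME}-character "
            "common name limit"
        )
    for label in fqdn.split("."):
        if not label:
            problems.append("FQDN contains an empty label")
        elif len(label) > MAX_DNS_LABEL:
            problems.append(f"label '{label}' is longer than {MAX_DNS_LABEL} characters")
    return problems
```

For the names printed in the failing log above, the combined FQDN comes to 65 characters, one over the common name limit, while the hostname on its own (as in the passing log) is well within it. That is consistent with the observation, but it remains a guess about the cause.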