Closed pentiumoverdrive closed 4 years ago
I checked our monitoring and the CPU was REALLY high a few hours after I installed the OMS agent.
Solution: Make sure the FQDN is set properly; otherwise scx fails to generate certificates and components start segfaulting.
@pentiumoverdrive thanks for sharing the solution! Can you tell me where to set the FQDN for the OMS agent?
On RHEL I ran hostnamectl set-hostname my.whole.fqdn and then made sure my.whole.fqdn is also listed in /etc/hosts, pointing to the machine itself.
You can experiment with hostnamectl --pretty set-hostname my.whole.fqdn as well, but I'm not sure if all of these are needed.
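For anyone who wants to check this before installing, here is a minimal read-only sketch. The helper names has_domain_part and hosts_line are made up for illustration, and 192.0.2.10 is only a placeholder documentation address, so substitute your machine's real address. It changes nothing on the system; it only reports whether the current hostname has a domain part and prints what an /etc/hosts entry could look like.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when NAME has a domain part,
# i.e. at least two non-empty dot-separated labels.
has_domain_part() {
    case "$1" in
        .*|*.|*..*) return 1 ;;  # leading, trailing, or doubled dot
        *.*)        return 0 ;;
        *)          return 1 ;;
    esac
}

# Hypothetical helper: print an /etc/hosts entry mapping ADDRESS to the
# FQDN plus its short name (the first label).
hosts_line() {
    printf '%s %s %s\n' "$1" "$2" "${2%%.*}"
}

# Read-only check of the current machine; nothing is modified here.
current="$(hostname -f 2>/dev/null || hostname 2>/dev/null || echo unknown)"
if has_domain_part "$current"; then
    echo "FQDN looks set: $current"
else
    echo "No domain part in hostname: $current"
fi

# Example entry for a placeholder address (assumption, not from this thread).
hosts_line 192.0.2.10 my.whole.fqdn
```

If the check reports no domain part, setting the hostname with hostnamectl and adding the printed line to /etc/hosts (with the real address) is the fix described above.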
The OMS agent is built on many components, and they seem to be only loosely coupled. If one crucial part fails during installation, the next component just continues.
@pentiumoverdrive got you, thanks for sharing the solution detail!
@pentiumoverdrive For the omiagent crash: if it happens on Azure, you can create a ticket there and ask them to file an ICM, attaching the core file and your OMS version. We will then analyze the issue that way. Since you have fixed the problem by setting the FQDN, I will close this issue. If you need us to investigate further, you can create a ticket on Azure.
I found that a long FQDN domain can also cause certificate generation to fail. After commenting out the 'search mydomainname' line in /etc/resolv.conf, the install generated the certificate successfully. cert failed:
Setting up scx (1.6.6.0) ...
Generating certificate with hostname="test-ubuntu-1", domainname="f5olhd15lwve1imbirbyz01abc.bx.internal.cloudapp.net"
Error generating SSL certificate. Use scxsslconfig to generate a new certificate, specifying host and domain names if necessary. The error was: 'Unable to add the domain name to the subject.'
Hostname or domain likely not RFC compliant, trying fallback: "localhost.local"
Generating certificate with hostname="localhost", domainname="local"
cert pass:
Setting up scx (1.6.6.0) ...
Generating certificate with hostname="test-ubuntu-1"
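One possible explanation for the failing case, which nobody in this thread has confirmed, is the 64-character upper bound that RFC 5280 places on the X.509 commonName field. The hostname and domain from the failing log add up to 65 characters once joined with a dot. The sketch below only does that length check; the helper names fqdn_length and fits_common_name are hypothetical, and whether scx actually trips on this limit is an assumption.

```shell
#!/bin/sh
# Assumed limit: RFC 5280 caps commonName at 64 characters (ub-common-name).
# Whether scx hits exactly this limit is not confirmed in the thread.
CN_MAX=64

# Hypothetical helper: print the length of "host.domain".
fqdn_length() {
    name="$1.$2"
    echo "${#name}"
}

# Hypothetical helper: succeed when "host.domain" fits within CN_MAX.
fits_common_name() {
    [ "$(fqdn_length "$1" "$2")" -le "$CN_MAX" ]
}

# Values copied from the failing log above.
host="test-ubuntu-1"
domain="f5olhd15lwve1imbirbyz01abc.bx.internal.cloudapp.net"
if fits_common_name "$host" "$domain"; then
    echo "$host.$domain fits ($(fqdn_length "$host" "$domain") chars)"
else
    echo "$host.$domain is too long ($(fqdn_length "$host" "$domain") chars, limit $CN_MAX)"
fi
```

This would also be consistent with the install passing once the search domain was removed, since the certificate is then generated from the short hostname alone.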
I installed omsagent-1.12.15-0.universal.x64.sh on lots of servers recently. However, on one important production system the service keeps segfaulting and filling the disk with core dumps.
The error messages read like this:
Unfortunately I had no time to save any more logs before recovering the system.
It managed to report in to my update management but never got assessed and then it stopped sending heartbeats eventually.
Any clues? RHEL 7.4 3.10.0-1062.12.1.el7.x86_64 #1 SMP Thu Dec 12 06:44:49 EST 2019 x86_64 x86_64 x86_64 GNU/Linux