microsoft / omi

Open Management Infrastructure

omi 1.6.10 crashes while running Desired State Configuration PerformRequiredConfigurationChecks.py #727

Closed eehret closed 1 year ago

eehret commented 1 year ago

Same issue has been observed on RHEL 7.9, RHEL 8.6, and Ubuntu 18.04 LTS so far.

Azure DSC VM extension version: 3.7.0 (note that this also happens with other versions such as 3.0.0.7)
dsc package version: dsc-1.2.3
OMI version: 1.6.10

As a controlled test to rule out as many external factors as possible, I set up an environment using a 'vanilla' RHEL 8.6 VM from Azure Marketplace (Red Hat publisher) with a bare-bones DSC configuration. The configuration contains a single item that just ensures one package is present, and that package was already present.

If I downgrade OMI to 1.6.9, it works again.

Error output from /var/log/messages:

Aug 16 15:17:48 rhel-8-dsctest kernel: traps: omiagent[8605] general protection fault ip:46cdfe sp:7ffc445ddc50 error:0 in omiagent[400000+b8000]
Aug 16 15:17:48 rhel-8-dsctest systemd[1]: Started Process Core Dump (PID 8701/UID 0).
Aug 16 15:17:48 rhel-8-dsctest systemd-coredump[8702]: Process 8605 (omiagent) of user 0 dumped core.#012#012Stack trace of thread 8605:#012#0 0x000000000046cdfe n/a (omiagent)#012#1 0x00000000004049d8 n/a (omiagent)#012#2 0x00007f1e727f2cf3 __libc_start_main (libc.so.6)#012#3 0x0000000000404129 n/a (omiagent)
Aug 16 15:17:48 rhel-8-dsctest systemd[1]: systemd-coredump@1-8701-0.service: Succeeded.
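The `#012` sequences in the systemd-coredump entry look like the syslog daemon's octal escaping of control characters (`#012` is octal for a newline), which is why the stack trace is squashed onto one line. A minimal Python sketch for making such an entry readable again, assuming that escaping scheme; the helper name `decode_syslog_escapes` is hypothetical, and only escapes for control characters (octal 000 to 037) are decoded so that frame markers such as `#0` and `#1` are left alone:

```python
import re

# Matches '#' followed by a three-digit octal control-character code (000-037).
_ESCAPE_RE = re.compile(r"#(0[0-3][0-7])")


def decode_syslog_escapes(line: str) -> str:
    """Replace '#NNN' octal escapes (e.g. '#012') with the character they encode."""
    return _ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 8)), line)


# Message text copied from the systemd-coredump entry above.
raw = (
    "Process 8605 (omiagent) of user 0 dumped core.#012#012"
    "Stack trace of thread 8605:#012"
    "#0 0x000000000046cdfe n/a (omiagent)#012"
    "#1 0x00000000004049d8 n/a (omiagent)#012"
    "#2 0x00007f1e727f2cf3 __libc_start_main (libc.so.6)#012"
    "#3 0x0000000000404129 n/a (omiagent)"
)

decoded = decode_syslog_escapes(raw)
print(decoded)
```

The decoded text puts each stack frame on its own line, with frame #0 at 0x46cdfe, the same address as the `ip:` field in the kernel trap line.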

JumpingYang001 commented 1 year ago

@eehret I see the latest dsc is 1.2.3-0. Can you run rpm -qa|grep dsc;dpkg -l|grep dsc to check your dsc version? https://github.com/microsoft/PowerShell-DSC-for-Linux/releases

eehret commented 1 year ago

Hi @JumpingYang001 Yes, that's right. We're running 1.2.3 here as well. The DSC version I mentioned above was the version of the DSC VM extension for Linux (for which dsc-1.2.3 is a dependency). I have updated the body of the issue to clarify.

JumpingYang001 commented 1 year ago

@eehret We don't have an Azure box. We just installed OMI-1.6.10-2 + DSC-1.2.3-0 locally and ran PerformRequiredConfigurationChecks.py, and it outputs:

instance of PerformRequiredConfigurationChecks
{
    ReturnValue = 0
}

Do we need to configure anything locally to reproduce this?
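For a scripted reproduction, the `ReturnValue` field could be pulled out of that output automatically. A minimal Python sketch, assuming the script prints the instance in the shape shown above; the helper name `parse_return_value` is hypothetical, and what the script prints when omiagent crashes is not shown in this thread, so the helper simply returns `None` when the field is missing:

```python
import re
from typing import Optional


def parse_return_value(output: str) -> Optional[int]:
    """Return the integer after 'ReturnValue =' in the script output, or None if absent."""
    match = re.search(r"^\s*ReturnValue\s*=\s*(-?\d+)", output, re.MULTILINE)
    return int(match.group(1)) if match else None


# Output copied from the comment above.
sample = """instance of PerformRequiredConfigurationChecks
{
    ReturnValue = 0
}
"""

print(parse_return_value(sample))
```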

eehret commented 1 year ago

Hi @JumpingYang001 No, not that I'm aware of. This was basically a "vanilla" VM from Azure marketplace, nothing was done to it other than turning it on and installing a couple of VM extensions.

FYI, we have pinned our OMI version to 1.6.9 across all environments in our tenant in order to avoid the bad update. Not ideal but at least we can avoid the crashes for now until another future OMI update gets released.
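A pin like this can be checked in automation by comparing version numbers numerically rather than as strings, since "1.6.10" sorts before "1.6.9" in a plain string comparison. A minimal Python sketch, assuming the version appears as three dot-separated integers somewhere in the package string; `PINNED_OMI`, `parse_version`, and `exceeds_pin` are hypothetical names, not part of OMI or DSC:

```python
import re
from typing import Tuple

# Version this thread reports as working; anything newer breaks the pin.
PINNED_OMI = (1, 6, 9)


def parse_version(package: str) -> Tuple[int, int, int]:
    """Extract the first 'X.Y.Z' triple from a string such as 'omi-1.6.10-2'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", package)
    if match is None:
        raise ValueError(f"no X.Y.Z version found in {package!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def exceeds_pin(package: str) -> bool:
    """True if the package version is newer than the pinned version."""
    return parse_version(package) > PINNED_OMI


print(exceeds_pin("omi-1.6.10-2"))
print(exceeds_pin("1.6.9"))
```

Tuples of integers compare element by element, so (1, 6, 10) is correctly treated as newer than (1, 6, 9).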

JumpingYang001 commented 1 year ago

@eehret Can you ask support, create a ticket in the Azure portal, or ask them to create an ICM, so that they might provide us with a VM that reproduces the issue?

eehret commented 1 year ago

@JumpingYang001 Sorry I'm not really sure what you mean. You're Microsoft aren't you? You guys don't have an Azure tenant at your disposal for things like this?

I'm happy to help if there's specific information you need that I might be able to provide. I can reproduce the issue easily at any time.

JumpingYang001 commented 1 year ago

@eehret There might be an issue with the OMI provider itself (DSC in this case). It is better to open a ticket with the respective Azure service for a first level of investigation; if required, the OMI team will sync up with them internally.

eehret commented 1 year ago

Update: This is still happening with the following versions:

Azure DSC VM extension version: 3.8.3
dsc package version: dsc-1.2.4
OMI version: 1.6.11

Dec 12 18:00:01 rhel-8-dsctest kernel: traps: omiagent[675207] general protection fault ip:46cdfe sp:7ffecf339610 error:0 in omiagent[400000+b8000]
Dec 12 18:00:02 rhel-8-dsctest systemd[1]: Starting Cleanup of Temporary Directories...
Dec 12 18:00:02 rhel-8-dsctest systemd[1]: Started Process Core Dump (PID 675476/UID 0).
Dec 12 18:00:02 rhel-8-dsctest systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
Dec 12 18:00:02 rhel-8-dsctest systemd[1]: Started Cleanup of Temporary Directories.
Dec 12 18:00:02 rhel-8-dsctest systemd-coredump[675478]: Process 675207 (omiagent) of user 0 dumped core.
Dec 12 18:00:02 rhel-8-dsctest systemd[1]: systemd-coredump@144-675476-0.service: Succeeded.
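The kernel trap lines from August and December can be compared mechanically. A minimal Python sketch, assuming the `traps:` line layout seen in this thread (`ip:`, `sp:`, `error:`, then `in module[base+size]`, all in hex); `TRAP_RE` and `parse_trap` are hypothetical names:

```python
import re

TRAP_RE = re.compile(
    r"traps: (?P<comm>[^\s\[]+)\[(?P<pid>\d+)\] (?P<reason>.+?) "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:(?P<error>[0-9a-f]+)"
    r" in (?P<module>[^\s\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_trap(line: str):
    """Parse a kernel 'traps:' line; return a dict of fields or None if it does not match."""
    m = TRAP_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "reason": m["reason"],
        "ip": ip,
        "module": m["module"],
        "offset": ip - base,  # faulting address relative to the module's mapping
    }


# Trap lines copied from the two log excerpts in this thread.
august = ("Aug 16 15:17:48 rhel-8-dsctest kernel: traps: omiagent[8605] "
          "general protection fault ip:46cdfe sp:7ffc445ddc50 error:0 "
          "in omiagent[400000+b8000]")
december = ("Dec 12 18:00:01 rhel-8-dsctest kernel: traps: omiagent[675207] "
            "general protection fault ip:46cdfe sp:7ffecf339610 error:0 "
            "in omiagent[400000+b8000]")

a, b = parse_trap(august), parse_trap(december)
print(hex(a["offset"]), a["ip"] == b["ip"])
```

Both lines report the same instruction pointer and the same mapping base, which suggests, without proving it, that omiagent faults at the same location in both reports. With a build that carries debug symbols, the address could be passed to a symbolizer to find the failing function.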

eehret commented 1 year ago

@JumpingYang001 If you don't believe this is the right place to report this issue then can you help me find the right place? I presume you understand how Microsoft is organized better than I do. Thanks!

eehret commented 1 year ago

I have now opened a support case with Microsoft. I don't know what ICM is, but hopefully we'll figure out how to get the problem resolved this way.

JumpingYang001 commented 1 year ago

@eehret Once you create the support case, someone from the support team will contact you. You can then ask them to create an ICM to reach the DSC team first; if required, the DSC team will add the OMI team there. Thanks.

eehret commented 1 year ago

@JumpingYang001 Thanks. Someone is taking a look at it now!

JumpingYang001 commented 1 year ago

Fixed in https://github.com/microsoft/omi/commit/a26b52a748bf942af5bc72aef80b7fb72e79b3df