microsoft / SCXcore

System Center Cross Platform Provider for Operations Manager
Microsoft Public License

"omiagent" segfault libnss_dns on Linux (scx provider) #96

Closed srice01 closed 4 years ago

srice01 commented 6 years ago

Copied over from https://github.com/Microsoft/omi/issues/491 (please see this for full communication on this issue).

On our RM provisioned VMs in Azure we noticed that the root partition is filling up with large numbers of "core.###" files in the /var/opt/omi/run directory.

Further investigation shows segmentation faults (in /var/log/messages) as follows:

Jan 30 10:17:17 ML001 kernel: omiagent[2054]: segfault at 7f6b7e181e00 ip 00007f6b7e181e00 sp 00007f6b78846d50 error 14
Jan 30 10:32:12 ML001 kernel: omiagent[3298]: segfault at 7fa8356f4e00 ip 00007fa8356f4e00 sp 00007fa82fdb9d50 error 14
Jan 30 11:02:19 ML001 kernel: omiagent[5873]: segfault at 7fbd84d3de00 ip 00007fbd84d3de00 sp 00007fbd7ede0d50 error 14 in libnss_dns-2.17.so[7fbd84ec5000+5000]
Jan 30 11:17:19 ML001 kernel: omiagent[13175]: segfault at 7fbac740ae00 ip 00007fbac740ae00 sp 00007fbac54bcd50 error 14 in libnss_dns-2.17.so[7fbac7592000+5000]
Jan 30 11:32:21 ML001 kernel: omiagent[20049]: segfault at 7f79230dfe00 ip 00007f79230dfe00 sp 00007f791d7a4d50 error 14
Jan 30 12:02:14 ML001 kernel: omiagent[46782]: segfault at 7f9fa8939e00 ip 00007f9fa8939e00 sp 00007f9fa296dd50 error 14 in libnss_dns-2.17.so[7f9fa8ac1000+5000]
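
A note for anyone reading these kernel lines: in every entry the fault address equals the instruction pointer, and the trailing "error" value is the x86 page-fault error code, which the kernel prints in hexadecimal. Below is a minimal sketch of decoding it, assuming a POSIX shell and the standard x86 bit layout. The function name `decode_pf_error` is made up for illustration and is not part of any tool mentioned in this thread.

```shell
# Hypothetical helper (not part of omi or scx): decode the x86 page-fault
# error code from a kernel "segfault at ... error N" line.
# Assumption: N is hexadecimal, as printed by the kernel.
decode_pf_error() (
    code=$(( 0x$1 ))

    # Bit 0: 0 = page not present, 1 = protection violation
    if [ $(( code & 0x01 )) -ne 0 ]; then out="protection-violation"; else out="page-not-present"; fi
    # Bit 1: 0 = read access, 1 = write access
    if [ $(( code & 0x02 )) -ne 0 ]; then out="$out write"; else out="$out read"; fi
    # Bit 2: 0 = kernel mode, 1 = user mode
    if [ $(( code & 0x04 )) -ne 0 ]; then out="$out user-mode"; else out="$out kernel-mode"; fi
    # Bit 3: reserved bit set in a page-table entry
    if [ $(( code & 0x08 )) -ne 0 ]; then out="$out reserved-bit-set"; fi
    # Bit 4: fault was caused by an instruction fetch
    if [ $(( code & 0x10 )) -ne 0 ]; then out="$out instruction-fetch"; fi

    echo "$out"
)
```

Calling `decode_pf_error 14` prints `page-not-present read user-mode instruction-fetch`, which suggests the process tried to execute code at an address that was not mapped. That would fit with the fault address and instruction pointer being identical in each line.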

Environment information:

Operating System: CentOS Release 7.4.1708 (fully patched, that is, "yum update" shows no updates pending).

So far the workaround has been to write a cron job (!) to periodically wipe the core files but obviously this is not an ideal situation.
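
For reference, the cleanup workaround could be sketched roughly as follows. This is only an illustration, assuming GNU find (as shipped with CentOS 7); the function name `purge_old_cores` and the 60-minute default are hypothetical choices, not what is actually deployed.

```shell
# Hypothetical helper: delete core.* files older than a given number of
# minutes from the omi run directory. Defaults are assumptions.
purge_old_cores() (
    dir=${1:-/var/opt/omi/run}
    max_age_min=${2:-60}

    # -maxdepth 1: do not descend into subdirectories
    # -mmin +N:    only files last modified more than N minutes ago
    # -delete:     remove each match (GNU find extension)
    find "$dir" -maxdepth 1 -type f -name 'core.*' -mmin "+$max_age_min" -delete
)
```

Saved in a script and called from cron as `purge_old_cores /var/opt/omi/run 60`, this keeps the most recent dumps available for debugging while preventing the partition from filling up.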

Further information from "JumpingYang001":

Following debug info shows omiagent loaded scx provider:

(gdb) info sharedlibrary

From To Syms Read Shared Object Library
0x00007fa598b2f900 0x00007fa598b3ace1 Yes (*) /lib64/libpthread.so.0
0x00007fa598926e60 0x00007fa59892795e Yes (*) /lib64/libdl.so.2
0x00007fa598719670 0x00007fa598720d0c Yes (*) /lib64/libpam.so.0
0x00007fa5984bfbb0 0x00007fa5984fb58d Yes (*) /opt/omi/lib/libssl.so.1.0.0
0x00007fa5980b0f00 0x00007fa5981e8bd7 Yes (*) /opt/omi/lib/libcrypto.so.1.0.0
0x00007fa597ca0480 0x00007fa597de6bcf Yes (*) /lib64/libc.so.6
0x00007fa598d46b10 0x00007fa598d61440 Yes (*) /lib64/ld-linux-x86-64.so.2
0x00007fa597a5c100 0x00007fa597a62402 Yes (*) /lib64/libaudit.so.1
0x00007fa597818650 0x00007fa59784aa1a Yes (*) /lib64/libgssapi_krb5.so.2
0x00007fa597549a10 0x00007fa5975b0e8a Yes (*) /lib64/libkrb5.so.3
0x00007fa597321570 0x00007fa597322143 Yes (*) /lib64/libcom_err.so.2
0x00007fa5970f18c0 0x00007fa59710fc0f Yes (*) /lib64/libk5crypto.so.3
0x00007fa596ed9170 0x00007fa596ee56f8 Yes (*) /lib64/libz.so.1
0x00007fa596cd2580 0x00007fa596cd43bc Yes (*) /lib64/libcap-ng.so.0
0x00007fa596ac6890 0x00007fa596acd42b Yes (*) /lib64/libkrb5support.so.0
0x00007fa5968c05b0 0x00007fa5968c11cc Yes (*) /lib64/libkeyutils.so.1
0x00007fa5966a89d0 0x00007fa5966b77e1 Yes (*) /lib64/libresolv.so.2
0x00007fa596484ac0 0x00007fa59649a8c6 Yes (*) /lib64/libselinux.so.1
0x00007fa59621d5f0 0x00007fa5962635b0 Yes (*) /lib64/libpcre.so.1
0x00007fa595ed6430 0x00007fa596034438 Yes /opt/omi/lib/libSCXCoreProviderModule.so
0x00007fa598e0fcc0 0x00007fa598e2b568 Yes /opt/omi/lib/libmicxx.so
0x00007fa595b78e50 0x00007fa595b7daac Yes (*) /lib64/libcrypt.so.1
0x00007fa595972250 0x00007fa59597504c Yes (*) /lib64/librt.so.1
0x00007fa5956c3510 0x00007fa59572a5ba Yes (*) /lib64/libstdc++.so.6
0x00007fa59536b370 0x00007fa5953d6276 Yes (*) /lib64/libm.so.6
0x00007fa595152af0 0x00007fa5951622a5 Yes (*) /lib64/libgcc_s.so.1
0x00007fa594f4dba0 0x00007fa594f4e309 Yes (*) /lib64/libfreebl3.so
0x00007fa58e8131d0 0x00007fa58e81a3e1 Yes (*) /lib64/libnss_files.so.2
0x00007fa58e60c090 0x00007fa58e60f4f0 Yes (*) /lib64/libnss_dns.so.2
0x00007fa58d908ec0 0x00007fa58d933b0f Yes (*) /lib64/libssl3.so
0x00007fa58d6df380 0x00007fa58d6f3e57 Yes (*) /lib64/libsmime3.so
0x00007fa58d3c5740 0x00007fa58d498654 Yes (*) /lib64/libnss3.so
0x00007fa58d18b390 0x00007fa58d199d45 Yes (*) /lib64/libnssutil3.so
0x00007fa58cf7bf10 0x00007fa58cf7cc78 Yes (*) /lib64/libplds4.so
0x00007fa58cd77510 0x00007fa58cd78b78 Yes (*) /lib64/libplc4.so
0x00007fa58cb44ca0 0x00007fa58cb64cc0 Yes (*) /lib64/libnspr4.so
0x00007fa58c27e2d0 0x00007fa58c2a7f5c Yes (*) /lib64/libsoftokn3.so
---Type <return> to continue, or q <return> to quit---
0x00007fa57e552a00 0x00007fa57e5da860 Yes (*) /lib64/libsqlite3.so.0
0x00007fa57e2c8bc0 0x00007fa57e32196d Yes (*) /lib64/libfreeblpriv3.so
0x00007fa58c077cd0 0x00007fa58c0783cb Yes (*) /lib64/libnsssysinit.so
0x00007fa57e09e7e0 0x00007fa57e0b8496 Yes (*) /lib64/libnsspem.so

(*): Shared library is missing debugging information.

(gdb)

The crash is at 0x00007fa58e405e00, which is in /lib64/libnss_dns.so.2. That is the same library named in your segmentation faults in /var/log/messages.

http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /opt/omi/bin/omiagent...done.
[New LWP 123588]
[New LWP 108293]
[New LWP 108369]
[New LWP 108394]
[New LWP 108295]
[New LWP 108294]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/omi/bin/omiagent 9 10 --destdir / --providerdir /opt/omi/lib --loglevel WA'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fa58e405e00 in ?? ()
Missing separate debuginfos, use: debuginfo-install omi-1.4.2-1.x86_64
(gdb) bt
#0 0x00007fa58e405e00 in ?? ()
#1 0x00007fa58e449f47 in ?? ()
#2 0x00007fa58e449b60 in ?? ()
#3 0xffffffff00000073 in ?? ()
#4 0x0000000000000000 in ?? ()

Here are the threads:

(gdb) info threads
Id Target Id Frame
6 Thread 0x7fa598e01f00 (LWP 108294) 0x00007fa598b35cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
5 Thread 0x7fa598dc2f00 (LWP 108295) 0x00007fa598b35cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
4 Thread 0x7fa58e5caf00 (LWP 108394) 0x00007fa598b35cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
3 Thread 0x7fa58e609f00 (LWP 108369) 0x00007fa598b35cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
2 Thread 0x7fa598f53880 (LWP 108293) 0x00007fa597d707a3 in select () from /lib64/libc.so.6
(*) 1 Thread 0x7fa57ffff700 (LWP 123588) 0x00007fa58e405e00 in ?? ()

Please let me know if any further debug information is required.

srice01 commented 5 years ago

Is there any activity on this? Even after almost a year we are still having these same problems with a number of our nodes.

asoccer commented 5 years ago

This is still a very prominent issue in Azure. Is there seriously no work being put into this anymore? It's a broken tool that's causing production VMs to run out of disk space.

srice01 commented 5 years ago

We have ended up creating a cron job to delete the core files (hopefully frequently enough to avoid HD filling) rather than waiting for a fix from Microsoft that it appears will never come.

johanburati commented 5 years ago

@srice01 Are you still having this issue with the latest versions?

srice01 commented 5 years ago

Yes (I am using CentOS 7.6.1810).

[root]# rpm -qa | grep -i omi
omi-1.6.2-0.x86_64
[root]# rpm -qa | grep -i scx
scx-1.6.3-659.x86_64
[root]# rpm -qa | grep -i walinux
WALinuxAgent-2.2.42-1.el7.noarch
[root]# rpm -qa | grep -i oms
auoms-2.0.0-13.x86_64
omsagent-1.11.0-9.x86_64
omsconfig-1.1.1-926.x86_64

[root]# ls -al /var/opt/omi/run/
total 404888
drwxr-xr-x. 3 omi omi 4096 Sep 23 16:31 .
drwxr-xr-x. 8 root root 81 May 30 04:23 ..
-rw------- 1 root root 30789632 Sep 23 08:01 core.101250
-rw------- 1 root root 30789632 Sep 23 08:16 core.104089
-rw------- 1 root root 30814208 Sep 23 08:31 core.106826
-rw------- 1 root root 30728192 Sep 23 08:46 core.109601
-rw------- 1 root root 30711808 Sep 23 09:01 core.112330
-rw------- 1 root root 30728192 Sep 23 09:16 core.115170
-rw------- 1 root root 30793728 Sep 23 09:31 core.117975
-rw------- 1 root root 30801920 Sep 23 09:46 core.120825
-rw------- 1 root root 30814208 Sep 23 10:01 core.123533
-rw------- 1 root root 30814208 Sep 23 11:46 core.12592
...

johanburati commented 5 years ago

@srice01 Could you please open a support ticket and ask them to engage me (joburati)? That way I will be able to follow up with the devs internally and get this issue worked on.

srice01 commented 5 years ago

I am assuming you mean for me to create a support ticket in Azure. This is support request 119092422001455.

johanburati commented 5 years ago

Thanks @srice01, will get in touch with you via the ticket and try to get this moving.

johanburati commented 4 years ago

@srice01 Good news, I was able to fix the problem on your image.

The issue is that the DSCForLinux extension installs version 1.1.1-294 of the dsc package, and this version causes omiagent to segfault. Installing version 1.1.1-926 fixes the issue.

All those cases are related to this issue:

I have already submitted a fix to bump up the version of the dsc package:

I am following up with PG internally for them to merge and push the fix:

Meanwhile you can fix the issue by installing the package manually:

wget https://github.com/microsoft/PowerShell-DSC-for-Linux/releases/download/v1.1.1-926/dsc-1.1.1-926.ssl_098.x64.rpm
yum upgrade dsc-1.1.1-926.ssl_098.x64.rpm -y
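
To decide whether a node still needs the manual upgrade, a check along these lines may help. It is a sketch only: the function name `dsc_is_older_than` is hypothetical, it relies on GNU `sort -V` for version ordering, and it assumes the installed package reports its version and release as `1.1.1-294` or `1.1.1-926`.

```shell
# Hypothetical helper: succeed (exit status 0) when the installed dsc
# version sorts strictly before the fixed one.
# Assumption: versions look like "1.1.1-294" and GNU sort -V is available.
dsc_is_older_than() (
    installed=$1
    fixed=${2:-1.1.1-926}

    # Equal versions are not "older".
    [ "$installed" != "$fixed" ] &&
        # With version sort, the older of the two comes out first.
        [ "$(printf '%s\n%s\n' "$installed" "$fixed" | sort -V | head -n 1)" = "$installed" ]
)
```

For example, `dsc_is_older_than "$(rpm -q --qf '%{VERSION}-%{RELEASE}' dsc)" && echo "upgrade needed"` would flag a node that is still on the older package.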

I hope this helps.

srice01 commented 4 years ago

@johanburati - This is indeed good news. Given that the DSCForLinux extension is installed by Azure (not ourselves) I take it your changes are to make sure the fixed version is installed by default in future?

johanburati commented 4 years ago

@srice01 yes

Once my patch is merged and a new release of the DSCForLinux extension is pushed by the devs, it will be fixed for good. Until then you will have to bump up the version of the package manually.

johanburati commented 4 years ago

If you are having this issue, check https://github.com/Azure/azure-linux-extensions/issues/875 for details and the solution.

srice01 commented 4 years ago

It has been 24 hours since I installed the update and I have seen no core dumps, so I believe this is now resolved.
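
For anyone wanting to run the same check, here is a rough sketch of counting core files written in the last 24 hours. It assumes GNU find; the function name `count_recent_cores` and its defaults are illustrative only.

```shell
# Hypothetical helper: count core.* files modified within the last N minutes
# (default 1440 = 24 hours) in the omi run directory.
count_recent_cores() (
    dir=${1:-/var/opt/omi/run}
    minutes=${2:-1440}

    # -mmin -N matches files modified less than N minutes ago.
    find "$dir" -maxdepth 1 -type f -name 'core.*' -mmin "-$minutes" | wc -l
)
```

A result of 0 from `count_recent_cores` a day after the upgrade would indicate that omiagent has stopped dumping core.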