intel / openlldp

Other
54 stars 43 forks source link

lldpad generated core dump with SIGSEGV in dcbx_unregister #96

Closed SrijitNair closed 1 year ago

SrijitNair commented 1 year ago

Hi, The lldp on my alma crashed with a core dump. The analysis gives the following trace " Core was generated by `/usr/sbin/lldpad -t'. Program terminated with signal SIGSEGV, Segmentation fault.

0 0x000056219c1628ce in dcbx_unregister ()".

we are using : lldpad-1.0.1-16.git036e314.el8.x86_64 AlmaLinux release 8.6 (Sky Tiger)

Is there a known bug and a fix ? Thanks, Srijit

penguin359 commented 1 year ago

This is the first report of this issue so this would not be a known bug. As far as I know, dcbx_unregister() is only called in the exit path of lldpad and in the reconfigure path which is triggered by a SIGHUP signal to reload its configuration. What were you doing when this error was triggered?

Do you have the core dump? Do you have any packet traces from this period? Are you using DCBX in your environment? Do you have any special configuration you've done to OpenLLDP? How often is this error triggered?

We need a little more information before we can investigate this issue.

penguin359 commented 1 year ago

Also, I should mention that AlmaLinux 8.6 is no longer in support. Currently, only 8.8 and 9 is under support.

SrijitNair commented 1 year ago

Some more logs after turning on the verbose mode in lldpad lldpad[4711]: link status: 2 lldpad[4711]: device name: GE0-2_ll2 lldpad[4711]: *** LINK DOWN: GE0-2_ll2 lldpad[4711]: event_if_decode_nlmsg: calling ifdown for agent 0x5593648b3360. lldpad[4711]: mand_ifdown:port GE0-2_ll2 removed lldpad[4711]: basman_ifdown:port GE0-2_ll2 adding failed lldpad[4711]: med_ifdown:port GE0-2_ll2 adding failed lldpad[4711]: ieee8023_ifdown:port GE0-2_ll2 adding failed lldpad[4711]: evb_ifdown:GE0-2_ll2 agent 2 called lldpad[4711]: evb_ifdown:GE0-2_ll2 agent 2 does not exist. lldpad[4711]: evb22_ifdown:GE0-2_ll2 agent 2 called lldpad[4711]: evb22_ifdown:GE0-2_ll2 agent 2 does not exist. lldpad[4711]: event_if_decode_nlmsg: calling ifdown for agent 0x5593648b3290. lldpad[4711]: mand_ifdown:port GE0-2_ll2 removed lldpad[4711]: basman_ifdown:port GE0-2_ll2 adding failed lldpad[4711]: med_ifdown:port GE0-2_ll2 adding failed lldpad[4711]: ieee8023_ifdown:port GE0-2_ll2 adding failed lldpad[4711]: event_if_decode_nlmsg: calling ifdown for agent 0x5593648b31c0. systemd[1]: Stopping Link Layer Discovery Protocol Agent Daemon.... lldpad[4711]: mand_ifd systemd[1]: lldpad.service: Main process exited, code=dumped, status=11/SEGV systemd[1]: lldpad.service: Failed with result 'core-dump'. systemd[1]: Stopped Link Layer Discovery Protocol Agent Daemon..

SrijitNair commented 1 year ago

The core is generated everytime I stop the lldpad service. systemctl stop lldpad

penguin359 commented 1 year ago

Can you reproduce the crash on a supported release of AlmaLinux 8.8 or 9 and updated lldpad. 8.6 is no longer supported by them. Also, it would be helpful to get the core dump with the debuginfo and debugsource packages installed for analysis. You can find the lldpad debug packages for AlmaLinux 8.8 here, for example: http://repo.almalinux.org/vault/8.8/BaseOS/debug/x86_64/Packages/ Please update to a supported release of OpenLLDP from AlmaLinux and install the matching debug packages and try to reproduce the crash in that environment.

SrijitNair commented 1 year ago

Some more logs from dmesg : [16444.078882] lldpad[117136]: segfault at 0 ip 000055e8e90ad80e sp 00007ffee81a0ea0 error 6 in lldpad[55e8e9091000+61000] [16444.078892] Code: 85 db 74 50 0f 1f 84 00 00 00 00 00 48 8b 53 78 48 8b 83 80 00 00 00 48 85 d2 74 0e 48 89 82 80 00 00 00 48 8b 83 80 00 00 00 <48> 89 10 48 89 df e8 17 fe ff ff 48 8b 7b 18 e8 7e fd ff ff 48 89

SrijitNair commented 1 year ago

In this code, the module pointer should have a null check / BUG: need to check if tlvs are freed / void dcbx_unregister(struct lldp_module mod) { dcbx_remove_all(); deinit_drv_if(); if (mod->data) { dcbx_free_data((struct dcbd_user_data ) mod->data); free(mod->data); } free(mod); LLDPAD_DBG("%s: unregister dcbx complete.\n", func); }

penguin359 commented 1 year ago

I don't think that that is the cause of this crash. In every case that dcbx_register is called, it's called with this construct:

module->ops->lldp_mod_unregister(module);

Which means that module must be non-NULL in order for it to dereference it to find the lldp_mod_unregister operation which is what dcbx_unregister() is. If you want to produce a proper crash report, please install a supported version of lldap and it's matching debuginfo and debugsource packages. This will allow you to see the exact point it crashed. You will need to be using lldpad from AlmaLinux 8.8 which is a still supported release. We cannot assist you with an unsupported release like 8.6.

SrijitNair commented 1 year ago

(gdb) bt

0 dcbx_free_data (dud=) at lldp_dcbx.c:372

1 dcbx_unregister (mod=0x564625367c40) at lldp_dcbx.c:467

2 0x0000564623c6cd17 in deinit_modules () at lldpad.c:123

3 0x0000564623c6c8dc in main (argc=, argv=) at lldpad.c:469

apconole commented 1 year ago

something strange is going on - why is module pointer valid, but data pointer isn't? Are you using the latest code? How do you reproduce this?

SrijitNair commented 1 year ago

We disabled lldp on the hardware and enabled lldp on a veth pair. These are intel 810 cards. On execution of systemctl stop lldpad, the cores are generated. Its easily reproducible.

SrijitNair commented 1 year ago

dcbx_free_data((struct dcbd_user_data ) mod->data). The function dcbx_free_data is called with the mod->data pointer, which is cast to (struct dcbd_user_data ). If the casted pointer is not a valid pointer to struct dcbd_user_data, it can lead to undefined behavior or a segmentation fault I guess

SrijitNair commented 1 year ago

Is this a safer code and prevent a crash ?

/ BUG: need to check if tlvs are freed / void dcbx_unregister(struct lldp_module mod) { dcbx_remove_all(); deinit_drv_if(); if (mod->data) { dcbd_user_data_ptr = NULL; dcbd_user_data_ptr = (struct dcbd_user_data ) mod->data; if (dcbd_user_data_ptr != NULL){ LLDPAD_DBG("dcbd_user_data_ptr is valid. Calling dcbx_free_data()"); dcbx_free_data(dcbd_user_data_ptr); } if (mod->data){ LLDPAD_DBG("Calling free for mod-data"); free(mod->data); mod->data = NULL; } } free(mod); LLDPAD_DBG("%s: unregister dcbx complete.\n", func); }

SrijitNair commented 1 year ago

The NULL check is not helping [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `/usr/sbin/lldpad -t -V9'. Program terminated with signal SIGSEGV, Segmentation fault.

0 dcbx_free_data (dud=0x557147353c90) at lldp_dcbx.c:372

372 LIST_REMOVE(dd, entry); Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-164.el8.x86_64 libconfig-1.5-9.el8.x86_64 libnl3-3.5.0-1.el8.x86_64

penguin359 commented 1 year ago

I did not expect that the NULL check would help as that is the same pointer that is already checked with the surrounding if(mod->data) {...} check. If you look at the backtrack from your crash, you will see it failed inside of dcbx_free_data() which navigates a linked list and removed the item from that list. I suspect that linked-list has somehow been corrupted and should be examined at the point of the crash.

https://github.com/intel/openlldp/blob/11171b474f6f3cbccac5d608b7f26b32ff72c651/lldp_dcbx.c#L344

Also, if you want to help debug this crash, please update to a currently supported version for AlmaLinux. We cannot assist with software bugs from an unsupported distribution release. At a minimum, you should be running the latest package of lldpad found her for your distribution:

https://repo.almalinux.org/almalinux/8.8/BaseOS/x86_64/os/Packages/

Also, please install both the debuginfo and debugsource packages matching that and use that when producing further crash dumps/backtraces. These can be found here:

https://repo.almalinux.org/vault/8.8/BaseOS/debug/x86_64/Packages/

penguin359 commented 1 year ago

@SrijitNair I believe we have a fix for your issue. I did not catch it until today, but I believe PR #97 which @tabraham has pushed should fix the crash that you are seeing. Can you please try to build that PR and see if it prevents the crash you are seeing?

SrijitNair commented 1 year ago

I am trying to build the rpm from master and encountering error with the patch applied. what could be wrong ? 1 out of 13 hunks FAILED -- saving rejects to file lldp_dcbx.c.rej error: Bad exit status from /var/tmp/rpm-tmp.qlFXEt (%prep)

RPM build errors: Bad exit status from /var/tmp/rpm-tmp.qlFXEt (%prep)

SrijitNair commented 1 year ago

@SrijitNair I believe we have a fix for your issue. I did not catch it until today, but I believe PR #97 which @tabraham has pushed should fix the crash that you are seeing. Can you please try to build that PR and see if it prevents the crash you are seeing?

How can I get this patch onto alma ?

apconole commented 1 year ago

I've just merged to master and branch-1.1 - you can try building from upstream by checking out and following the installation instructions. A tl;dr - './bootstrap.sh && ./configure && make rpm'

SrijitNair commented 1 year ago

I've just merged to master and branch-1.1 - you can try building from upstream by checking out and following the installation instructions. A tl;dr - './bootstrap.sh && ./configure && make rpm'

Thanks. I will give that a try.

SrijitNair commented 1 year ago

@penguin359 @orgcandman @tabraham Thanks for you support and helping me to resolve this issue. Thanks, Srijit