Closed SrijitNair closed 1 year ago
This is the first report of this issue so this would not be a known bug. As far as I know, dcbx_unregister() is only called in the exit path of lldpad and in the reconfigure path which is triggered by a SIGHUP signal to reload its configuration. What were you doing when this error was triggered?
Do you have the core dump? Do you have any packet traces from this period? Are you using DCBX in your environment? Do you have any special configuration you've done to OpenLLDP? How often is this error triggered?
We need a little more information before we can investigate this issue.
Also, I should mention that AlmaLinux 8.6 is no longer supported. Currently, only 8.8 and 9 are supported.
Some more logs after turning on verbose mode in lldpad:
lldpad[4711]: link status: 2
lldpad[4711]: device name: GE0-2_ll2
lldpad[4711]: *** LINK DOWN: GE0-2_ll2
lldpad[4711]: event_if_decode_nlmsg: calling ifdown for agent 0x5593648b3360.
lldpad[4711]: mand_ifdown:port GE0-2_ll2 removed
lldpad[4711]: basman_ifdown:port GE0-2_ll2 adding failed
lldpad[4711]: med_ifdown:port GE0-2_ll2 adding failed
lldpad[4711]: ieee8023_ifdown:port GE0-2_ll2 adding failed
lldpad[4711]: evb_ifdown:GE0-2_ll2 agent 2 called
lldpad[4711]: evb_ifdown:GE0-2_ll2 agent 2 does not exist.
lldpad[4711]: evb22_ifdown:GE0-2_ll2 agent 2 called
lldpad[4711]: evb22_ifdown:GE0-2_ll2 agent 2 does not exist.
lldpad[4711]: event_if_decode_nlmsg: calling ifdown for agent 0x5593648b3290.
lldpad[4711]: mand_ifdown:port GE0-2_ll2 removed
lldpad[4711]: basman_ifdown:port GE0-2_ll2 adding failed
lldpad[4711]: med_ifdown:port GE0-2_ll2 adding failed
lldpad[4711]: ieee8023_ifdown:port GE0-2_ll2 adding failed
lldpad[4711]: event_if_decode_nlmsg: calling ifdown for agent 0x5593648b31c0.
systemd[1]: Stopping Link Layer Discovery Protocol Agent Daemon....
lldpad[4711]: mand_ifd
systemd[1]: lldpad.service: Main process exited, code=dumped, status=11/SEGV
systemd[1]: lldpad.service: Failed with result 'core-dump'.
systemd[1]: Stopped Link Layer Discovery Protocol Agent Daemon..
The core is generated every time I stop the lldpad service with: systemctl stop lldpad
Can you reproduce the crash on a supported release of AlmaLinux (8.8 or 9) with an updated lldpad? 8.6 is no longer supported by them. Also, it would be helpful to get the core dump with the debuginfo and debugsource packages installed for analysis. You can find the lldpad debug packages for AlmaLinux 8.8 here, for example: http://repo.almalinux.org/vault/8.8/BaseOS/debug/x86_64/Packages/ Please update to a supported release of OpenLLDP from AlmaLinux, install the matching debug packages, and try to reproduce the crash in that environment.
Some more logs from dmesg:
[16444.078882] lldpad[117136]: segfault at 0 ip 000055e8e90ad80e sp 00007ffee81a0ea0 error 6 in lldpad[55e8e9091000+61000]
[16444.078892] Code: 85 db 74 50 0f 1f 84 00 00 00 00 00 48 8b 53 78 48 8b 83 80 00 00 00 48 85 d2 74 0e 48 89 82 80 00 00 00 48 8b 83 80 00 00 00 <48> 89 10 48 89 df e8 17 fe ff ff 48 8b 7b 18 e8 7e fd ff ff 48 89
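For readers less familiar with this dmesg format, here is a minimal sketch of how the numbers in the segfault line can be read. It assumes the usual x86 page-fault error code layout (bit 0: page present, bit 1: write access, bit 2: user mode). The helper names are hypothetical and not part of lldpad.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Offset of the faulting instruction inside the mapping:
 * the "ip" value minus the base address printed in brackets. */
static uint64_t fault_offset(uint64_t ip, uint64_t map_base)
{
    return ip - map_base;
}

/* x86 page-fault error code bits, as printed in the "error N" field. */
static int fault_page_present(unsigned err) { return (err & 0x1) != 0; }
static int fault_is_write(unsigned err)     { return (err & 0x2) != 0; }
static int fault_from_user(unsigned err)    { return (err & 0x4) != 0; }

/* Print a human-readable summary of one segfault line. */
static void print_fault(uint64_t ip, uint64_t map_base, unsigned err)
{
    printf("offset in mapping: 0x%" PRIx64 "\n", fault_offset(ip, map_base));
    printf("access: %s, page %s, %s mode\n",
           fault_is_write(err) ? "write" : "read",
           fault_page_present(err) ? "present" : "not present",
           fault_from_user(err) ? "user" : "kernel");
}
```

With the values from the log above (ip 0x55e8e90ad80e, base 0x55e8e9091000, error 6), the offset is 0x1c80e and the fault is a user-mode write to a page that is not mapped. Together with "segfault at 0", that indicates a store through a NULL pointer. Note that the offset is relative to the start of the mapping shown in the log, which is not necessarily the load base of the binary.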
In this code, the module pointer should have a null check:

    /* BUG: need to check if tlvs are freed */
    void dcbx_unregister(struct lldp_module *mod)
    {
        dcbx_remove_all();
        deinit_drv_if();
        if (mod->data) {
            dcbx_free_data((struct dcbd_user_data *) mod->data);
            free(mod->data);
        }
        free(mod);
        LLDPAD_DBG("%s: unregister dcbx complete.\n", __func__);
    }
I don't think that is the cause of this crash. In every case where dcbx_unregister() is called, it's called with this construct:
module->ops->lldp_mod_unregister(module);
This means that module must be non-NULL, because it has to be dereferenced to find the lldp_mod_unregister operation, which is what dcbx_unregister() is. If you want to produce a proper crash report, please install a supported version of lldpad and its matching debuginfo and debugsource packages. This will allow you to see the exact point where it crashed. You will need to be using lldpad from AlmaLinux 8.8, which is a still-supported release. We cannot assist you with an unsupported release like 8.6.
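To illustrate the point about the dispatch, here is a minimal sketch with simplified, hypothetical versions of the structures (the real struct lldp_module and struct lldp_mod_ops in OpenLLDP have more members). Reaching the unregister handler at all requires reading module->ops, so a NULL module would fault at the call site, before the handler runs.

```c
#include <stddef.h>

struct lldp_module;

/* Simplified stand-in for the ops table; only one operation is shown. */
struct lldp_mod_ops {
    void (*lldp_mod_unregister)(struct lldp_module *);
};

/* Simplified stand-in for the module structure. */
struct lldp_module {
    void *data;
    const struct lldp_mod_ops *ops;
};

/* Records whether the handler saw a non-NULL data pointer (-1: never called). */
static int last_had_data = -1;

/* Hypothetical handler standing in for dcbx_unregister(). */
static void demo_unregister(struct lldp_module *mod)
{
    last_had_data = (mod->data != NULL);
}

static const struct lldp_mod_ops demo_ops = {
    .lldp_mod_unregister = demo_unregister,
};

/* Same construct as in the thread: module is dereferenced twice
 * (module->ops, then ->lldp_mod_unregister) before the handler is entered. */
static void call_unregister(struct lldp_module *module)
{
    module->ops->lldp_mod_unregister(module);
}
```

A valid module pointer says nothing about what module->data points to, which is why a crash inside the handler can still happen with a non-NULL module.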
(gdb) bt
Something strange is going on: why is the module pointer valid, but the data pointer isn't? Are you using the latest code? How do you reproduce this?
We disabled LLDP on the hardware and enabled LLDP on a veth pair. These are Intel 810 cards. On execution of systemctl stop lldpad, the cores are generated. It's easily reproducible.
dcbx_free_data((struct dcbd_user_data *) mod->data). The function dcbx_free_data is called with the mod->data pointer, which is cast to (struct dcbd_user_data *). If the cast pointer is not a valid pointer to struct dcbd_user_data, it can lead to undefined behavior or a segmentation fault, I guess.
Is this code safer, and would it prevent the crash?
    /* BUG: need to check if tlvs are freed */
    void dcbx_unregister(struct lldp_module *mod)
    {
        dcbx_remove_all();
        deinit_drv_if();
        if (mod->data) {
            dcbd_user_data_ptr = NULL;
            dcbd_user_data_ptr = (struct dcbd_user_data *) mod->data;
            if (dcbd_user_data_ptr != NULL) {
                LLDPAD_DBG("dcbd_user_data_ptr is valid. Calling dcbx_free_data()");
                dcbx_free_data(dcbd_user_data_ptr);
            }
            if (mod->data) {
                LLDPAD_DBG("Calling free for mod-data");
                free(mod->data);
                mod->data = NULL;
            }
        }
        free(mod);
        LLDPAD_DBG("%s: unregister dcbx complete.\n", __func__);
    }
The NULL check is not helping:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/lldpad -t -V9'.
Program terminated with signal SIGSEGV, Segmentation fault.
372 LIST_REMOVE(dd, entry);
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-164.el8.x86_64 libconfig-1.5-9.el8.x86_64 libnl3-3.5.0-1.el8.x86_64
I did not expect the NULL check to help, as that is the same pointer that is already checked by the surrounding if (mod->data) {...} check. If you look at the backtrace from your crash, you will see it failed inside dcbx_free_data(), which walks a linked list and removes each item from that list. I suspect that the linked list has somehow been corrupted and should be examined at the point of the crash.
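As background on why LIST_REMOVE can fault, here is a minimal sketch using the BSD list macros from <sys/queue.h>. LIST_REMOVE stores through the entry's le_prev pointer, so an entry whose link fields are zeroed or stale (never inserted, or already removed) causes a write through an invalid address. This would be consistent with the write to address 0 in the dmesg line earlier in the thread, though that is an inference rather than something confirmed here. The struct item type and the safe_remove() guard are hypothetical illustrations; they are not lldpad code and not necessarily what PR #97 changes.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical list element, standing in for the entries freed in lldpad. */
struct item {
    int id;
    LIST_ENTRY(item) entry;
};

LIST_HEAD(item_head, item);

/* Remove an item only if it appears to be linked.
 * Assumes unlinked items always have a NULL le_prev, which holds for
 * zero-initialized items and for items removed through this function.
 * Returns 1 if the item was removed, 0 if it was not on a list. */
static int safe_remove(struct item *it)
{
    if (it->entry.le_prev == NULL)
        return 0;
    LIST_REMOVE(it, entry);
    /* Clear the links so a second removal is detected instead of crashing. */
    it->entry.le_next = NULL;
    it->entry.le_prev = NULL;
    return 1;
}

/* Count the elements currently on the list. */
static int count_items(struct item_head *head)
{
    struct item *it;
    int n = 0;

    LIST_FOREACH(it, head, entry)
        n++;
    return n;
}
```

A guard like this only catches entries with cleared links. It cannot detect a dangling pointer into freed memory, so examining the list in the core dump, as suggested above, is still the way to find the actual cause.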
https://github.com/intel/openlldp/blob/11171b474f6f3cbccac5d608b7f26b32ff72c651/lldp_dcbx.c#L344
Also, if you want to help debug this crash, please update to a currently supported version of AlmaLinux. We cannot assist with software bugs from an unsupported distribution release. At a minimum, you should be running the latest lldpad package found here for your distribution:
https://repo.almalinux.org/almalinux/8.8/BaseOS/x86_64/os/Packages/
Also, please install both the debuginfo and debugsource packages matching that and use that when producing further crash dumps/backtraces. These can be found here:
https://repo.almalinux.org/vault/8.8/BaseOS/debug/x86_64/Packages/
@SrijitNair I believe we have a fix for your issue. I did not catch it until today, but I believe PR #97 which @tabraham has pushed should fix the crash that you are seeing. Can you please try to build that PR and see if it prevents the crash you are seeing?
I am trying to build the RPM from master and am encountering an error with the patch applied. What could be wrong?
1 out of 13 hunks FAILED -- saving rejects to file lldp_dcbx.c.rej
error: Bad exit status from /var/tmp/rpm-tmp.qlFXEt (%prep)
RPM build errors:
Bad exit status from /var/tmp/rpm-tmp.qlFXEt (%prep)
How can I get this patch onto AlmaLinux?
I've just merged to master and branch-1.1 - you can try building from upstream by checking out and following the installation instructions. A tl;dr - './bootstrap.sh && ./configure && make rpm'
Thanks. I will give that a try.
@penguin359 @orgcandman @tabraham Thanks for your support and for helping me resolve this issue. Thanks, Srijit
Hi, lldpad on my AlmaLinux system crashed with a core dump. The analysis gives the following trace:
Core was generated by `/usr/sbin/lldpad -t'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000056219c1628ce in dcbx_unregister ()
We are using lldpad-1.0.1-16.git036e314.el8.x86_64 on AlmaLinux release 8.6 (Sky Tiger).
Is this a known bug, and is there a fix? Thanks, Srijit