koppi / mk

LinuxCNC / Machinekit and EtherCAT notes.
GNU Affero General Public License v3.0

Segmentation fault during shutdown #4

Closed koppi closed 4 years ago

koppi commented 9 years ago
May 16 01:01:19 x200 msgd:0: normal shutdown - global segment detached
May 16 01:26:31 x200 msgd:0: startup pid=13071 flavor=posix rtlevel=1 usrlevel=1 halsize=524288 shm=Posix gcc=4.7.2 version=git not installed at configure time
May 16 01:26:31 x200 msgd:0: ØMQ=4.0.4 czmq=2.2.0 protobuf=2.4.1
May 16 01:26:31 x200 msgd:0: configured: sha=git not installed or executable
May 16 01:26:31 x200 msgd:0: built:      May 12 2015 12:42:47 sha=git not installed or executable
May 16 01:26:31 x200 msgd:0: publishing ZMQ/protobuf log messages at ipc:///tmp/0.log.a42c8c6b-4025-4f83-ba28-dad21114744a
May 16 01:26:31 x200 msgd:0: rtapi_app:13077:user accepting commands at ipc:///tmp/0.rtapi.a42c8c6b-4025-4f83-ba28-dad21114744a
May 16 01:26:31 x200 msgd:0: hal_lib:13077:rt creating ladder-state
May 16 01:26:31 x200 msgd:0: hal_lib:13101:user INFO CLASSICLADDER-   No ladder GUI requested-Realtime runs till HAL closes.

May 16 01:28:06 x200 msgd:0: rtapi_app:13077:user signal 11 - 'Segmentation fault' received, dumping core (current dir=/home/koppi/linuxcnc/configs/koppi-cnc)
May 16 01:28:06 x200 msgd:0: rtapi_app:13077:user (backtrace not available - libbacktrace not found during build)
May 16 01:28:06 x200 msgd:0: rtapi_app:13077:user signal 11 - 'Segmentation fault' received, dumping core (current dir=/home/koppi/linuxcnc/configs/koppi-cnc)
May 16 01:28:06 x200 msgd:0: rtapi_app:13077:user (backtrace not available - libbacktrace not found during build)
May 16 01:28:06 x200 msgd:0: rtapi_app exit detected - scheduled shutdown
May 16 01:28:08 x200 msgd:0: msgd shutting down
May 16 01:28:08 x200 msgd:0: log buffer hwm: 0% (4 msgs, 506 bytes out of 524288)
May 16 01:28:08 x200 msgd:0: normal shutdown - global segment detached
sirop commented 8 years ago

@koppi

That's the lcec segfault, isn't it? Good to know I am not the only one who sees this.

So I tried a very simple script:

$ DEBUG=5 realtime start
$ halcmd loadusr -W lcec_conf ./output_1sl.xml
<commandline>:0: Component 'lcec_conf' ready
<commandline>:0: Program 'lcec_conf' started
$ halcmd unloadusr lcec_conf
$ halcmd loadrt  lcec
<commandline>:0: Realtime module 'lcec' loaded
$ halcmd unload lcec
<commandline>:0: Realtime module 'lcec' unloaded
$ DEBUG=5 realtime stop
<commandline>:0: Realtime threads stopped

with gdb output:

(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00007f36720df261 in hal_exit (comp_id=66) at hal/lib/hal_comp.c:312
312             retval = rtapi_shmem_delete(lib_mem_id, comp_id);
(gdb) p lib_mem_id
$1 = 1
(gdb) p comp_id
$2 = 66
(gdb) backtrace
#0  0x00007f36720df261 in hal_exit (comp_id=66) at hal/lib/hal_comp.c:312
#1  0x00007f36720d32df in rtapi_app_exit () at hal/lib/hal_lib.c:209
#2  0x00000000004081fd in do_unload_cmd (name="hal_lib", reply=..., instance=<optimized out>) at rtapi/xenomai/rtapi_app.cc:645
#3  0x00000000004086c0 in exit_actions (instance=<optimized out>) at rtapi/xenomai/rtapi_app.cc:667
#4  0x0000000000409be8 in rtapi_request (loop=<optimized out>, poller=0x24e1ff0, arg=<optimized out>) at rtapi/xenomai/rtapi_app.cc:829
#5  0x00007f36746de23e in zloop_start () from /usr/lib/x86_64-linux-gnu/libczmq.so.3
#6  0x000000000040941d in mainloop (argc=argc@entry=2, argv=argv@entry=0x7fff2c15a0f8) at rtapi/xenomai/rtapi_app.cc:1336
#7  0x00000000004040c6 in main (argc=2, argv=0x7fff2c15a0f8) at rtapi/xenomai/rtapi_app.cc:1693
sirop commented 8 years ago

Maybe @mhaberler will spot the cause of the fault at once?

mhaberler commented 8 years ago

Please provide instructions to reproduce: branch, commit, any other config. Is special hardware needed?

sirop commented 8 years ago

@mhaberler Yes, you need special hardware: one EtherCAT slave. If you need a (generic) XML file for your slave, as linuxcnc-ethercat requires, tell me which slave you have (vendor id, product id) and I will provide one.

Meanwhile I'll try to isolate the bug by writing an example component, since it boils down to several calls of rtapi_shmem_new and rtapi_shmem_delete.

mhaberler commented 8 years ago

It looks like an rtapi_shmem_delete call deleted the wrong segment, namely that of hal_lib, and that causes the crash: code then references HAL data structures which are no longer available.

Check the segment numbers being freed.

sirop commented 8 years ago

@mhaberler I added

int statusCode = rtapi_shmem_delete(shmem_id, comp_id);
rtapi_print_msg(RTAPI_MSG_INFO, LCEC_MSG_PFX "shmem del: status %d, shmem_id %d, comp_id %d\n", statusCode, shmem_id, comp_id);

at https://github.com/sittner/linuxcnc-ethercat/blob/master/src/lcec_main.c#L833 .

/var/log/linuxcnc.log then shows:

Aug 16 08:18:42 debian-master msgd:0: hal_lib:14257:rt LCEC: shmem del: status 0, shmem_id 2, comp_id 81

whereas gdb -p ... yields the same as before:

#1  0x00007fe0eb633261 in hal_exit (comp_id=66) at hal/lib/hal_comp.c:312
312             retval = rtapi_shmem_delete(lib_mem_id, comp_id);
(gdb) p comp_id
$5 = 66
(gdb) print comp_id
$6 = 66
(gdb) print lib_mem_id
$7 = 1

So the segment numbers (shmem_id) are not the same: lcec deletes shmem_id 2, while hal_exit deletes lib_mem_id 1.
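
When more than one component logs its deletions, comparing the numbers by eye gets tedious. The following is only a minimal sketch: it assumes the log lines keep the "shmem del: status S, shmem_id N, comp_id M" format of the debug print shown above, and the function name freed_segments as well as the default hal_lib segment id of 1 (taken from the gdb value of lib_mem_id in this thread) are illustrative choices, not part of Machinekit.

```shell
#!/bin/sh
# Sketch: list the shmem_id/comp_id pairs found in "shmem del:" log lines
# and flag any deletion that targets the hal_lib segment.
# Assumptions: the log format is the one produced by the debug print above;
# freed_segments and its arguments are hypothetical names.
freed_segments() {
    log_file=$1
    hal_lib_id=${2:-1}   # lib_mem_id as printed by gdb in this thread
    # Extract "shmem_id comp_id status" from each matching line.
    sed -n 's/.*shmem del: status \(-\{0,1\}[0-9]*\), shmem_id \([0-9]*\), comp_id \([0-9]*\).*/\2 \3 \1/p' "$log_file" |
    while read -r shmem_id comp_id status; do
        if [ "$shmem_id" -eq "$hal_lib_id" ]; then
            echo "shmem_id=$shmem_id comp_id=$comp_id status=$status <-- hal_lib segment"
        else
            echo "shmem_id=$shmem_id comp_id=$comp_id status=$status"
        fi
    done
}
```

Usage would be something like `freed_segments /var/log/linuxcnc.log`, or `freed_segments /var/log/linuxcnc.log 1` to name the hal_lib segment id explicitly.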

mhaberler commented 8 years ago

Can I reproduce this somehow without an EtherCAT peripheral?

sirop commented 8 years ago

I'll try to put all these rtapi_shmem_new and rtapi_shmem_delete calls into a simple component this evening and then report back.

mhaberler commented 8 years ago

super!

sirop commented 8 years ago

@mhaberler

In order to mimic linuxcnc-ethercat, one needs one non-RT and one RT component. These are https://github.com/sirop/Issue_koppi_mk4/blob/master/shmemTest_USR.c and https://github.com/sirop/Issue_koppi_mk4/blob/master/shmemTest.c .

Compile instructions are within the files.

Hal script: https://github.com/sirop/Issue_koppi_mk4/blob/master/shmem.hal .

The segfault occurs when exiting HAL.

/var/log/linuxcnc.log says:

Aug 16 11:04:07 debian-master msgd:0: rtapi_app:18506:user signal 11 - 'Segmentation fault' received, dumping core (current dir=/home/master/ecat_exper)
Aug 16 11:04:07 debian-master msgd:0: rtapi_app:18506:user  --- rtapi_app backtrace: ---
Aug 16 11:04:07 debian-master msgd:0: rtapi_app:18506:user 7fb70b363261 hal_exit         (hal/lib/hal_comp.c:312)
Aug 16 11:04:07 debian-master msgd:0: rtapi_app:18506:user 7fb70b3572de rtapi_app_exit   (hal/lib/hal_lib.c:209)

That is the same error at the same place, hal/lib/hal_comp.c:312, as before with lcec.

mhaberler commented 8 years ago

super, will check

mhaberler commented 8 years ago

I could reproduce the issue, looking into it

mhaberler commented 8 years ago

@sittner, @sirop:

fix: #include "rtapi_app.h" in shmemTest.c and properly build the component

The build commands in the C files are incorrect and therefore hide the fact that the comp does not even build. Integrating the component into the Submakefiles like other comps shows this:

gcc -c -O0 -g -Wall -funwind-tables -I. -I/usr/include/xenomai -D_GNU_SOURCE -D_REENTRANT -D__XENO__ -DTHREAD_FLAVOR_ID=2 -DRTAPI -D_GNU_SOURCE -D_FORTIFY_SOURCE=0 -DPB_FIELD_32BIT '-DPB_SYSTEM_HEADER=<'machinetalk'/include/pb-linuxcnc.h>' -D__MODULE__ -I. -I./libnml/linklist -I./libnml/cms -I./libnml/rcs -I./libnml/inifile -I./libnml/os_intf -I./libnml/nml -I./libnml/buffer -I./libnml/posemath -I./rtapi -I./hal/lib -I./emc/nml_intf -I./emc/kinematics -I./emc/motion -I./emc/tp -I./machinetalk/nanopb -I./machinetalk/build -DSEQUENTIAL_SUPPORT -DHAL_SUPPORT -DDYNAMIC_PLCSIZE -DRT_SUPPORT -DOLD_TIMERS_MONOS_SUPPORT -DMODBUS_IO_MASTER -mieee-fp  -fPIC hal/components/shmemTest.c -o objects/xenomai/hal/components/shmemTest.o
Linking ../rtlib/xenomai/shmemTest.so
ld -d -r -o objects/xenomai/shmemTest.tmp objects/xenomai/hal/components/shmemTest.o
objcopy -j .rtapi_export -O binary objects/xenomai/shmemTest.tmp objects/xenomai/shmemTest.exported
(echo '{ global : ';  tr -s '\0' <objects/xenomai/shmemTest.exported | xargs -r0 printf '%s;\n' | grep .; echo 'local : * ; };') > objects/xenomai/shmemTest.ver
gcc -shared -Bsymbolic -L/home/mah/machinekit-check/lib -Wl,-rpath,/home/mah/machinekit-check/lib -Wl,--no-as-needed -Wl,--version-script,objects/xenomai/shmemTest.ver -o ../rtlib/xenomai/shmemTest.so objects/xenomai/hal/components/shmemTest.o 
/usr/bin/ld:objects/xenomai/shmemTest.ver:2: syntax error in VERSION script
collect2: error: ld returned 1 exit status
Makefile:1599: recipe for target '../rtlib/xenomai/shmemTest.so' failed
make[1]: *** [../rtlib/xenomai/shmemTest.so] Error 1

test branch: https://github.com/mhaberler/machinekit/tree/koppi-shmem

non-fatal: shm keys are only significant in the lower 24 bits, please see https://github.com/machinekit/machinekit/blob/master/src/rtapi/rtapi_shmkeys.h
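
To illustrate the remark about key width, here is a small sketch of what "significant in the lower 24 bits" implies: two keys that differ only in their top 8 bits end up identical once masked. The mask 0x00ffffff follows directly from the 24-bit statement; the function name effective_key and the sample key values are made up for the example, and what the upper bits are actually used for is defined in rtapi_shmkeys.h, not here.

```shell
#!/bin/sh
# Sketch: reduce a 32-bit shm key to its lower 24 bits.
# effective_key is a hypothetical helper, not a Machinekit command.
effective_key() {
    printf '0x%06x\n' $(( $1 & 0x00ffffff ))
}

# Two arbitrary sample keys that differ only in the upper 8 bits.
effective_key 0x01ABCDEF   # prints 0xabcdef
effective_key 0x02ABCDEF   # prints 0xabcdef
```

Both calls print the same value, so two components picking such keys would collide; choosing keys that already differ in the lower 24 bits avoids that.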

sirop commented 8 years ago

fix: #include "rtapi_app.h" in shmemTest.c

That omission was my fault; lcec itself does include "rtapi_app.h": https://github.com/sittner/linuxcnc-ethercat/blob/master/src/lcec_main.c#L42 .

The build commands in the C files are incorrect and hence do not expose the comp ...

Nevertheless shmemTest.c can be built with the build commands in its C file. What does "do not expose the comp" mean?

mhaberler commented 8 years ago

Please look at the build log above and at what the build does, then compare it to your build command: all you do is compile and create a .so.

You are missing essential steps: objcopy (extraction of the symbols in the .rtapi_export section), creation of a linker script fragment, and the final link.

I think using the out-of-tree building Makefile.modinc should take care of that

Regarding the wording: it should have read "that the comp does not even build".
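
The linker-script step named above can be tried in isolation. This sketch reuses the `tr | xargs | grep` pipeline from the build log; the function name make_version_script, the temporary file names, and the two symbol names fed in are chosen for illustration only. The input file stands for what `objcopy -j .rtapi_export -O binary` writes, assumed here to be a run of NUL-separated symbol names.

```shell
#!/bin/sh
# make_version_script FILE: turn NUL-separated exported symbol names into a
# linker version script, using the pipeline from the build log above.
# (make_version_script is a hypothetical helper name.)
make_version_script() {
    echo '{ global : '
    # "|| true" only keeps the function usable under "set -e" when grep
    # finds nothing; it does not change the generated text.
    tr -s '\0' < "$1" | xargs -r0 printf '%s;\n' | grep . || true
    echo 'local : * ; };'
}

work_dir=$(mktemp -d)

# A section dump holding two exported names (names chosen for illustration).
printf 'rtapi_app_main\0rtapi_app_exit\0' > "$work_dir/demo.exported"
make_version_script "$work_dir/demo.exported"

# An empty dump, as one would get if nothing landed in .rtapi_export.
: > "$work_dir/empty.exported"
make_version_script "$work_dir/empty.exported"

rm -rf "$work_dir"
```

In the empty case the generated script has `local : * ; };` as its second line, directly after `{ global : `, with no symbols in between. That would be consistent with the "shmemTest.ver:2: syntax error in VERSION script" message in the log, although the thread does not confirm that an empty export section was the exact mechanism.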

sirop commented 6 years ago

Seems to be solved through https://github.com/sittner/linuxcnc-ethercat/issues/49 and https://github.com/sittner/linuxcnc-ethercat/pull/56 ?