machinekit / machinekit-hal

Universal framework for machine control based on Hardware Abstraction Layer principle
https://www.machinekit.io

Segmentation fault during init when loading xhc-hb04 #209

Open fishpepper opened 5 years ago

fishpepper commented 5 years ago

I am trying to add a hb04 wheel to my cnc config. Machinekit is running on a BBB. I get a Segmentation fault when /usr/bin/haltcl tries to start.

I tried to run the sample config:

machinekit -v -V /usr/share/linuxcnc/examples/sample-configs/sim/axis/xhc-hb04/xhc-hb04-layout1.ini
Verbose mode on
halcmd 'print lots of junk' mode on
RUN_IN_PLACE=no
LINUXCNC_DIR=
LINUXCNC_BIN_DIR=/usr/bin
LINUXCNC_TCL_DIR=/usr/lib/tcltk/linuxcnc
LINUXCNC_SCRIPT_DIR=
LINUXCNC_RTLIB_DIR=/usr/lib/linuxcnc
LINUXCNC_CONFIG_DIR=
LINUXCNC_LANG_DIR=/usr/share/linuxcnc/tcl/msgs
INIVAR=/usr/libexec/linuxcnc/inivar
HALCMD=halcmd -V
LINUXCNC_EMCSH=/usr/bin/wish8.6
MACHINEKIT - 0.1
Machine configuration directory is '/usr/share/linuxcnc/examples/sample-configs/sim/axis/xhc-hb04'
Machine configuration file is 'xhc-hb04-layout1.ini'
INIFILE=/usr/share/linuxcnc/examples/sample-configs/sim/axis/xhc-hb04/xhc-hb04-layout1.ini
PARAMETER_FILE=sim-9axis.var
TASK=milltask
HALUI=halui
DISPLAY=axis
Starting Machinekit...
Starting Machinekit server program: linuxcncsvr
Loading Real Time OS, RTAPI, and HAL_LIB modules
rtapi_msgd command: /usr/libexec/linuxcnc/rtapi_msgd --instance=0 --rtmsglevel=1 --usrmsglevel=1 --halsize=524288
rtapi_app command: /usr/libexec/linuxcnc/rtapi_app_xenomai --instance=0
Starting Machinekit IO program: io
io started
halcmd loadusr io started
Starting HAL User Interface program: halui
haltcl -i /usr/share/linuxcnc/examples/sample-configs/sim/axis/xhc-hb04/xhc-hb04-layout1.ini core_sim9.hal
/usr/bin/linuxcnc: line 769: 2599 Segmentation fault haltcl -i "$INIFILE" $CFGFILE
Shutting down and cleaning up Machinekit...
Killing task linuxcncsvr, PID=2529
Removing HAL_LIB, RTAPI, and Real Time OS modules
Removing NML shared memory segments
Cleanup done
Machinekit terminated with an error.
For simple cases more information can be found in the following files:
/home/machinekit/linuxcnc_debug.txt
/home/machinekit/linuxcnc_print.txt

For other cases get more meaningfull information by restarting after
export DEBUG=5

and look at the output of:
/var/log/linuxcnc.log
dmesg

When looking for errors, specifically look for libraries that fail to load by looking for lines with 'insmod failed' as per example below.

insmod failed, returned -1: do_load_cmd: dlopen: nonexistant-component.so: cannot open shared object file: No such file or directory

For getting help, please have a look here: www.machinekit.io/docs/getting-help/

When I trace the tcl calls using gdb I end up here:

gdb tclsh8.6
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from tclsh8.6...(no debugging symbols found)...done.
(gdb) run /usr/bin/haltcl -i /home/machinekit/machinekit/configs/ARM.BeagleBone.Panther/tinyBEE.ini tinyBEE.hal
Starting program: /usr/bin/tclsh8.6 /usr/bin/haltcl -i /home/machinekit/machinekit/configs/ARM.BeagleBone.Panther/tinyBEE.ini tinyBEE.hal
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
[New Thread 0xb6b62460 (LWP 2457)]

Program received signal SIGSEGV, Segmentation fault.
0x00003c88 in ?? ()
(gdb) bt
#0  0x00003c88 in ?? ()
#1  0xb62e1ece in ?? () from /usr/lib/tcltk/linuxcnc/hal.so
#2  0xb62e1f7a in ?? () from /usr/lib/tcltk/linuxcnc/hal.so
#3  0xb62e2114 in Hal_Init () from /usr/lib/tcltk/linuxcnc/hal.so
#4  0xb6f4b19c in ?? () from /usr/lib/arm-linux-gnueabihf/libtcl8.6.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

I can also trigger the segfault by calling haltcl -i /usr/share/linuxcnc/examples/sample-configs/sim/axis/xhc-hb04/xhc-hb04-layout1.ini core_sim9.hal manually from the shell...

Any clues/hints where I could start digging?

fishpepper commented 5 years ago

As my old installation was quite outdated, I set up a fresh Debian Stretch based system. Unfortunately that one crashes with the same error. I cannot run the simulation examples...

luminize commented 5 years ago

You should first get a debug log: run export DEBUG=5, restart, and have a look at /var/log/linuxcnc.log as stated in the error message. Then share that session's output so we can see what's going on. Put it in a gist or pastebin and link to it.
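
A minimal sketch of the log check that the error text recommends, written in Python. It only assumes that failed module loads are marked with the string 'insmod failed', as quoted in the error message; the function names `find_insmod_failures` and `scan_log` are hypothetical helpers, not part of Machinekit.

```python
from typing import Iterable, List

MARKER = "insmod failed"  # marker string quoted in the Machinekit error text


def find_insmod_failures(lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain the 'insmod failed' marker."""
    return [line.rstrip("\n") for line in lines if MARKER in line]


def scan_log(path: str = "/var/log/linuxcnc.log") -> List[str]:
    """Read a log file and return its 'insmod failed' lines (hypothetical helper)."""
    with open(path, errors="replace") as log:
        return find_insmod_failures(log)
```

After a run with DEBUG=5, calling `scan_log()` on the BBB would list any libraries that failed to load. An empty result only means this particular marker is absent from the log.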

fishpepper commented 5 years ago

Ok. These files have been created on my fresh debian stretch based install on a BBB.

version output: https://pastebin.com/Ld0qreS0

console output with debug=5: https://pastebin.com/uKxzQfu6

/var/log/linuxcnc.log http://paste.debian.net/1084174/

ArcEye commented 5 years ago

Can you leave this with us for a bit?

When I first tried running the sim, I got a tclStubsPtr error with machinekit, which I thought I had fixed ages ago. For quite some while the whole xhc-hb04 thing did not work, because of a change in HAL output, but it certainly was working after I made some changes to it.

Switching to machinekit-hal and reinstating some stuff there and getting haltcl built properly, I can get my xhc-hb04 working with the sim.

I need some more time to look at it, and to look into why hal.so has different symbol exports in machinekit and machinekit-hal.

fishpepper commented 5 years ago

OK, nice... I am glad that you could reproduce it. I was about to build machinekit from source and enable debugging symbols. It is still building; I had to add an SD card as storage and swap space...

What is the correct / most up-to-date version to build machinekit from source? Plain machinekit, or the newer (?) version split into machinekit-hal and -cnc?

fishpepper commented 5 years ago

My machinekit compiled successfully... At least I have debug symbols now :) I was wondering, can I call haltcl when no machinekit is running?

If I do so it segfaults as well:

machinekit@beaglebone:/mnt/machinekit$ gdb tclsh8.6 
GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
...
Reading symbols from tclsh8.6...(no debugging symbols found)...done.
(gdb) run /mnt/machinekit/scripts/haltcl
Starting program: /usr/bin/tclsh8.6 /mnt/machinekit/scripts/haltcl
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00003c64 in ?? ()
(gdb) bt
#0  0x00003c64 in ?? ()
#1  0xb6ace3b6 in getuuid () at hal/utils/halsh.c:50
#2  0xb6ace462 in init () at hal/utils/halsh.c:70
#3  0xb6ace5f8 in Hal_Init (interp=0x41c7b8) at hal/utils/halsh.c:116
#4  0xb6f529a6 in ?? () from /usr/lib/arm-linux-gnueabihf/libtcl8.6.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) q

Looks like it crashes in this line: `if ((mkinifile = getenv(mkini)) == NULL)`, where mkini is

(gdb) p mkini
$1 = 0xb6ad51cc "MACHINEKIT_INI"

MACHINEKIT_INI env is set?!

ArcEye commented 5 years ago

> ... I am glad that you could reproduce it.

I didn't reproduce your error; I got another one which I have to get past first :(

> what is the correct / most updated version to build machinekit from source?

Just use the machinekit repo, the other repos are not finalised yet.

ArcEye commented 5 years ago

Now found why a problem I had fixed had occurred again. :smiling_imp:

This commit removed the linkage against libtclstub.a because it produced warnings of possible unused symbols. https://github.com/machinekit/machinekit/commit/d1fed9cc341c1f67b054c466f185c1b430092293#diff-b95f3ce6cb4f8632848c66ea2be6b1f5

What that did was remove the function that several tcl functions rely upon.

Reverting machinekit, so that it has the same linkage in tcl/hal.so that machinekit-hal does, removes the 'undefined reference to tclStubsPtr' error.

When I have done a PR and new packages are available, I will ask you to test. I suspect the error you had may be a consequential one; it certainly does not occur now on amd64.
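
One way to compare the symbol exports of hal.so between the two builds is to look at the output of `nm -D` (binutils). The Python sketch below parses that output and reports whether a given symbol is listed as defined, undefined, or not listed at all. The helper names are hypothetical, and what a correctly linked hal.so shows for tclStubsPtr is not stated in this thread, so the result is only useful as a comparison between builds.

```python
import subprocess
from typing import Dict

# nm type letters that mark a symbol the object references but does not define
UNDEFINED_TYPES = {"U", "w", "v"}


def parse_nm(output: str) -> Dict[str, str]:
    """Map symbol name -> nm type letter from `nm -D` style output."""
    symbols = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2:      # undefined symbols have no address column
            sym_type, name = parts
        elif len(parts) == 3:    # address, type letter, name
            _, sym_type, name = parts
        else:
            continue
        # Drop a symbol-version suffix such as "@GLIBC_2.4" if nm prints one
        symbols[name.split("@")[0]] = sym_type
    return symbols


def symbol_status(output: str, name: str) -> str:
    """Return 'defined', 'undefined' or 'absent' for one symbol name."""
    sym_type = parse_nm(output).get(name)
    if sym_type is None:
        return "absent"
    return "undefined" if sym_type in UNDEFINED_TYPES else "defined"


def dynamic_symbols(path: str) -> str:
    """Run `nm -D` on a shared object and return its text output."""
    result = subprocess.run(["nm", "-D", path], check=True,
                            capture_output=True, text=True)
    return result.stdout
```

For example, `symbol_status(dynamic_symbols("/usr/lib/tcltk/linuxcnc/hal.so"), "tclStubsPtr")` could be run against both the machinekit and the machinekit-hal build to see where they differ.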


fishpepper commented 5 years ago

I already figured that out as well, tclStubsPtr was a good hint. Indeed the linking is missing the tcl stubs lib.

With this fix haltcl does not segfault any more. I can not (yet) test the hb04 stuff, it is still compiling...

diff --git a/src/hal/utils/Submakefile b/src/hal/utils/Submakefile
index 862bc3811..d6931af5b 100644
--- a/src/hal/utils/Submakefile
+++ b/src/hal/utils/Submakefile
@@ -16,7 +16,7 @@ $(call TOOBJSDEPS, hal/utils/halsh.c) : EXTRAFLAGS += $(TCL_CFLAGS)
        ../lib/libmtalk.so.0 \
        ../lib/librtapi_math.so.0
        $(ECHO) Linking $(notdir $@)
-       $(Q)$(CC) $(LDFLAGS) -shared $^  -o $@
+       $(Q)$(CC) $(LDFLAGS) -shared $^ $(TCL_LIBS) -o $@
 TARGETS += ../tcl/hal.so

 $(call TOOBJSDEPS, $(HALCMDCCSRCS)) : EXTRAFLAGS =  \

I would assume this is the root of my problem, haltcl seems to crash. I will test it as soon as the compilation finishes. It takes quite some time as I had to create a swap file on a sd card.

fishpepper commented 5 years ago

It just finished compiling while I wrote this. I cannot physically connect the hb04 at the moment, as I only have remote shell access to the BBB. But it looks like the problem is solved:

pin_number=50 direction=out (pin_name=P9.14 gpio_name=gpio1.18)
pin_number=51 direction=out (pin_name=P9.16 gpio_name=gpio1.19)
pin_number=5 direction=out (pin_name=p9.17 gpio_name=gpio0.5)
pin_number=4 direction=out (pin_name=p9.18 gpio_name=gpio0.4)
xhc-hb04: waiting for XHC-HB04 device
Waiting for component 'xhc-hb04' to become ready.....................

The crash was before I got any xhc-hb04 output at all. I will do a proper test tonight.

ArcEye commented 5 years ago

https://github.com/machinekit/machinekit/pull/1483 refers

Should be available in the repo by the evening

fishpepper commented 5 years ago

It loads and seems to work. I cannot test it properly because my Stretch based image is unbelievably slow: the Machinekit GUI shows up only after ~5 minutes of waiting. I am not sure whether this is caused by running it as RIP from the SD card or has something to do with the kernel (4.4.). I need to investigate that issue further ;)

ArcEye commented 5 years ago

Just tested again with machinekit-hal / machinekit-cnc RIP.

The issue of the monitor function not working was related to the build vars not all being exported and now is fixed in that too.

If you are happy your original issue is resolved, please close this.

zultron commented 5 years ago

;( There goes @ArcEye again fixing my mistakes. Thanks to the both of you.

fishpepper commented 5 years ago

Sorry for taking so long, I was finally able to test the code... I had to send my hb04 back to the seller and wait for the delivery of a new one from China; the first one had a really crappy plastic encoder wheel.

The good news: The segfault is definitely gone (as expected).

The bad news: when running an up-to-date machinekit under Jessie (my current setup), it shuts down during startup with the hb04 enabled. Everything looks good and then it suddenly decides to exit:

Jun 16 19:27:36 beaglebone msgd:0: hal_lib:32087:user halg_signal_new:22 HAL: creating signal 'pendant:jog-spindle2'
Jun 16 19:27:36 beaglebone msgd:0: hal_lib:32087:user halg_link:217 HAL: linking pin 'halui.spindle-override.value' to 'pendant:jog-spindle2'
Jun 16 19:27:36 beaglebone msgd:0: hal_lib:32087:user propagate_barriers_cb:151 HAL: propagating barriers from signal 'pendant:jog-spindle2' to pin 'halui.spindle-override.value': rmb: 0->0  wmb: 0->0
Jun 16 19:27:36 beaglebone msgd:0: hal_lib:32087:user halg_link:217 HAL: linking pin 'xhc-hb04.spindle-override' to 'pendant:jog-spindle2'
Jun 16 19:27:36 beaglebone msgd:0: hal_lib:32087:user propagate_barriers_cb:151 HAL: propagating barriers from signal 'pendant:jog-spindle2' to pin 'halui.spindle-override.value': rmb: 0->0  wmb: 0->0
Jun 16 19:27:36 beaglebone msgd:0: hal_lib:32087:user propagate_barriers_cb:151 HAL: propagating barriers from signal 'pendant:jog-spindle2' to pin 'xhc-hb04.spindle-override': rmb: 0->0  wmb: 0->0
Jun 16 19:27:36 beaglebone msgd:0: hal_lib:32087:user halg_signal_new:22 HAL: creating signal 'pendant:spindle-rps'
Jun 16 19:27:36 beaglebone msgd:0: hal_lib:32087:user halg_link:217 HAL: linking pin 'motion.spindle-speed-cmd-rps' to 'pendant:spindle-rps'
Jun 16 19:27:36 beaglebone msgd:0: hal_lib:32087:user propagate_barriers_cb:151 HAL: propagating barriers from signal 'pendant:spindle-rps' to pin 'motion.spindle-speed-cmd-rps': rmb: 0->0  wmb: 0->0
Jun 16 19:27:36 beaglebone msgd:0: hal_lib:32087:user halg_link:217 HAL: linking pin 'xhc-hb04.spindle-rps' to 'pendant:spindle-rps'
Jun 16 19:27:36 beaglebone msgd:0: hal_lib:32087:user propagate_barriers_cb:151 HAL: propagating barriers from signal 'pendant:spindle-rps' to pin 'motion.spindle-speed-cmd-rps': rmb: 0->0  wmb: 0->0
Jun 16 19:27:36 beaglebone msgd:0: hal_lib:32087:user propagate_barriers_cb:151 HAL: propagating barriers from signal 'pendant:spindle-rps' to pin 'xhc-hb04.spindle-rps': rmb: 0->0  wmb: 0->0
Jun 16 19:27:38 beaglebone rtapi:0: 4:rtapi_app:32046:user pid=32046 flavor=xenomai gcc=4.9.4 git=v0.1~-detached~84d06ae
Jun 16 19:27:38 beaglebone msgd:0: ulapi:32113:user _ulapi_init(): ulapi xenomai v0.1~-detached~84d06ae loaded
Jun 16 19:27:38 beaglebone msgd:0: ulapi:32113:user halg_xinitfv:271 HAL: singleton component 'hal_lib32113' id=1247 initialized
Jun 16 19:27:38 beaglebone msgd:0: hal_lib:32113:user --halcmd stop
Jun 16 19:27:38 beaglebone msgd:0: hal_lib:32113:user hal_stop_threads:360 HAL: threads stopped
Jun 16 19:27:38 beaglebone msgd:0: hal_lib:32113:user halg_exit:293 HAL: removing component 1249 'halcmd32113'
Jun 16 19:27:38 beaglebone msgd:0: hal_lib:32113:user ulapi_hal_lib_cleanup:235 HAL: lib_module_id=1247
Jun 16 19:27:38 beaglebone msgd:0: hal_lib:32113:user halg_exit:293 HAL: removing component 1247 'hal_lib32113'
Jun 16 19:27:38 beaglebone msgd:0: hal_lib:32113:user halg_exit:315 HAL: hal_errorcount()=0
Jun 16 19:27:38 beaglebone msgd:0: hal_lib:32113:user halg_exit:316 HAL: _halerrno=0
Jun 16 19:27:38 beaglebone rtapi:0: 4:rtapi_app:32046:user pid=32046 flavor=xenomai gcc=4.9.4 git=v0.1~-detached~84d06ae

full log: http://s000.tinyupload.com/index.php?file_id=47319323634682417837

When I run it on my stretch based image machinekit starts up as it should... Any hints on that?

ArcEye commented 5 years ago

I don't have anything to test on / with for xenomai and Jessie. Runs fine on sid / Buster / Stretch on amd64

It is strange: no error or warning, just halcmd stop.
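
To see which process issued the stop, it can help to group the log messages by PID. This is a rough Python sketch that assumes every line carries an `origin:pid:context` tag such as `hal_lib:32113:user`, a layout inferred only from the excerpt above and not from any documented format. The function name `group_by_pid` is hypothetical.

```python
import re
from typing import Dict, Iterable, List

# Matches tags like " hal_lib:32113:user " or " 4:rtapi_app:32046:user ".
# The origin must start with a letter so the syslog timestamp is not matched.
_TAG = re.compile(
    r"\s(?:\d+:)?(?P<origin>[A-Za-z_][\w-]*):(?P<pid>\d+):(?P<ctx>\w+)\s(?P<msg>.*)$"
)


def group_by_pid(lines: Iterable[str]) -> Dict[int, List[str]]:
    """Collect the message text of each log line under the PID in its tag."""
    groups: Dict[int, List[str]] = {}
    for line in lines:
        match = _TAG.search(line.rstrip("\n"))
        if match:
            groups.setdefault(int(match.group("pid")), []).append(match.group("msg"))
    return groups
```

Applied to the full log, `group_by_pid(open(path))[32113]` would give only the messages of the process that ran `halcmd stop`, which narrows the search to whatever started that process.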

fishpepper commented 5 years ago

That's really strange... The machinekit binary on Stretch is self-compiled at commit 50e105, and the one on Jessie is from the rcn Debian repo at commit 84d06ae. I will investigate that further.