ipbus / ipbus-software

Software that implements a reliable high-performance control link for particle physics electronics, based on the IPbus protocol
https://ipbus.web.cern.ch
GNU General Public License v3.0
22 stars 24 forks source link

controlhub.service start operation timed out #219

Open plaurens opened 3 years ago

plaurens commented 3 years ago

Could someone please help us diagnose out problem with our ipbus installation. We had been using IPbus success successfully on this machine for years. After a recent reboot, the server is no longer able to start, seemingly failing to bind to its targeted port.

We have tried many things with no success (Rebooting, un-installing/re-installing ipbus, removing ipv6, and no other processes are interfering with port 10203)

Is there a known problem and workaround, or what can we do to gain more clues on the issue?

[root@hubdev ~]# systemctl status controlhub ● controlhub.service - IPbus ControlHub server daemon (mediates simultanous IPbus-based hardware access from multiple control applications) Loaded: loaded (/usr/lib/systemd/system/controlhub.service; enabled; vendor preset: disabled) Active: failed (Result: timeout) since Tue 2020-10-27 19:51:10 EDT; 5 days ago Docs: http://ipbus.web.cern.ch/ipbus/doc/index.html Process: 1094 ExecStart=/opt/cactus/bin/controlhub_start (code=killed, signal=TERM)

Oct 27 19:49:49 hubdev.pa.msu.edu controlhub[1359]: 2020-10-27 19:49:49.117 [notice] <0.1084.0> - Starting the ControlHub application (version 2.5.2). Oct 27 19:49:49 hubdev.pa.msu.edu controlhub[1359]: 2020-10-27 19:49:49.117 [notice] <0.1085.0> - ControlHub root supervisor starting now. Oct 27 19:49:49 hubdev.pa.msu.edu controlhub[1359]: 2020-10-27 19:49:49.117 [notice] <0.1086.0> - Initialising the stats server. Oct 27 19:49:49 hubdev.pa.msu.edu controlhub[1359]: 2020-10-27 19:49:49.117 [notice] <0.1087.0> - Initialising the device client registry. Oct 27 19:49:49 hubdev.pa.msu.edu controlhub[1359]: 2020-10-27 19:49:49.117 [notice] <0.1088.0> - Initialising the TCP listener (port 10203). Oct 27 19:51:10 hubdev.pa.msu.edu systemd[1]: controlhub.service start operation timed out. Terminating. Oct 27 19:51:10 hubdev.pa.msu.edu controlhub_start[1094]: Starting ControlHub .......................... Oct 27 19:51:10 hubdev.pa.msu.edu systemd[1]: Failed to start IPbus ControlHub server daemon (mediates simultanous IPbus-based hardware access from multiple control...lications). Oct 27 19:51:10 hubdev.pa.msu.edu systemd[1]: Unit controlhub.service entered failed state. Oct 27 19:51:10 hubdev.pa.msu.edu systemd[1]: controlhub.service failed. Hint: Some lines were ellipsized, use -l to show in full.

This is a RedHat 7

[root@hubdev ~]# cat /etc/redhat-release Red Hat Enterprise Linux Workstation release 7.9 (Maipo)

And we are using this version of ipbus

cactuscore-controlhub x86_64 2.5.2-1.centos7 ipbus-sw-updates 17 M cactuscore-extern-pugixml x86_64 1.2-0.centos7 ipbus-sw-base 90 k cactuscore-uhal-grammars x86_64 2.5.0-1.centos7 ipbus-sw-base 1.2 M cactuscore-uhal-gui x86_64 2.3.0-0.centos7.python2.7 ipbus-sw-base 39 k cactuscore-uhal-log x86_64 2.5.0-1.centos7 ipbus-sw-base 293 k cactuscore-uhal-pycohal x86_64 2.5.1-1.centos7.python2.7 ipbus-sw-updates 1.2 M cactuscore-uhal-tests x86_64 2.5.1-1.centos7 ipbus-sw-updates 1.0 M cactuscore-uhal-tools noarch 2.5.1-1.centos7 ipbus-sw-updates 7.9 k cactuscore-uhal-uhal x86_64 2.5.1-1.centos7 ipbus-sw-updates 3.4 M

Thanks, Philipppe

tswilliams commented 3 years ago

Hi,

This is indeed strange. I'm not sure what could be causing the issue, but version 2.5.1 is a few years old now and so to help with debugging can you try updating to the latest IPbus software release, v2.7.8? The instructions can be found here: https://ipbus.web.cern.ch/doc/user/html/software/install/yum.html

Cheers, Tom

plaurens commented 3 years ago

Thank you Tom,

We have updated as you suggested

Updating: cactuscore-controlhub x86_64 2.7.0-1.centos7 ipbus-sw-base 19 M cactuscore-uhal-grammars x86_64 2.7.9-1.centos7.gcc4_8_5 ipbus-sw-updates 79 k cactuscore-uhal-gui x86_64 2.7.9-1.centos7.gcc4_8_5.python2.7 ipbus-sw-updates 39 k cactuscore-uhal-log x86_64 2.7.9-1.centos7.gcc4_8_5 ipbus-sw-updates 43 k cactuscore-uhal-pycohal x86_64 2.7.9-1.centos7.gcc4_8_5.python2.7 ipbus-sw-updates 1.1 M cactuscore-uhal-tests x86_64 2.7.9-1.centos7.gcc4_8_5 ipbus-sw-updates 721 k cactuscore-uhal-tools noarch 2.7.9-1.centos7.gcc4_8_5 ipbus-sw-updates 8.9 k cactuscore-uhal-uhal x86_64 2.7.9-1.centos7.gcc4_8_5 ipbus-sw-updates 520 k Installing for dependencies: pugixml x86_64 1.8-1.el7 epel 94 k wxGTK-gl x86_64 2.8.12-20.el7 epel 34 k

but the symptoms have not changed.

[root@hubdev ~]# systemctl status controlhub ● controlhub.service - IPbus ControlHub server daemon (mediates simultanous IPbus-based hardware access from multiple control applications) Loaded: loaded (/usr/lib/systemd/system/controlhub.service; enabled; vendor preset: disabled) Active: failed (Result: timeout) since Tue 2020-11-10 18:41:05 EST; 3min 7s ago Docs: http://ipbus.web.cern.ch/ipbus/doc/index.html Process: 15582 ExecStart=/opt/cactus/bin/controlhub_start (code=killed, signal=TERM)

Nov 10 18:39:44 hubdev.pa.msu.edu controlhub[15767]: 2020-11-10 18:39:44.448 [notice] <0.1072.0> - Starting the ControlHub application (version 2.6.0). Nov 10 18:39:44 hubdev.pa.msu.edu controlhub[15767]: 2020-11-10 18:39:44.448 [notice] <0.1073.0> - ControlHub root supervisor starting now. Nov 10 18:39:44 hubdev.pa.msu.edu controlhub[15767]: 2020-11-10 18:39:44.448 [notice] <0.1074.0> - Initialising the stats server. Nov 10 18:39:44 hubdev.pa.msu.edu controlhub[15767]: 2020-11-10 18:39:44.448 [notice] <0.1075.0> - Initialising the device client registry. Nov 10 18:39:44 hubdev.pa.msu.edu controlhub[15767]: 2020-11-10 18:39:44.448 [notice] <0.1076.0> - Initialising the TCP listener (port 10203). Nov 10 18:41:05 hubdev.pa.msu.edu systemd[1]: controlhub.service start operation timed out. Terminating. Nov 10 18:41:05 hubdev.pa.msu.edu controlhub_start[15582]: Starting ControlHub ................ Nov 10 18:41:05 hubdev.pa.msu.edu systemd[1]: Failed to start IPbus ControlHub server daemon (mediates simultanous IPbus-based hardware access from multiple control applications). Nov 10 18:41:05 hubdev.pa.msu.edu systemd[1]: Unit controlhub.service entered failed state. Nov 10 18:41:05 hubdev.pa.msu.edu systemd[1]: controlhub.service failed.

Is there some verbose or debut mode to extract more hints about the underlying issue?

Thanks, Philippe

tswilliams commented 3 years ago

Hi,

Could you try running ps aux | grep /opt/cactus, then run /opt/cactus/bin/controlhub_start, then run ps aux | grep /opt/cactus again, and send me the output of those commands? (The controlhub_start script is the script that's called by systemctl behind the scenes.)

Cheers, Tom

plaurens commented 3 years ago

Hi Tom,

Here is the requested info, with confirmation of rpm versions

[root@hubdev ~]# yum list cactus* Loaded plugins: langpacks, product-id, search-disabled-repos, subscription-manager Installed Packages cactuscore-controlhub.x86_64 2.7.0-1.centos7 @ipbus-sw-base
cactuscore-extern-pugixml.x86_64 1.2-0.centos7 @ipbus-sw-base
cactuscore-uhal-grammars.x86_64 2.7.9-1.centos7.gcc4_8_5 @ipbus-sw-updates cactuscore-uhal-gui.x86_64 2.7.9-1.centos7.gcc4_8_5.python2.7 @ipbus-sw-updates cactuscore-uhal-log.x86_64 2.7.9-1.centos7.gcc4_8_5 @ipbus-sw-updates cactuscore-uhal-pycohal.x86_64 2.7.9-1.centos7.gcc4_8_5.python2.7 @ipbus-sw-updates cactuscore-uhal-tests.x86_64 2.7.9-1.centos7.gcc4_8_5 @ipbus-sw-updates cactuscore-uhal-tools.noarch 2.7.9-1.centos7.gcc4_8_5 @ipbus-sw-updates cactuscore-uhal-uhal.x86_64 2.7.9-1.centos7.gcc4_8_5 @ipbus-sw-updates Available Packages cactuscore-uhal-grammars-debuginfo.x86_64 2.7.9-1.centos7.gcc4_8_5 ipbus-sw-updates cactuscore-uhal-log-debuginfo.x86_64 2.7.9-1.centos7.gcc4_8_5 ipbus-sw-updates cactuscore-uhal-tests-debuginfo.x86_64 2.7.9-1.centos7.gcc4_8_5 ipbus-sw-updates cactuscore-uhal-uhal-debuginfo.x86_64 2.7.9-1.centos7.gcc4_8_5 ipbus-sw-updates

[root@hubdev ~]# ps aux | grep /opt/cactus | grep -v grep hubuser 6090 0.0 0.0 72508 784 ? Ss Oct29 0:06 /usr/bin/ssh-agent /bin/bash -c exec -l "/bin/bash" -c "/usr/bin/env LD_LIBRARY_PATH=/opt/cactus/lib:/opt/cactus/lib: mate-session"

[root@hubdev ~]# /opt/cactus/bin/controlhub_start Using config file: /etc/controlhub/default/sys.config Starting ControlHub ................................FAILED!

[root@hubdev ~]# ps aux | grep /opt/cactus | grep -v grep hubuser 6090 0.0 0.0 72508 784 ? Ss Oct29 0:06 /usr/bin/ssh-agent /bin/bash -c exec -l "/bin/bash" -c "/usr/bin/env LD_LIBRARY_PATH=/opt/cactus/lib:/opt/cactus/lib: mate-session" root 6932 0.0 0.0 35200 448 ? S 10:27 0:00 /opt/cactus/lib/controlhub/erts-5.10.4/bin/epmd -daemon root 7055 0.2 0.1 768360 72592 ? Sl 10:27 0:03 /opt/cactus/lib/controlhub/erts-5.10.4/bin/beam.smp -zdbbl 2097151 -sbt db -scl false -sbwt very_long -swt low -- -root /opt/cactus/lib/controlhub -progname controlhub -- -home /root -- -boot /opt/cactus/lib/controlhub/releases/2.6.0/controlhub -mode embedded -config /etc/controlhub/default/sys.config -controlhub config_file "/etc/controlhub/default/sys.config" -sname controlhub -setcookie ch_cookie -- console -noshell -noinput ''

Thanks,

Philippe

tswilliams commented 3 years ago

Hi,

From the output of the ps aux command, it appears that the ControlHub is actually running - it may just be that the startup script used to 'ping' the ControlHub is malfunctioning in your case (this script communicates with the ControlHub through a different route than that used by uHAL). Can you try communicating with hardware via the ControlHub (despite the 'startup' error message) and let me know if you encounter problems?

Cheers, Tom

plaurens commented 3 years ago

Thank you Tom,

Before trying an ipbus cycle, I re-checked with 'ps' and the controlhub process was NOT present.

We rebooted on Friday 13-Nov-2020 and systemd had tried to restart the process, with a failed status. So, if the service process had managed to start (4 days ago) that process had died since then.

The service was started manually again today (via controlhub_start) and again reported a failure. But the controlhub process was (again) present afterward. And a test ipbus read was indeed successful. It has now been 7h and the service is still running.

So I tried comparing starting via 'systemctl' versus via 'controlhub_start' to see if there is a different outcome. And it seems clear that systemctl does not leave any cactus process running after deciding the start has failed.
I am attaching a succession of ps outputs while systemctl is trying to start the service. 20201117_systemctl_start_controlhub.log

We now have a workaround, thanks.

If you would like to try and understand where this mystery comes from don't hesitate to suggest some other test you'd like us to run.

Thanks, Philippe

tswilliams commented 3 years ago

Hi,

I'm glad to hear that this workaround works for you. After some more debugging, the only thing that I can think of that could be causing this issue is a firewall (or similar) blocking the TCP connection to the epmd service (port 4369) which is used for communication with the ControlHub as part of the 'status checks' performed by the controlhub_start script. Were there any firewall changes on your machine that could block traffic to that port (over the loopback interface)? Unfortunately as this problem hasn't been reported elsewhere I haven't been able to think of any other possible causes.

Cheers, Tom