tango-controls / cppTango

Moved to gitlab
http://tango-controls.org

Tango client hangs on ping #106

Closed tango-controls-bot closed 8 years ago

tango-controls-bot commented 14 years ago

Hello,

There is a situation where a Tango client application hangs. You can reproduce this with either a C++ or a Python client. To reproduce it:

  1. start any device server (preferably on Linux)

  2. do a Ctrl+Z to suspend the DS (or kill -stop <pid>)

  3. write a Tango client something like this one:

         import PyTango
         d = PyTango.DeviceProxy(<name>)
         d.ping()

The ping call will never return.
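Step 2 can also be scripted. The following is only a sketch for Unix-like systems: the helper names suspend, resume and wait_until_stopped are made up for illustration, and wait_until_stopped only works when the device server was started as a direct child of the calling process.

```python
import os
import signal


def suspend(pid):
    """Send SIGSTOP, the same signal as `kill -stop <pid>`."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    """Send SIGCONT so that a stopped process runs again."""
    os.kill(pid, signal.SIGCONT)


def wait_until_stopped(pid):
    """Block until the child `pid` changes state; True if it is now stopped.

    os.waitpid() can only observe direct children of the calling process.
    """
    _, status = os.waitpid(pid, os.WUNTRACED)
    return os.WIFSTOPPED(status)
```

For a device server started from another shell, only suspend and resume apply; the stopped state then has to be checked by other means (for example with ps).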

Reported by: tiagocoutinho

Original Ticket: tango-cs/bugs/342

tango-controls-bot commented 14 years ago

Hi Tiago,

We have to be precise here. When you type CTRL+Z in the window where the DS is running, you send the signal SIGTSTP (number 20 on my Ubuntu) to the process. When you use the kill -stop command, you send the signal SIGSTOP (number 19 on my Ubuntu) to the process. In both cases, the default action is to stop the process. The process resumes when it receives the SIGCONT signal (18 on my Ubuntu). SIGSTOP is like SIGKILL: it cannot be caught, blocked or ignored, so there is nothing we can do for this one. Between the signals SIGSTOP and SIGCONT the process is stopped and obviously, if you try to access it from a client, you will get a timeout. We can ask the OS (at least a Unix-like OS) to ignore the signals SIGTSTP and SIGCONT, but is that really wanted? If we do this, there will be no way to put a process in the background, by typing CTRL+Z and then bg, once you have started it in the foreground in a shell window.
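The difference between the two signals can be checked from Python. This is a sketch for Unix-like systems only; can_be_ignored is a hypothetical helper, and it must be called from the main thread because signal.signal() is only allowed there.

```python
import signal


def can_be_ignored(sig):
    """Return True if the OS lets this process ignore `sig`."""
    try:
        previous = signal.signal(sig, signal.SIG_IGN)
    except OSError:
        # The kernel refuses to change the disposition (SIGSTOP, SIGKILL).
        return False
    # Put the earlier disposition back; None means it was not set from Python.
    signal.signal(sig, signal.SIG_DFL if previous is None else previous)
    return True
```

On Linux, can_be_ignored(signal.SIGTSTP) should be True, while the same call with signal.SIGSTOP or signal.SIGKILL should be False.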

Waiting for comments

Cheers

Manu

Original comment by: taurel

tango-controls-bot commented 14 years ago

Hi Manu,

The detailed actions are:

  1. start device server TgTest (code in attachment)
  2. do a Ctrl+Z (SIGTSTP)
  3. start the client test_ping (code in attachment)

The client will never return from the ping call.

You said that: "Between the signal SIGSTOP and SIGCONT, the process is stopped and obviously, if you try to access it from a client, you will have a timeout."

The problem is I DON'T have the timeout exception but instead the client waits forever. I also tried the same example with a DeviceProxy::state() and I get the same result.

I don't want to ignore SIGTSTP in the device server. The situation we have here at ALBA is: some DS "hang" for some reason (they have bugs!). I wanted to create a Tango diagnostic tool to check which DS on the machine are hung. I am using Ctrl-Z to simulate a "hang" in the DS. I was hoping that pinging a device in a DS which is "hung" would trigger the Tango timeout exception, so that I could report the DS as "hung". But since the ping() call never returns, the diagnostic tool gets stuck on the first device which is "hung".
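One way to keep such a diagnostic tool from getting stuck, whatever the cause of the blocking call, is to run each ping in a child interpreter with a hard deadline. The following is only a sketch: run_with_deadline, classify_device and PING_SNIPPET are hypothetical names, and the only PyTango calls used are the DeviceProxy constructor and ping() shown in this report.

```python
import subprocess
import sys

# Source executed in the child interpreter; the device name is passed as argv[1].
PING_SNIPPET = "import sys, PyTango; PyTango.DeviceProxy(sys.argv[1]).ping()"


def run_with_deadline(code, args=(), deadline_s=5.0):
    """Run Python source `code` in a child process.

    Returns "ok" (exit status 0), "error" (non-zero exit status, for example
    an uncaught exception) or "hung" (the child was killed at the deadline).
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code, *args],
            timeout=deadline_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return "hung"
    return "ok" if result.returncode == 0 else "error"


def classify_device(name, deadline_s=10.0):
    """Ping one device without ever blocking the caller past the deadline."""
    return run_with_deadline(PING_SNIPPET, (name,), deadline_s)
```

The deadline should be longer than the Tango client timeout (3000 ms in the output further down), so that a normal timeout exception is reported as "error" and only a ping that never returns is reported as "hung". Starting one interpreter per device is slow, but a blocked call cannot take the tool down with it.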

However if the sequence of actions is:

  1. start device server TgTest (code in attachment) in one console
  2. start a python console and do: d = PyTango.DeviceProxy(<name>)
  3. do a Ctrl+Z (SIGTSTP) on the DS console
  4. execute in the python console: d.ping()

         DevError[
         desc = TRANSIENT CORBA system exception: TRANSIENT_CallTimedout
         origin = DeviceProxy:ping
         reason = API_CorbaException
         severity = ERR]
         DevError[
         desc = Timeout (3000 mS) exceeded on device tgtest/lothar/1
         origin = DeviceProxy:ping
         reason = API_DeviceTimedOut
         severity = ERR]]

In this case the ping returns with an exception. So, it appears that the problem happens only if the DeviceProxy object is created AFTER the DS is suspended.

Original comment by: tiagocoutinho

tango-controls-bot commented 14 years ago

Hi Tiago,

I am able to reproduce this behavior in the following case:

1 - Client and server on the same host
2 - Server suspended by a CTRL+Z
3 - DeviceProxy creation followed by a ping() call

but only if the client (point 3) is started more or less one minute after the device server has been suspended.

There is a work-around for this problem. Define the environment variable ORBclientTransportRule="* tcp" for the client process. For me, it solves the problem. The drawback is that the unix socket transport will never be used (lower performance for connections on the same host).
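When the client is launched from a script, the variable can be set for that process only. A minimal sketch, where tcp_only_env and run_client are hypothetical helper names and script_path stands for whatever client program is being started:

```python
import os
import subprocess
import sys


def tcp_only_env(base=None):
    """Copy an environment and add the omniORB rule that forces TCP."""
    env = dict(os.environ if base is None else base)
    env["ORBclientTransportRule"] = "* tcp"
    return env


def run_client(script_path):
    """Start a Python client with the work-around applied to it alone."""
    return subprocess.run([sys.executable, script_path], env=tcp_only_env())
```

Passing the environment to the child leaves other processes on the host free to keep using the unix socket transport.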

There is an even better solution: this behavior is due to an omniORB bug in connection establishment for the unix socket transport. Duncan already sent me a patch for this bug. Apply this patch, rebuild omniORB and it should solve the problem. It did on my Ubuntu. The patch file is attached to this bug report.

Cheers

Manu

Original comment by: taurel

tango-controls-bot commented 14 years ago

omniORB patch

Original comment by: taurel

tango-controls-bot commented 14 years ago

Hi Manu,

I checked and I also need about 1 minute after the DS is suspended to see the problem. Thank you very much for the proposed fixes. I will apply the omniORB patch. Since the bug is not in Tango but in omniORB, I am marking the bug as closed with resolution None. (I don't want to mark it as deleted, so that we don't lose the patches in attachment.)

Original comment by: tiagocoutinho