tango-controls / cppTango

Moved to gitlab
http://tango-controls.org

5 seconds timeout when starting a device server in a subprocess #299

Closed tango-controls-bot closed 7 years ago

tango-controls-bot commented 7 years ago

While I was trying to write some unit tests for pytango, I noticed an intriguing timeout issue in the C++ API: it takes a long time to start a device server in a subprocess after some unrelated tango calls in the parent process. More precisely, the Util object instantiation takes 5 seconds.

Here's the code to reproduce this issue:

import os
import tango
from tango.server import Device, DeviceMeta

class Test(Device):
    __metaclass__ = DeviceMeta

# Unrelated tango call
db = tango.Database()
# Run the server in a subprocess (using fork)
os.wait() if os.fork() else Test.run_server()

I'm fairly confident the equivalent C++ code will produce the same issue.

Reported by: vxgmichel

Original Ticket: tango-cs/bugs/819

tango-controls-bot commented 7 years ago

Dear Vincent,

Thanks for the bug report. Forking a multithreaded program is generally a bad idea, as explained at this link: http://www.linuxprogrammingblog.com/threads-and-fork-think-twice-before-using-them. If you use fork, you should rather start your device server using one of the functions from the exec family. This is what the Starter device server does, for instance (Starter actually does a double fork + execxxx).

In the example you provided, creating the Database object initializes CORBA (as a client) and connects to the database CORBA object. This creates some CORBA threads. omniORB also creates a special thread, called the Scavenger thread, which scans the CORBA threads for idle connections (every 5 s by default, as defined by the ORBscanGranularity parameter) and automatically kills the idle connection threads after a while (180 s or 120 s by default).

When the device server is started, Tango tries to destroy the previously created ORB, because it was created for a CORBA client only and a device server needs an ORB for a CORBA server. The 5-second timeout you are observing happens while Tango destroys that ORB. omniORB waits for each previously created thread to stop, with a timeout corresponding to the ORBscanGranularity parameter (5 seconds by default). But for the reasons described in the link above, in particular because of some critical sections/mutexes, the child process cannot stop the threads that were created by the parent process, so it has to wait until this timeout expires.

One workaround to remove this 5-second wait is to set the ORBscanGranularity environment variable to 0. This disables the omniORB Scavenger thread and makes the ORB destruction fast, BUT you have to be aware that idle connection threads will no longer be removed if you do that! This might be acceptable in your use case, since this is for unit tests, but please be careful when using it.

Please be aware that passing -ORBscanGranularity 0 as an argin parameter when starting the device server will not change the behaviour, since this parameter is taken into account only after the previous ORB has been destroyed. You would still get the timeout in this case. Using the ORBscanGranularity environment variable should make the 5-second wait disappear.

What you could do as well is set the ORBscanGranularity environment variable to 0 and pass the -ORBscanGranularity argin parameter set to 5 at device server creation. (I am not familiar with PyTango, but I guess there should be a way to pass this kind of parameter at device server creation.) This way, the device server will still run with the Scavenger thread, and idle connection threads will still be automatically removed for your device server.
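To make the combined workaround concrete, here is a minimal Python sketch using only the standard library. The helper names (disable_scavenger, server_args, run_in_fork) are hypothetical and not part of PyTango or omniORB. It assumes that omniORB reads the ORBscanGranularity environment variable when the first ORB is initialized, so the variable has to be set before the first Tango call in the parent process.

```python
import os

SCAN_VAR = "ORBscanGranularity"  # omniORB parameter named in the comment above


def disable_scavenger(env=os.environ):
    """Set ORBscanGranularity=0 in `env` (hypothetical helper).

    Assumption: this runs before the first Tango call in the parent,
    so that the client-side ORB is created without the Scavenger thread.
    """
    env[SCAN_VAR] = "0"
    return env


def server_args(instance, scan_granularity=5):
    """Build arguments that re-enable the Scavenger in the device server."""
    return [instance, "-" + SCAN_VAR, str(scan_granularity)]


def run_in_fork(start_server, args):
    """Fork, call `start_server(args)` in the child, wait in the parent.

    `start_server` is a placeholder for whatever starts the device server.
    """
    pid = os.fork()
    if pid == 0:
        try:
            start_server(args)
        finally:
            os._exit(0)  # never fall back into the parent's code path
    os.waitpid(pid, 0)
```

With the reproduction script from the report, one would presumably call disable_scavenger() before tango.Database(), then run_in_fork(lambda a: Test.run_server(args=a), server_args("test")). That run_server accepts an args list in this form is an assumption to check against the PyTango version in use.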

Hoping this helps a bit. Kind regards

Reynald (with the help of Manu for trying to understand this issue).

Original comment by: bourtemb