Open nxtruong opened 9 years ago
Another issue I am facing, seemingly related to overloading the nameserver. My system has hundreds of ports. At some point, a request to connect two ports always fails even though both ports exist. The error message is the same: "No connection to nameserver" (though the nameserver is still running fine). Below is the verbose output from Yarp for that connection error, in case it helps; it shows the failed connection to the nameserver. The error always happens on the same port, after hundreds of other connections have been established successfully.
yarp(74598300): Child thread initializing
yarp(12c6d000): Thread starting up
yarp(74598300): Child thread initialized ok
yarp: Port /powernet/_smn_/user47_1 active at tcp://128.178.5.100:10535
yarp(74598300): query name /powernet/user47_1/_gc_
yarp(74598300): Configuration file: /Users/truong/Library/Application Support/yarp/config/yarp_namespace.conf
yarp(74598300): sending to nameserver: NAME_SERVER query /powernet/user47_1/_gc_
yarp(74598300): connecting to /128.178.5.139:10000
yarp(74598300): <<< received from nameserver: registration name /powernet/user47_1/_gc_ ip 128.178.5.100 port 10371 type tcp
*** end of message
yarp(74598300): sending to nameserver: NAME_SERVER query /powernet/user47_1/_gc_
yarp(74598300): connecting to /128.178.5.139:10000
yarp(74598300): <<< received from nameserver: registration name /powernet/user47_1/_gc_ ip 128.178.5.100 port 10371 type tcp
*** end of message
yarp(74598300): working on connection /powernet/_smn_/user47_1 to /powernet/user47_1/_gc_ (connect)
yarp(74598300): query name /powernet/_smn_/user47_1
yarp(74598300): Configuration file: /Users/truong/Library/Application Support/yarp/config/yarp_namespace.conf
yarp(74598300): sending to nameserver: NAME_SERVER query /powernet/_smn_/user47_1
yarp(74598300): connecting to /128.178.5.139:10000
yarp(74598300): <<< received from nameserver: registration name /powernet/_smn_/user47_1 ip 128.178.5.100 port 10535 type tcp
*** end of message
yarp(74598300): query name /powernet/user47_1/_gc_
yarp(74598300): Configuration file: /Users/truong/Library/Application Support/yarp/config/yarp_namespace.conf
yarp(74598300): sending to nameserver: NAME_SERVER query /powernet/user47_1/_gc_
yarp(74598300): connecting to /128.178.5.139:10000
yarp(74598300): <<< received from nameserver: registration name /powernet/user47_1/_gc_ ip 128.178.5.100 port 10371 type tcp
*** end of message
yarp(74598300): ** asking tcp://powernet/_smn_/user47_1: [add] "tcp://powernet/user47_1/_gc_"
yarp(12c6d000): /powernet/_smn_/user47_1: PortCore received something
yarp(12c6d000): new input connection to /powernet/_smn_/user47_1 starting
yarp(12c6d000): Child thread initializing
yarp(12cf0000): Thread starting up
yarp(12c6d000): Child thread initialized ok
yarp(12c6d000): new input connection to /powernet/_smn_/user47_1 started ok
yarp(12c6d000): /powernet/_smn_/user47_1: PortCore spun off a connection
yarp(12c6d000): /powernet/_smn_/user47_1: / routine check of connections to this port begins
yarp(12c6d000): /powernet/_smn_/user47_1: | checking connection ->->
yarp(12c6d000): /powernet/_smn_/user47_1: \ routine check of connections to this port ends
yarp(12cf0000): TextCarrier::expectSenderSpecifier
yarp(12cf0000): Receiving input from admin to /powernet/_smn_/user47_1 using text_ack
yarp(12cf0000): /powernet/_smn_/user47_1: asked to add output to tcp://powernet/user47_1/_gc_
yarp(12cf0000): query name /powernet/user47_1/_gc_
yarp(12cf0000): Configuration file: /Users/truong/Library/Application Support/yarp/config/yarp_namespace.conf
yarp(12cf0000): sending to nameserver: NAME_SERVER query /powernet/user47_1/_gc_
yarp(12cf0000): connecting to /128.178.5.139:10000
yarp(12cf0000): TCP connection to tcp://128.178.5.139:10000 failed to open
yarp: No connection to nameserver
yarp: *** try running: yarp detect ***
yarp(12cf0000): bad socket read
yarp(12cf0000): PortCoreInputUnit closing ip
yarp(74598300): sending to nameserver: NAME_SERVER announce /powernet/user47_1/_gc_ 0
yarp(74598300): connecting to /128.178.5.139:10000
yarp(12cf0000): PortCoreInputUnit closed ip
yarp(12cf0000): PortCoreInputUnit (unrooted) shutting down
yarp(12cf0000): Thread shutting down
yarp(74598300): <<< received from nameserver: [ok]
yarp(74598300): working on connection /powernet/_smn_/user47_1 to /powernet/user47_1/_gc_ (connect)
yarp(74598300): query name /powernet/_smn_/user47_1
yarp(74598300): Configuration file: /Users/truong/Library/Application Support/yarp/config/yarp_namespace.conf
yarp(74598300): sending to nameserver: NAME_SERVER query /powernet/_smn_/user47_1
yarp(74598300): connecting to /128.178.5.139:10000
yarp(74598300): <<< received from nameserver: registration name /powernet/_smn_/user47_1 ip 128.178.5.100 port 10535 type tcp
*** end of message
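In the log above, every nameserver query issued from thread 74598300 gets an answer, while the single query issued from thread 12cf0000 (the thread handling the new input connection) is the one that fails to open. For longer logs, a small script can tally this per thread. The following is a minimal sketch in Python, assuming only the line format visible above (`yarp(<thread id>): <message>`); the function name is made up for illustration and is not part of Yarp.

```python
import re
from collections import Counter

# Matches verbose lines such as "yarp(12cf0000): connecting to /128.178.5.139:10000".
LINE_RE = re.compile(r"^yarp\((?P<thread>[0-9a-f]+)\): (?P<msg>.*)$")


def summarize_nameserver_traffic(lines):
    """Count connection attempts and failed TCP opens per thread id."""
    attempts = Counter()
    failures = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            # Lines without a thread id, e.g. "*** end of message".
            continue
        thread, msg = match.group("thread"), match.group("msg")
        if msg.startswith("connecting to "):
            attempts[thread] += 1
        elif msg.startswith("TCP connection to ") and msg.endswith("failed to open"):
            failures[thread] += 1
    return attempts, failures
```

Feeding it the lines of a saved log (for example `summarize_nameserver_traffic(open("yarp.log"))`, where `yarp.log` is a hypothetical file name) returns two counters keyed by thread id, which makes it easy to see whether failures cluster on one thread or one point in time.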
UPDATE on the second comment: I'm quite certain this has something to do with the Yarp nameserver's ability to handle connections. I tried the same code on 3 different machines (all Macs, however: a MacBook Pro, a Mac Pro desktop, and a Mac mini), and the same connection (between the same pair of ports) failed every time. However, when I reduced the number of ports (by 3) and connections (by just 2), it worked perfectly.
In another experiment, I changed the order of the connection requests, and the last one always failed. So I think this has to do with the number of connections. Another explanation could be a problem with Mac OS, e.g. a limit on the number of socket connections imposed by the OS. But this is unlikely; a few hundred connections should not be too much for a modern OS.
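The OS-limit hypothesis can be checked directly rather than ruled out by intuition. Each open TCP socket uses one file descriptor, and Unix-like systems cap the number of descriptors per process; the default soft cap can be fairly low. The following is a minimal sketch in Python using the standard `resource` module (Unix only). It reports the limits of the Python process itself, which are inherited from the shell that launched it, so it only reflects what a Yarp process started from the same shell would inherit; it says nothing about Yarp's own behaviour.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe_limit(value):
    """Render a limit as text, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


soft, hard = open_file_limits()
print(f"soft limit: {describe_limit(soft)}, hard limit: {describe_limit(hard)}")
```

If the soft limit turns out to be close to the number of ports and connections held by a single process (or by the nameserver), that would make the OS explanation more plausible than it first appears.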
I am using Yarp stable 2.3.64.
Recently, while developing a rather large-scale system with over 400 ports (almost 100 processes, each with between 2 and dozens of Yarp ports), I ran into an issue with removing ports when the programs exit. When a process ends, all the ports opened by this process are closed and unregistered from the nameserver. However, in my large-scale system, many processes fail to remove all of their ports when they end (some ports are removed but others are not). Specifically, the error I got from a process looked like this:
As you can see, this process opened 2 ports (/powernet/user13_1/gc and /powernet/user13_1/y) but only the first could be removed. The second couldn't be removed because of "no connection to nameserver" (although I checked carefully and the connection seemed fine, no problem with Yarp configuration). This problem didn't happen with smaller scale systems I tried. The problem also seems to happen randomly, in the sense that the set of ports which couldn't be removed changed every time I tried.
UPDATE: I think the problem was that when all the processes exit at almost the same time, too many requests to close ports/connections are sent to the nameserver within a very short time, overloading it (like a DoS attack). I made a temporary fix by: a) turning the Yarp verbosity level to -1; b) inserting a random delay before exit in each process, so that they don't all request to close ports at the same time. This reduces the load on the nameserver. The fix seems to work fine so far.
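The random-delay part of the workaround can be sketched as follows. This is a minimal illustration in Python, not the code used in the actual system: `close_ports` is a hypothetical callable standing in for whatever closes a process's ports, and the 2-second default is an arbitrary assumption.

```python
import random
import time


def pick_exit_delay(max_delay_s, rng=random):
    """Pick a uniformly random delay in [0, max_delay_s] seconds."""
    if max_delay_s < 0:
        raise ValueError("max_delay_s must be non-negative")
    return rng.uniform(0.0, max_delay_s)


def shutdown_with_jitter(close_ports, max_delay_s=2.0, sleep=time.sleep, rng=random):
    """Wait a random time, then run the caller-supplied port cleanup.

    close_ports is a hypothetical callable (not a Yarp API) that closes
    this process's ports. Returns the delay that was applied.
    """
    delay = pick_exit_delay(max_delay_s, rng)
    sleep(delay)  # spread the exits of many processes over time
    close_ports()
    return delay
```

Each process would call `shutdown_with_jitter(...)` once on its exit path. With roughly 100 processes, a larger `max_delay_s` spreads the unregistration requests more thinly, at the cost of a slower overall shutdown.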