Open vinseon opened 7 years ago
Are you trying to connect the node to a scheduler with a PAMR URL (like on try/trydev, etc.)? In that case it's normal.
Another possibility is that multi-protocol pnp/pamr is enabled on the server. In that case, even if the node tries to connect with pnp, the server will respond with both protocols.
We specify a PNP URL on the command line, but multi-protocol is indeed enabled on the scheduler side, which seems to be the key to this issue. In any case, the list of enabled communication protocols should rely strictly on the node configuration (there is a dedicated file for that); extra protocols must not be enabled at runtime from the remote RM configuration.
We have the same issue when using PAMR on the node when both protocols are enabled on the server. As Fabien said, the server responds with both protocols, producing a stack trace on the node side. This is not blocking and everything is fine in the end, but seeing a stack trace about a network issue can be disturbing for a user.
Execution command:
/opt/proactive/java/bin/java -jar /opt/proactive/node.jar -Dproactive.net.nolocal=true -Dproactive.communication.protocol=pamr -Dproactive.pamr.router.address=176.167.228.111 -v /* CRED */ -w 3 -r pamr://0 -n ns1155-0 -s ns1155
Oct 11 10:13:56 debian proactive-node.sh[1010]: Node pamr://4097/ns1155-0_1 added.
Oct 11 10:13:58 debian proactive-node.sh[1010]: Adding node ns1155-0_2 to Resource Manager.
Oct 11 10:14:09 debian proactive-node.sh[1010]: [ROAdapter] Skipping default protocol pnp because of received exception
Oct 11 10:14:09 debian proactive-node.sh[1010]: org.objectweb.proactive.core.exceptions.IOException6: Failed to send PNP message to pnp://192.168.1.50:64738/ActiveObject_org.ow2.proactive.resourcemanager.core.RMCore_-45100344-15f0ad8177b--7fff--6dc2643e9ab2967c--45100344-15f0ad8177b--8000
/* Big Stack Trace with some Caused by: java.net.ConnectException: Connection refused: 0.0.0.0/0.0.0.0:64738 */
Oct 11 10:14:21 debian proactive-node.sh[1010]: Node pamr://4097/ns1155-0_2 added.
Oct 11 10:14:21 debian proactive-node.sh[1010]: Connected to the resource manager at pamr://0
Issue description
The issue appears when we activate multiple communication protocols (PNP+PAMR, for instance) on the resource manager, for example:
Then, when we connect a node using PNP (with the right cmdline option such as --rmURL=pnp://proactive-scheduler:64738) while setting PNP protocol only in the node configuration file (config/network/node.ini) like this:
The node automatically enables PAMR protocol during startup, which results in the following error:
Failed to create the PAMR tunnel to localhost/127.0.0.1:33647. PAMR will probably not work.
The node also keeps doing connection attempts for each worker with increasing delay:
From @fviale's reply, it seems that when multi-protocol pnp/pamr is enabled on the resource manager, the server responds with both protocols even if the node tries to connect with pnp. The node then enables the extra protocols, bypassing its own configuration.
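The behaviour requested in this issue (the node only honours protocols it has enabled locally) can be sketched as a simple filter. This is a minimal illustration, not ProActive code: the function name `filter_advertised_urls` and the sample values are hypothetical, and it assumes the resource manager advertises one URL per protocol.

```python
from urllib.parse import urlparse


def filter_advertised_urls(advertised_urls, enabled_protocols):
    """Keep only the URLs whose scheme is enabled in the node's own config.

    Hypothetical helper: it illustrates the expected behaviour, it is not
    how ProActive selects protocols internally.
    """
    enabled = {p.strip().lower() for p in enabled_protocols}
    return [u for u in advertised_urls if urlparse(u).scheme.lower() in enabled]


# Assumed scenario: the RM advertises PNP and PAMR, the node enables PNP only.
advertised = ["pnp://proactive-scheduler:64738", "pamr://0"]
kept = filter_advertised_urls(advertised, ["pnp"])
print(kept)
```

With such a filter the PAMR URL would be dropped on the node side, so no PAMR tunnel would be attempted.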
In addition, when we put an empty/incorrect value for the PAMR router in the node.ini config file like this:
we then get the following warning during startup saying that the PAMR protocol will be disabled:
[ROExposer] Protocol pamr seems invalid for this runtime, this is not a critical error, the protocol will be disabled.
But PAMR stays enabled and the node keeps attempting to connect to a non-existent/default local PAMR router.
Mitigation
By disabling PAMR on node side
A shitty workaround that actually disables PAMR on the node side consists of setting an empty value for the PAMR port, like this:
This will trigger the following exception during startup:
Invalid value, for key proactive.pamr.router.port. Must be a INTEGER
and it actually disables the PAMR protocol, so no more connection attempts are made.
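To see ahead of time whether a node.ini would trip this integer check, the PAMR router settings can be inspected offline. The sketch below assumes node.ini uses plain key=value lines; the parser and the function names are hypothetical and do not reproduce ProActive's own validation, and the sample address is a placeholder.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_pamr_settings(props):
    """Return a list of problems found in the PAMR router settings."""
    problems = []
    if not props.get("proactive.pamr.router.address"):
        problems.append("proactive.pamr.router.address is missing or empty")
    port = props.get("proactive.pamr.router.port", "")
    if not port.isdigit():
        problems.append("proactive.pamr.router.port is not an integer")
    return problems


# Placeholder content: an empty port, as in the workaround described above.
sample = """
proactive.communication.protocol=pnp
proactive.pamr.router.address=pamr-router.example
proactive.pamr.router.port=
"""
print(check_pamr_settings(parse_properties(sample)))
```

An empty list means both the address and the port look well-formed; it says nothing about whether a router is actually reachable at that address.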
By keeping PAMR enabled on node side
As suggested by @fviale, to avoid PAMR errors and unsuccessful connection attempts, the simplest workaround is to properly configure the PAMR address and port in the node.ini config file (even if we don't use it):