ow2-proactive / scheduling

Multi-platform Scheduling and Workflows Engine
http://www.activeeon.com/workflows-scheduling
GNU Affero General Public License v3.0

Unable to disable PAMR protocol on ProActive nodes. #2904

Open vinseon opened 7 years ago

vinseon commented 7 years ago

Issue description

The issue appears when we activate multiple communication protocols (PNP+PAMR, for example) on the resource manager:

proactive.communication.protocol=pnp
proactive.communication.additional_protocols=pamr

We then connect a node using PNP (with the appropriate command-line option, such as --rmURL=pnp://proactive-scheduler:64738), with only the PNP protocol set in the node configuration file (config/network/node.ini):

proactive.communication.protocol=pnp
proactive.communication.additional_protocols=

The node nevertheless enables the PAMR protocol automatically during startup, which results in the following error:

Failed to create the PAMR tunnel to localhost/127.0.0.1:33647. PAMR will probably not work.

The node also keeps retrying the connection for each worker, with an increasing delay:

PAMR Router localhost/127.0.0.1:33647 is unreachable (Connection refused (Connection refused)). Will try to estalish a new tunnel in 2 seconds
PAMR Router localhost/127.0.0.1:33647 is unreachable (Connection refused (Connection refused)). Will try to estalish a new tunnel in 2 seconds
PAMR Router localhost/127.0.0.1:33647 is unreachable (Connection refused (Connection refused)). Will try to estalish a new tunnel in 4 seconds
PAMR Router localhost/127.0.0.1:33647 is unreachable (Connection refused (Connection refused)). Will try to estalish a new tunnel in 4 seconds
PAMR Router localhost/127.0.0.1:33647 is unreachable (Connection refused (Connection refused)). Will try to estalish a new tunnel in 8 seconds
PAMR Router localhost/127.0.0.1:33647 is unreachable (Connection refused (Connection refused)). Will try to estalish a new tunnel in 8 seconds
PAMR Router localhost/127.0.0.1:33647 is unreachable (Connection refused (Connection refused)). Will try to estalish a new tunnel in 8 seconds
PAMR Router localhost/127.0.0.1:33647 is unreachable (Connection refused (Connection refused)). Will try to estalish a new tunnel in 16 seconds
PAMR Router localhost/127.0.0.1:33647 is unreachable (Connection refused (Connection refused)). Will try to estalish a new tunnel in 16 seconds
PAMR Router localhost/127.0.0.1:33647 is unreachable (Connection refused (Connection refused)). Will try to estalish a new tunnel in 16 seconds
PAMR Router localhost/127.0.0.1:33647 is unreachable (Connection refused (Connection refused)). Will try to estalish a new tunnel in 32 seconds
PAMR Router localhost/127.0.0.1:33647 is unreachable (Connection refused (Connection refused)). Will try to estalish a new tunnel in 32 seconds
PAMR Router localhost/127.0.0.1:33647 is unreachable (Connection refused (Connection refused)). Will try to estalish a new tunnel in 32 seconds
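
The delays in the log above double after each round of failed attempts (2, 4, 8, 16, 32 seconds); the repeated lines presumably come from several workers retrying in parallel. As an illustration only, such a schedule can be sketched as follows. The function name `backoff_delays`, its parameters, and the optional cap are assumptions made for this sketch and are not taken from the ProActive code base; whether the real implementation caps the delay is not shown in the log.

```python
from itertools import islice
from typing import Iterator, Optional


def backoff_delays(initial: int = 2, factor: int = 2,
                   cap: Optional[int] = None) -> Iterator[int]:
    """Yield retry delays in seconds: initial, initial*factor, and so on.

    Hypothetical helper: it only mimics the doubling seen in the log.
    """
    delay = initial
    while True:
        yield delay
        delay *= factor
        if cap is not None:
            # Optional upper bound (an assumption, not observed in the log).
            delay = min(delay, cap)


# First five delays of the sketched schedule.
first_five = list(islice(backoff_delays(), 5))
print(first_five)  # [2, 4, 8, 16, 32]
```

With no PAMR router listening, a schedule like this never stops on its own, which matches the endless retries reported here.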

According to @fviale's reply, it seems that when multi-protocol pnp/pamr is enabled on the resource manager, the server responds with both protocols even if the node tries to connect with pnp. The node then enables the extra protocols, bypassing its own configuration.

In addition, when we set an empty or incorrect value for the PAMR router in the node.ini config file, like this:

proactive.pamr.router.address=

we then get the following warning during startup saying that the PAMR protocol will be disabled:

[ROExposer] Protocol pamr seems invalid for this runtime, this is not a critical error, the protocol will be disabled.

But PAMR stays enabled and the node keeps attempting to connect to a nonexistent (default, local) PAMR router.

Mitigation

By disabling PAMR on node side

A crude workaround that actually disables PAMR on the node side is to set an empty value for the PAMR port, like this:

proactive.pamr.router.port=

This will trigger the following exception during startup:

Invalid value, for key proactive.pamr.router.port. Must be a INTEGER

and it actually disables the PAMR protocol, so no more connection attempts occur.

By keeping PAMR enabled on node side

As suggested by @fviale, to avoid PAMR errors and unsuccessful connection attempts, the simplest workaround is to properly configure the PAMR address and port in the node.ini config file (even if we don't use PAMR):

proactive.pamr.router.address=proactive-scheduler
proactive.pamr.router.port=33647

fviale commented 7 years ago

Are you trying to connect the node to a scheduler with a pamr URL (like on try/trydev, etc.)? In that case, it's normal.

fviale commented 7 years ago

Another possibility is that multi-protocol pnp/pamr is enabled on the server. In that case, even if the node tries to connect with pnp, the server will respond with both protocols.

vinseon commented 7 years ago

We specify a PNP URL on the command line, but multi-protocol is indeed enabled on the scheduler side, which seems to be the key to this issue. In any case, the list of enabled communication protocols should rely strictly on the node configuration (as there is a dedicated file for that); extra protocols must not be enabled at runtime based on the remote RM configuration.
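
To illustrate the behaviour requested here, a minimal sketch of the filtering step: the node keeps only those server-advertised protocols that its own configuration enables. This is not ProActive code. The function name `allowed_protocols` is hypothetical, and treating `proactive.communication.additional_protocols` as a comma-separated list is an assumption.

```python
from typing import Iterable, List


def allowed_protocols(default_protocol: str,
                      additional_protocols: str,
                      advertised: Iterable[str]) -> List[str]:
    """Keep only server-advertised protocols that the node config enables.

    default_protocol     -- value of proactive.communication.protocol
    additional_protocols -- value of proactive.communication.additional_protocols
                            (assumed comma-separated, may be empty)
    advertised           -- protocols the resource manager responds with
    """
    local = {default_protocol.strip().lower()}
    local.update(p.strip().lower() for p in additional_protocols.split(","))
    local.discard("")  # an empty setting enables nothing
    return [p for p in advertised if p.strip().lower() in local]


# Node configured as in node.ini above: pnp only, no additional protocols,
# while the resource manager advertises both pnp and pamr.
print(allowed_protocols("pnp", "", ["pnp", "pamr"]))  # ['pnp']
```

With a filter of this kind, a node configured for PNP only would never try to open a PAMR tunnel, whatever the resource manager advertises.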

bamedro commented 7 years ago

We have the same issue when using PAMR on the node while both protocols are enabled on the server. As Fabien said, the server responds with both protocols, which produces a stack trace on the node side. This is not blocking and everything is fine in the end, but seeing a stack trace about a network issue can be disturbing for a user.

Execution command:


/opt/proactive/java/bin/java -jar /opt/proactive/node.jar -Dproactive.net.nolocal=true -Dproactive.communication.protocol=pamr -Dproactive.pamr.router.address=176.167.228.111 -v /* CRED */ -w 3 -r pamr://0 -n ns1155-0 -s ns1155

Oct 11 10:13:56 debian proactive-node.sh[1010]: Node pamr://4097/ns1155-0_1 added.
Oct 11 10:13:58 debian proactive-node.sh[1010]: Adding node ns1155-0_2 to Resource Manager.
Oct 11 10:14:09 debian proactive-node.sh[1010]: [ROAdapter] Skipping default protocol pnp because of received exception
Oct 11 10:14:09 debian proactive-node.sh[1010]: org.objectweb.proactive.core.exceptions.IOException6: Failed to send PNP message to pnp://192.168.1.50:64738/ActiveObject_org.ow2.proactive.resourcemanager.core.RMCore_-45100344-15f0ad8177b--7fff--6dc2643e9ab2967c--45100344-15f0ad8177b--8000
  /* Big Stack Trace with some Caused by: java.net.ConnectException: Connection refused: 0.0.0.0/0.0.0.0:64738 */
Oct 11 10:14:21 debian proactive-node.sh[1010]: Node pamr://4097/ns1155-0_2 added.
Oct 11 10:14:21 debian proactive-node.sh[1010]: Connected to the resource manager at pamr://0