jppf-grid / JPPF

The open source grid computing solution
https://www.jppf.org
Apache License 2.0

Connection reset by peer #5

Open pparthenhs opened 4 years ago

pparthenhs commented 4 years ago

I have dockerized the JPPF node and the JPPF master (version 5.2.9). When I submit jobs, I receive the following error on the master:

2020-02-07 20:40:45,472 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=40, readyOps=1, interestOps=0, context=RemoteNodeContext[channel=SelectionKeyWrapper[id=40], state=WAITING_RESULTS, uuid=4855D510-8C5F-F0B0-12AB-448AABEEB221, connectionUuid=null, peer=false, ssl=false]] : java.io.IOException: Connection reset by peer

lolocohen commented 4 years ago

This warning means that a JPPF node was disconnected from a JPPF driver (I think that's what you call the master). It is a normal message to see when a node is shut down or restarted. It may also happen if an error occurs while processing a job.

Without more details, I cannot say what the problem is. Is there any way you can retrieve the JPPF log from the node's container? It should be a file called jppf-node.log in the directory where JPPF is installed.
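
One possible way to pull that file out of a running container is `docker cp`. The sketch below only builds the command; the container name `jppf-node-1` and the install directory `/opt/jppf` are placeholders that do not come from this thread, and the helper name `docker_cp_command` is hypothetical.

```python
import subprocess


def docker_cp_command(container, install_dir, log_name="jppf-node.log", dest="."):
    """Build the `docker cp` argument list that copies a log file out of a container."""
    src = f"{container}:{install_dir.rstrip('/')}/{log_name}"
    return ["docker", "cp", src, dest]


if __name__ == "__main__":
    # Placeholder values: substitute the real container name and JPPF install directory.
    cmd = docker_cp_command("jppf-node-1", "/opt/jppf")
    print(" ".join(cmd))
    # Uncomment to run it (requires the docker CLI and an existing container):
    # subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to check the path before touching the container.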

pparthenhs commented 4 years ago

@lolocohen thank you for your kind reply.

The node log inside the container (tail -f jppf-node.log):

2020-02-10 16:29:46,575 [INFO ][org.jppf.utils.FileUtils.initJPPFTempDir(547)]: JPPF temp folder /tmp/.jppf
2020-02-10 16:29:50,412 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(79)]: --------------------------------------------------------------------------------
2020-02-10 16:29:50,413 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(80)]: JPPF Version: 5.2.9, Build number: 1912, Build date: 2018-04-02 11:00 CEST
2020-02-10 16:29:50,413 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: starting node with PID=38, UUID=C9215247-544D-E674-6FB3-132545955178
2020-02-10 16:29:50,414 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(82)]: --------------------------------------------------------------------------------
2020-02-10 16:29:53,192 [INFO ][org.jppf.classloader.ClassLoaderRequestHandler.run(156)]: maxBatchSize = 1
2020-02-10 16:29:53,529 [INFO ][org.jppf.execute.AbstractExecutionManager.<init>(111)]: running 1 processing thread
2020-02-10 16:29:53,530 [INFO ][org.jppf.execute.AbstractExecutionManager.createThreadManager(137)]: Using default thread manager

The driver log inside the container (tail -f jppf-driver.log):

2020-02-10 16:28:09,711 [INFO ][org.jppf.utils.FileUtils.initJPPFTempDir(547)]: JPPF temp folder /tmp/.jppf
2020-02-10 16:28:09,946 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(79)]: --------------------------------------------------------------------------------
2020-02-10 16:28:09,947 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(80)]: JPPF Version: 5.2.9, Build number: 1912, Build date: 2018-04-02 11:00 CEST
2020-02-10 16:28:09,953 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: starting driver with PID=27, UUID=2E44CF72-408C-4EB0-4CDE-5B7A1C239BB6
2020-02-10 16:28:09,953 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(82)]: --------------------------------------------------------------------------------
2020-02-10 16:28:10,135 [INFO ][org.jppf.nio.NioConstants.getCheckConnection(81)]: NIO checks are enabled
2020-02-10 16:34:27,139 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=4, readyOps=1, interestOps=0, context=RemoteNodeContext[channel=SelectionKeyWrapper[id=4], state=WAITING_RESULTS, uuid=C9215247-544D-E674-6FB3-132545955178, connectionUuid=null, peer=false, ssl=false]] : java.io.IOException: Connection reset by peer
2020-02-10 16:34:27,147 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel SelectionKeyWrapper[id=8, readyOps=1, interestOps=0, context=RemoteNodeContext[channel=SelectionKeyWrapper[id=8], state=WAITING_RESULTS, uuid=3E6F6CF2-C028-ACE0-F795-AF00CBFD5871, connectionUuid=null, peer=false, ssl=false]] : java.io.IOException: Connection reset by peer

The weird behavior is the following, step by step:

A. I submit 30 jobs from the client; they are distributed to the nodes and complete correctly.
B. I submit 30 jobs again (the same as before) and get the java.io.IOException: Connection reset by peer error on the JPPF driver, without even one job being executed.

The only workaround I have found is to restart each node's Docker container before submitting the jobs.
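
To correlate the two logs, a small script can extract the node UUID and channel state from each driver warning and compare them with the UUID the node prints at startup. The following is a minimal sketch, assuming the log line formats shown in this thread; the regular expressions and the function names `parse_driver_warnings` and `node_uuid` are illustrative and not part of JPPF. The sample lines are copied from the logs above.

```python
import re

# Driver warning: timestamp, WARN level, channel state, node uuid, trailing Java exception.
WARN_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \[WARN \].*?"
    r"state=(?P<state>\w+), uuid=(?P<uuid>[0-9A-F-]+),.*?: (?P<error>java\.\S+: .*)$"
)
# Node startup line printed by VersionUtils.logVersionInformation.
NODE_START_RE = re.compile(r"starting node with PID=\d+, UUID=(?P<uuid>[0-9A-F-]+)")


def parse_driver_warnings(lines):
    """Return one dict (ts, state, uuid, error) per channel warning in the driver log."""
    events = []
    for line in lines:
        match = WARN_RE.search(line.rstrip("\n"))
        if match:
            events.append(match.groupdict())
    return events


def node_uuid(lines):
    """Return the UUID from the first 'starting node' line, or None if absent."""
    for line in lines:
        match = NODE_START_RE.search(line)
        if match:
            return match.group("uuid")
    return None


NODE_LOG = [
    "2020-02-10 16:29:50,413 [INFO ][org.jppf.utils.VersionUtils.logVersionInformation(81)]: "
    "starting node with PID=38, UUID=C9215247-544D-E674-6FB3-132545955178",
]
DRIVER_LOG = [
    "2020-02-10 16:34:27,139 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel "
    "SelectionKeyWrapper[id=4, readyOps=1, interestOps=0, context=RemoteNodeContext[channel="
    "SelectionKeyWrapper[id=4], state=WAITING_RESULTS, uuid=C9215247-544D-E674-6FB3-132545955178, "
    "connectionUuid=null, peer=false, ssl=false]] : java.io.IOException: Connection reset by peer",
    "2020-02-10 16:34:27,147 [WARN ][org.jppf.nio.StateTransitionTask.run(89)]: error on channel "
    "SelectionKeyWrapper[id=8, readyOps=1, interestOps=0, context=RemoteNodeContext[channel="
    "SelectionKeyWrapper[id=8], state=WAITING_RESULTS, uuid=3E6F6CF2-C028-ACE0-F795-AF00CBFD5871, "
    "connectionUuid=null, peer=false, ssl=false]] : java.io.IOException: Connection reset by peer",
]

if __name__ == "__main__":
    this_node = node_uuid(NODE_LOG)
    for event in parse_driver_warnings(DRIVER_LOG):
        owner = "this node" if event["uuid"] == this_node else "another node"
        print(event["ts"], event["state"], event["uuid"], owner, "->", event["error"])
```

On the excerpts above, the first warning carries the same UUID as the node whose log is shown, and the second refers to a different node. Both channels were in state WAITING_RESULTS when the connection was reset.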