Closed (naxingyu closed this issue 6 years ago)
This could be due to a firewall blocking incoming connections. Adding a firewall rule might help.
I've already added the firewall rules, but it didn't help.
Since this is really about getting MS MPI to work on your cluster, I would start with a small MPI program and get that working without CNTK. First check that each machine can run the MPI program on its own, then check that the machines can reach each other, and make sure the MPI service is running on both.
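A minimal sketch of the reachability part of that check, written in Python so it can be run on each machine without involving MPI or CNTK. The helper names `can_connect` and `report` are hypothetical, and port 8677 is an assumption (I believe it is the default port for the MS-MPI `smpd` service; substitute whatever port your setup actually uses).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name-resolution failures.
        return False


def report(hosts, port):
    """Map each host to True/False depending on whether the port accepted a connection."""
    return {host: can_connect(host, port) for host in hosts}


# Example usage (host names from the original post; the port is an assumption):
#   print(report(["svr1.corpnet.com", "svr2.corpnet.com"], 8677))
```

Running it from svr1 and again from svr2 shows whether each machine can open a TCP connection to the other. A `False` in only one direction would point at an inbound rule on one of the hosts rather than at MPI itself.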
That makes sense... Thanks!
I have two machines, both running Windows Server 2016 in the same group, say svr1.corpnet.com and svr2.corpnet.com. I have MPI7 installed as instructed on the wiki, and CNTK-1bit in the same folder on both servers. The MPI server is running as Administrator on both servers. I then logged in to svr1 and tried two methods:
[0] fatal error
Fatal error in MPI_Allgather: Other MPI error, error stack:
MPI_Allgather(sbuf=0x00000001340BEB90, scount=129, MPI_CHAR, rbuf=0x000001D7285B6010, rcount=129, MPI_CHAR, MPI_COMM_WORLD) failed
[ch3:sock] failed to connnect to remote process c971ed15-1862-45b9-bdf6-d6503215965e:2
unable to connect to 10.172.134.94 on port 24332, exhausted all endpoints
unable to connect to 10.172.134.94 on port 24332, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (errno 10060)
[1-2] terminated
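To make the log above easier to act on, here is a small sketch that pulls the failing endpoint and the Winsock error code out of the text. The function name `summarize_mpi_error` is hypothetical, and the regular expressions assume the wording shown in this particular log. Error 10060 is WSAETIMEDOUT (no reply at all), while 10061 is WSAECONNREFUSED (the host replied but nothing was listening). A timeout is often, though not always, a sign that packets are being dropped on the way.

```python
import re

# Winsock error codes relevant here (values from the Windows Sockets error code list).
WINSOCK_ERRORS = {
    10060: "WSAETIMEDOUT: no response before the timeout",
    10061: "WSAECONNREFUSED: host answered, but nothing is listening on that port",
}

# Patterns assume the message wording seen in the log above.
ENDPOINT_RE = re.compile(r"unable to connect to (\d{1,3}(?:\.\d{1,3}){3}) on port (\d+)")
ERRNO_RE = re.compile(r"\(errno (\d+)\)")


def summarize_mpi_error(log: str) -> dict:
    """Extract the unique (ip, port) endpoints and error codes mentioned in an MPI error log."""
    endpoints = sorted({(ip, int(port)) for ip, port in ENDPOINT_RE.findall(log)})
    codes = [int(code) for code in ERRNO_RE.findall(log)]
    return {
        "endpoints": endpoints,
        "errors": {code: WINSOCK_ERRORS.get(code, "unknown") for code in codes},
    }
```

The port in the log (24332) appears to be chosen at run time rather than fixed, so a firewall rule that opens a single fixed port may not cover it.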