microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

Having trouble running CNTK with MPI #3380

Closed naxingyu closed 6 years ago

naxingyu commented 6 years ago

I have two machines, both running Windows Server 2016 in the same group, say svr1.corpnet.com and svr2.corpnet.com. I have MPI7 installed as instructed on the wiki, and CNTK-1bit in the same folder on both servers. The MPI server is running as Administrator on both servers. I then logged in to svr1 and tried two commands:

  1. mpiexec -hosts 1 svr2.corpnet.com myExe.exe
     This runs successfully.
  2. mpiexec -hosts 2 svr1.corpnet.com svr2.corpnet.com myExe.exe
     This fails with:

    job aborted:
    [ranks] message

    [0] fatal error
    Fatal error in MPI_Allgather: Other MPI error, error stack:
    MPI_Allgather(sbuf=0x00000001340BEB90, scount=129, MPI_CHAR, rbuf=0x000001D7285B6010, rcount=129, MPI_CHAR, MPI_COMM_WORLD) failed
    [ch3:sock] failed to connnect to remote process c971ed15-1862-45b9-bdf6-d6503215965e:2
    unable to connect to 10.172.134.94 on port 24332, exhausted all endpoints
    unable to connect to 10.172.134.94 on port 24332, A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (errno 10060)

    [1-2] terminated


I have confirmed that each server can ping the other. This looks like an MPI setup issue rather than a CNTK one, but does anyone in this community have a clue about what is going on?

jaliyae commented 6 years ago

This could be due to a firewall blocking incoming connections. Adding a firewall rule might help.

naxingyu commented 6 years ago

I've already added the firewall rules, but it didn't help.


jaliyae commented 6 years ago

Since this is really about getting MS MPI to work in your cluster, I would start with a small MPI program and get that working without CNTK. First check that each machine can run the MPI program on its own, then check that the machines can reach each other, and make sure the MPI service is running on both.
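
One way to check reachability beyond ping is sketched below. Ping uses ICMP, while the error in the log is a TCP connection timeout (errno 10060 is the Winsock timeout code), so a successful ping does not show that TCP connections get through. The helper names `open_listener` and `can_connect` are hypothetical, and the sketch assumes Python is available on both servers. Port 24332 is only the port that appeared in the log; the port used by another run may differ.

```python
import socket


def open_listener(port, host="0.0.0.0"):
    """Open a TCP listening socket on (host, port) and return it.

    Passing port=0 lets the operating system pick a free port.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen()
    return server


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, keep a listener open on svr2 with `open_listener(24332)` and call `can_connect("svr2.corpnet.com", 24332)` from svr1, then repeat in the other direction. If the call returns False while ping still works, something between the machines is dropping TCP traffic on that port.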

naxingyu commented 6 years ago

That makes sense. Thanks!
