I'm trying to use torch RPC for distributed training in a parameter server architecture. With fewer than 20 workers everything works fine, but as soon as I increase the number of workers to 20 or more, I get the following runtime error:
terminate called after throwing an instance of 'std::runtime_error' what(): In connectFromLoop at tensorpipe/transport/uv/uv.h:297 "rv < 0: too many open files"
followed by:
[W tensorpipe_agent.cpp:726] RPC agent for worker:2 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:8 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:3 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:9 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:18 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:13 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:0 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:4 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:5 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:16 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:1 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:6 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:7 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:17 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:10 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:14 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:12 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:11 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:15 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:19 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
I call the init_rpc with these arguments:
rpc.init_rpc('worker:{}'.format(rank-num_ps), rank=rank, world_size=world_size, rpc_backend_options=rpc.TensorPipeRpcBackendOptions(init_method='env://', _transports=["uv"],))
I'm using PyTorch 1.13 with CUDA toolkit 11.7, but I previously saw a similar issue with PyTorch 1.8.1 and CUDA 10.2.
Running
cat /proc/sys/fs/file-max
gives me 9223372036854775807, and by logging the number of open files I can confirm that this limit is never reached. I'm curious where the issue might be coming from and how it should be fixed. Thank you!
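One thing worth separating here: /proc/sys/fs/file-max is the system-wide ceiling, while each process also has its own descriptor limit (RLIMIT_NOFILE, the value reported by ulimit -n), which is usually far lower. Whether that per-process limit is what the RPC agent is hitting is an assumption, not something the logs above confirm. The following is a minimal sketch, assuming Linux, for comparing a process's open-descriptor count against its own limit and raising the soft limit before rpc.init_rpc is called. The helper names fd_limits, open_fd_count, and raise_soft_limit are hypothetical and not part of torch.

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Count descriptors currently open in this process (Linux only, via /proc)."""
    return len(os.listdir("/proc/self/fd"))


def raise_soft_limit(target):
    """Raise the soft limit to `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards. Never lowers the limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    soft, hard = fd_limits()
    print(f"per-process limit: soft={soft} hard={hard}")
    print(f"currently open descriptors: {open_fd_count()}")
    # 65536 is an arbitrary illustrative target, not a recommended value.
    print(f"soft limit after raise attempt: {raise_soft_limit(65536)}")
```

If this is used, it would need to run in every process (the parameter server and each worker) before the RPC agent is initialized, since the limit applies per process. Raising the limit in the launching shell with ulimit -n has the same effect on child processes.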