pytorch / tensorpipe

A tensor-aware point-to-point communication primitive for machine learning

rv < 0: too many open files #460

Open hamidralmasi opened 1 year ago


I'm trying to use torch RPC for distributed training in a parameter server architecture. With fewer than 20 workers, everything works fine, but once I increase the number of workers to 20 or more, I get the following runtime error:

terminate called after throwing an instance of 'std::runtime_error'
what(): In connectFromLoop at tensorpipe/transport/uv/uv.h:297 "rv < 0: too many open files"

followed by:

[W tensorpipe_agent.cpp:726] RPC agent for worker:2 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:8 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:3 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:9 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:18 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:13 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:0 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:4 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:5 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:16 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:1 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:6 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:7 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:17 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:10 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:14 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:12 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:11 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:15 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)
[W tensorpipe_agent.cpp:726] RPC agent for worker:19 encountered error when reading incoming request from ps:0: EOF: end of file (this error originated at tensorpipe/transport/uv/connection_impl.cc:132)

I call init_rpc with these arguments:

rpc.init_rpc(
    'worker:{}'.format(rank - num_ps),
    rank=rank,
    world_size=world_size,
    rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
        init_method='env://',
        _transports=["uv"],
    ),
)

I'm using PyTorch 1.13 with CUDA Toolkit 11.7, but I previously hit a similar issue with PyTorch 1.8.1 and CUDA 10.2 as well.

Running cat /proc/sys/fs/file-max returns 9223372036854775807, and by logging the number of open files I can confirm that this limit is never reached. I'm curious where the issue might be coming from and how it should be fixed.
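One thing that may be worth checking: /proc/sys/fs/file-max is the system-wide ceiling, whereas "too many open files" is usually reported when a single process reaches its own descriptor limit (RLIMIT_NOFILE, the value shown by ulimit -n). Below is a minimal sketch for inspecting that limit from inside a worker or parameter server process. It assumes Linux with /proc available, and the helper names fd_usage and raise_soft_limit are hypothetical, not part of PyTorch or TensorPipe.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # /proc/self/fd has one entry per open descriptor (Linux only).
    # The count includes the descriptor used to list the directory itself.
    open_fds = len(os.listdir('/proc/self/fd'))
    return open_fds, soft, hard


def raise_soft_limit(target):
    """Try to raise the soft RLIMIT_NOFILE to `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already at least as high as requested
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


if __name__ == '__main__':
    open_fds, soft, hard = fd_usage()
    print('open fds: {}, soft limit: {}, hard limit: {}'.format(open_fds, soft, hard))
```

Calling fd_usage() just before init_rpc and again when the error appears would show whether the per-process soft limit, rather than file-max, is the one being reached. This is a diagnostic sketch under the assumptions above, not a confirmed fix.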

Thank you!