When I try to run `engine.shutdown()`, it fails with RRef leaks and a resulting empty-pipe error. When I run the tests, particularly those in `tests/test_engine`, four tests are collected and then hang.
Possible causes:
- The master and workers are not shut down gracefully, leaving a residual RPC agent behind. The next test then collides with the existing agent and hangs for a long time before failing with `Socket Timeout`. (See the worker-loop sketch below this list.)
- RRef leaks are detected during shutdown, which usually means application code still holds references to RRef instances when `shutdown()` is called. This is safe to ignore if the process is exiting, but if the program keeps running, these leaks can result in memory leaks on the RRef owners.
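If the first hypothesis is right, the workers should exit their polling loop on request instead of being killed by `KeyboardInterrupt` inside `time.sleep` (see the tracebacks below). A minimal sketch, assuming `recv_nowait()` raises `RuntimeError('pipe is empty')` as in `energonai/pipe.py`; `handle_task` and the `None` stop sentinel are placeholders, not the actual energonai API:

```python
import time

# Sketch only: `handle_task` and the None stop sentinel are assumptions,
# not the real energonai worker API.
def worker_loop(input_pipe, handle_task):
    while True:
        try:
            task_entry = input_pipe.recv_nowait()
        except RuntimeError:       # 'pipe is empty' -> nothing queued yet
            time.sleep(0.01)       # where KeyboardInterrupt currently lands
            continue
        if task_entry is None:     # stop sentinel sent by the master
            break                  # exit cleanly before rpc.shutdown()
        handle_task(task_entry)
```

With a loop like this, the master could send the sentinel to every worker before tearing down RPC, so no process dies mid-sleep.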
Debugging output:
============================== test session starts ========================================
platform linux -- Python 3.8.12, pytest-7.2.0, pluggy-1.0.0
plugins: xdist-3.0.2, anyio-3.6.2
collected 4 items
**tests/test_engine/test_hybrid.py**
[12/15/22 09:15:23] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
[12/15/22 09:15:23] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
[12/15/22 09:15:23] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
[12/15/22 09:15:23] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
INFO colossalai - colossalai - INFO: process rank 2 is bound to device 2
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
INFO colossalai - colossalai - INFO: process rank 3 is bound to device 3
[12/15/22 09:15:26] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
[12/15/22 09:15:26] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1025,the default
parallel seed is ParallelMode.DATA.
[12/15/22 09:15:26] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
[12/15/22 09:15:26] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default
parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 3, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 2049,the default
parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 2, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 2048,the default
parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/initialize.py:117 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 2, tensor parallel size: 2
[12/15/22 09:15:27] INFO colossalai - energonai - INFO: /root/Projects/energonai/worker.py:61 __init__
INFO colossalai - energonai - INFO: worker0 start
[12/15/22 09:15:27] INFO colossalai - energonai - INFO: /root/Projects/energonai/worker.py:61 __init__
INFO colossalai - energonai - INFO: worker1 start
[12/15/22 09:15:27] INFO colossalai - energonai - INFO: /root/Projects/energonai/worker.py:61 __init__
INFO colossalai - energonai - INFO: worker2 start
[12/15/22 09:15:27] INFO colossalai - energonai - INFO: /root/Projects/energonai/worker.py:61 __init__
INFO colossalai - energonai - INFO: worker3 start
Process SpawnProcess-1:
Traceback (most recent call last):
File "/root/Projects/energonai/pipe.py", line 66, in recv_nowait
return self.local_queue.get_nowait()
File "/usr/local/lib/python3.8/queue.py", line 198, in get_nowait
return self.get(block=False)
File "/usr/local/lib/python3.8/queue.py", line 167, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/Projects/energonai/worker.py", line 68, in _start
task_entry: TaskEntry = self.input_pipe.recv_nowait()
File "/root/Projects/energonai/pipe.py", line 68, in recv_nowait
raise RuntimeError('pipe is empty')
RuntimeError: pipe is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/Projects/energonai/worker.py", line 62, in __init__
self._start()
File "/root/Projects/energonai/worker.py", line 73, in _start
time.sleep(0.01)
KeyboardInterrupt
[W rref_context.cpp:156] **Detected RRef Leaks during shutdown. This usually occurs when the application code still holds references to RRef instances when calling shutdown(). If the program has completed correctly and the process is exiting, it is OK to ignore these leaks. However, if you program will keep running after this, these leaks could result in memory leaks on RRef owners. Please make sure all RRefs are out of scope and Python GC has deleted them before calling shutdown():
Leaking RRef GloballyUniqueId(created_on=3, local_id=0) with fork Id GloballyUniqueId(created_on=3, local_id=1)
Leaking RRef GloballyUniqueId(created_on=4, local_id=0) with fork Id GloballyUniqueId(created_on=4, local_id=1)
Leaking RRef GloballyUniqueId(created_on=1, local_id=0) with fork Id GloballyUniqueId(created_on=1, local_id=1)
Leaking RRef GloballyUniqueId(created_on=2, local_id=0) with fork Id GloballyUniqueId(created_on=2, local_id=1)**
[W tensorpipe_agent.cpp:726] RPC agent for worker3 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for worker2 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for worker1 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for worker0 encountered error when reading incoming request from master: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
[W tensorpipe_agent.cpp:726] RPC agent for master encountered error when reading incoming request from worker0: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
[W tensorpipe_agent.cpp:726] RPC agent for master encountered error when reading incoming request from worker1: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
[W tensorpipe_agent.cpp:726] RPC agent for master encountered error when reading incoming request from worker2: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
[W tensorpipe_agent.cpp:726] RPC agent for master encountered error when reading incoming request from worker3: pipe closed (this error originated at tensorpipe/core/pipe_impl.cc:356)
Process SpawnProcess-4:
Traceback (most recent call last):
File "/root/Projects/energonai/pipe.py", line 66, in recv_nowait
return self.local_queue.get_nowait()
File "/usr/local/lib/python3.8/queue.py", line 198, in get_nowait
return self.get(block=False)
File "/usr/local/lib/python3.8/queue.py", line 167, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/Projects/energonai/worker.py", line 68, in _start
task_entry: TaskEntry = self.input_pipe.recv_nowait()
File "/root/Projects/energonai/pipe.py", line 68, in recv_nowait
raise RuntimeError('pipe is empty')
RuntimeError: pipe is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/Projects/energonai/worker.py", line 62, in __init__
self._start()
File "/root/Projects/energonai/worker.py", line 73, in _start
time.sleep(0.01)
KeyboardInterrupt
Process SpawnProcess-2:
Traceback (most recent call last):
File "/root/Projects/energonai/pipe.py", line 66, in recv_nowait
return self.local_queue.get_nowait()
File "/usr/local/lib/python3.8/queue.py", line 198, in get_nowait
return self.get(block=False)
File "/usr/local/lib/python3.8/queue.py", line 167, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/Projects/energonai/worker.py", line 68, in _start
task_entry: TaskEntry = self.input_pipe.recv_nowait()
File "/root/Projects/energonai/pipe.py", line 68, in recv_nowait
raise RuntimeError('pipe is empty')
RuntimeError: pipe is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/Projects/energonai/worker.py", line 62, in __init__
self._start()
File "/root/Projects/energonai/worker.py", line 73, in _start
time.sleep(0.01)
KeyboardInterrupt
Process SpawnProcess-3:
.Traceback (most recent call last):
File "/root/Projects/energonai/pipe.py", line 66, in recv_nowait
return self.local_queue.get_nowait()
File "/usr/local/lib/python3.8/queue.py", line 198, in get_nowait
return self.get(block=False)
File "/usr/local/lib/python3.8/queue.py", line 167, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/Projects/energonai/worker.py", line 68, in _start
task_entry: TaskEntry = self.input_pipe.recv_nowait()
File "/root/Projects/energonai/pipe.py", line 68, in recv_nowait
raise RuntimeError('pipe is empty')
RuntimeError: pipe is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/Projects/energonai/worker.py", line 62, in __init__
self._start()
File "/root/Projects/energonai/worker.py", line 73, in _start
time.sleep(0.01)
KeyboardInterrupt
**tests/test_engine/test_pp.py**
[12/15/22 09:15:31] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
[12/15/22 09:15:31] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:521 set_device
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[12/15/22 09:15:32] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
[12/15/22 09:15:32] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:557 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 2048,the default
parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default
parallel seed is ParallelMode.DATA.
INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/site-packages/colossalai/initialize.py:117 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 2, tensor parallel size: 1
Process SpawnProcess-6:
Process SpawnProcess-5:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/Projects/energonai/worker.py", line 35, in __init__
trpc.init_rpc(self.rpc_name, rank=self.global_rank + 1, world_size=self.world_size + 1,
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
_init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
rpc_agent = backend_registry.init_backend(
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
return backend.value.init_backend_handler(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 332, in _tensorpipe_init_backend_handler
group = _init_process_group(store, rank, world_size)
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 109, in _init_process_group
group = dist.ProcessGroupGloo(store, rank, world_size, process_group_timeout)
RuntimeError: Socket Timeout
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/Projects/energonai/worker.py", line 35, in __init__
trpc.init_rpc(self.rpc_name, rank=self.global_rank + 1, world_size=self.world_size + 1,
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
_init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 231, in _init_rpc_backend
rpc_agent = backend_registry.init_backend(
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
return backend.value.init_backend_handler(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 332, in _tensorpipe_init_backend_handler
group = _init_process_group(store, rank, world_size)
File "/usr/local/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 109, in _init_process_group
group = dist.ProcessGroupGloo(store, rank, world_size, process_group_timeout)
RuntimeError: Socket Timeout
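The leak warning above matches the second hypothesis: the master still holds RRef handles when `shutdown()` is called, and because the first test's processes never fully exit, the next test's `init_rpc` blocks in `ProcessGroupGloo` until it hits `Socket Timeout`. A minimal sketch of a master-side shutdown that follows the warning's own advice (release RRefs, let GC run, then shut down); `worker_rrefs` is a hypothetical container, not the actual engine attribute:

```python
import gc
import torch.distributed.rpc as rpc

def shutdown_master(worker_rrefs: list):
    # Drop every user-held RRef so the RRef context can confirm deletion.
    worker_rrefs.clear()
    gc.collect()  # make sure Python GC has actually freed the RRef objects
    # Graceful shutdown blocks until all RRef forks are resolved; with no
    # live user RRefs it should no longer report leaks, and the agent is
    # fully torn down before the next test calls init_rpc().
    rpc.shutdown(graceful=True)
```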
System settings:
Python 3.8
CUDA Version: 11.3
PyTorch Version: 1.12.1+cu113
CUDA Version in PyTorch Build: 11.3
PyTorch CUDA Version Match: ✓
CUDA Extension: ✓
colossalai 0.1.11rc1+torch1.12cu11.3
energonai 0.0.1+torch1.12cu11.3
torch 1.12.1+cu113
torchaudio 0.10.2+cu113
torchvision 0.13.1+cu113