When trying to use it, I hit:
Exception in thread Thread-1 (_wait_and_warmup):
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/urllib3/connectionpool.py", line 537, in _make_request
response = conn.getresponse()
File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/urllib3/connection.py", line 466, in getresponse
httplib_response = super().getresponse()
File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/http/client.py", line 1375, in getresponse
response.begin()
File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/http/client.py", line 279, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/socket.py", line 705, in readinto
return self._sock.recv_into(b)
TimeoutError: timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/requests/adapters.py", line 589, in send
resp = conn.urlopen(
File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/urllib3/connectionpool.py", line 847, in urlopen
retries = retries.increment(
File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/urllib3/util/retry.py", line 470, in increment
raise reraise(type(error), error, _stacktrace)
File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/urllib3/util/util.py", line 39, in reraise
raise value
File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/urllib3/connectionpool.py", line 793, in urlopen
response = self._make_request(
File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/urllib3/connectionpool.py", line 539, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/urllib3/connectionpool.py", line 370, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='0.0.0.0', port=30010): Read timed out. (read timeout=600)
and:
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='0.0.0.0', port=30010): Read timed out. (read timeout=600)
Initialization failed. warmup error: HTTPConnectionPool(host='0.0.0.0', port=30010): Read timed out. (read timeout=600)
INFO: 172.16.0.42:34916 - "GET /health HTTP/1.1" 200 OK
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=164, OpType=ALLREDUCE, NumelIn=8192, NumelOut=8192, Timeout(ms)=600000) ran for 600007 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 3] Timeout at NCCL work: 164, last enqueued NCCL work: 309, last completed NCCL work: 163.
[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=164, OpType=ALLREDUCE, NumelIn=8192, NumelOut=8192, Timeout(ms)=600000) ran for 600007 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7e733629e897 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7e72ea27b1b2 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7e72ea27ffd0 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7e72ea28131c in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7e73367f5bf4 in /home/ubuntu/miniconda3/envs/sglang/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7e73db294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7e73db326850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=164, OpType=ALLREDUCE, NumelIn=8192, NumelOut=8192, Timeout(ms)=600000) ran for 600025 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 0] Timeout at NCCL work: 164, last enqueued NCCL work: 309, last completed NCCL work: 163.
[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=164, OpType=ALLREDUCE, NumelIn=8192, NumelOut=8192, Timeout(ms)=600000) ran for 600025 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7e733629e897 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7e72ea27b1b2 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7e72ea27ffd0 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7e72ea28131c in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7e73367f5bf4 in /home/ubuntu/miniconda3/envs/sglang/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7e73db294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7e73db326850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=164, OpType=ALLREDUCE, NumelIn=8192, NumelOut=8192, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 2] Timeout at NCCL work: 164, last enqueued NCCL work: 309, last completed NCCL work: 163.
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=164, OpType=ALLREDUCE, NumelIn=8192, NumelOut=8192, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7e733629e897 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7e72ea27b1b2 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7e72ea27ffd0 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7e72ea28131c in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7e73367f5bf4 in /home/ubuntu/miniconda3/envs/sglang/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7e73db294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7e73db326850 in /lib/x86_64-linux-gnu/libc.so.6)
INFO: 172.16.0.42:15438 - "GET /health HTTP/1.1" 200 OK
INFO: 172.16.0.42:36646 - "GET /health HTTP/1.1" 200 OK
INFO: 172.16.0.42:12566 - "GET /health HTTP/1.1" 200 OK
It seems the work-around is to set HOME to a normal (local) drive location. I had $HOME/.triton pointing to /.triton (permissions are not the issue) to avoid a home directory on NFS, which leads to other Triton problems.
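For reference, a minimal sketch of that work-around, assuming Triton honors the TRITON_CACHE_DIR environment variable (otherwise overriding HOME has the same effect); the local paths, model name, and launch arguments below are just examples:

```python
import os
import subprocess

# Assumption: Triton reads TRITON_CACHE_DIR (default is $HOME/.triton/cache).
# Point the cache at a local, non-NFS disk before the server starts.
os.environ["TRITON_CACHE_DIR"] = "/local_scratch/triton_cache"  # hypothetical local path
os.makedirs(os.environ["TRITON_CACHE_DIR"], exist_ok=True)

# Alternatively, set HOME itself to a local directory so $HOME/.triton is local too.
# os.environ["HOME"] = "/local_scratch/home"

# Launch the sglang server in this environment (arguments are illustrative).
subprocess.run(
    [
        "python", "-m", "sglang.launch_server",
        "--model-path", "meta-llama/Meta-Llama-3-8B-Instruct",  # example model
        "--port", "30010",
        "--tp", "4",
    ],
    env=os.environ,
    check=True,
)
```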
Same issue here. Any solution? My env is located on NFS, but $HOME is not on NFS.
I always hit this:
As you can see, a curl for model info still works, but I think there is a problem.
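To separate "the HTTP server is alive" from "generation actually completes", a small probe like the sketch below can help. The /health path appears in the log above; the /get_model_info and /generate endpoint names and payload are assumptions about the sglang native API:

```python
import requests

BASE = "http://0.0.0.0:30010"

# Lightweight GETs: these only exercise the HTTP layer, not the GPU workers,
# so they can keep returning 200 OK even while the ranks are hung.
for path in ("/health", "/get_model_info"):
    r = requests.get(BASE + path, timeout=10)
    print(path, r.status_code, r.text[:200])

# A real generation request: if the NCCL all-reduce is stuck, this is the call
# that sits until the read timeout fires, even though the GETs above succeed.
try:
    r = requests.post(
        BASE + "/generate",
        json={"text": "Hello", "sampling_params": {"max_new_tokens": 8}},
        timeout=60,
    )
    print("/generate", r.status_code, r.text[:200])
except requests.exceptions.ReadTimeout:
    print("/generate timed out -- workers are likely stuck in a collective op")
```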