consistently hitting FileNotFoundError for triton cache kernel stage

export CUDA_VISIBLE_DEVICES="0,1,2,3"
python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --tokenizer-path lmms-lab/llavanext-qwen-tokenizer --port=30010 --host="0.0.0.0" --tp-size=4 --random-seed=1234 --context-length=32768

always hit this:

Exception in ModelRpcClient:
Traceback (most recent call last):
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 175, in exposed_step
    self.forward_step()
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 204, in forward_step
    self.forward_decode_batch(self.running_batch)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_rpc.py", line 534, in forward_decode_batch
    ) = self.model_runner.forward(batch, ForwardMode.DECODE)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 406, in forward
    return self.forward_decode(batch)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/managers/router/model_runner.py", line 375, in forward_decode
    return self.model.forward(
  File "/home/ubuntu/sglang/python/sglang/srt/models/llava_qwen.py", line 247, in forward
    return self.language_model(input_ids, positions, input_metadata)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/models/qwen2.py", line 269, in forward
    hidden_states = self.model(input_ids, positions, input_metadata, input_embeds)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/models/qwen2.py", line 239, in forward
    hidden_states, residual = layer(
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/models/qwen2.py", line 191, in forward
    hidden_states = self.self_attn(
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/models/qwen2.py", line 140, in forward
    attn_output = self.attn(q, k, v, input_metadata)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/sglang/python/sglang/srt/layers/radix_attention.py", line 125, in forward
    return self.decode_forward(q, k, v, input_metadata)
  File "/home/ubuntu/sglang/python/sglang/srt/layers/radix_attention.py", line 79, in decode_forward_triton
    token_attention_fwd(
  File "/home/ubuntu/sglang/python/sglang/srt/layers/token_attention.py", line 339, in token_attention_fwd
    _token_softmax_reducev_fwd(
  File "/home/ubuntu/sglang/python/sglang/srt/layers/token_attention.py", line 284, in _token_softmax_reducev_fwd
    _fwd_kernel_stage2[grid](
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/triton/runtime/jit.py", line 416, in run
    self.cache[device][key] = compile(
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/triton/compiler/compiler.py", line 202, in compile
    return CompiledKernel(so_path, metadata_group.get(metadata_filename))
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/triton/compiler/compiler.py", line 230, in __init__
    self.asm = {
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/triton/compiler/compiler.py", line 231, in <dictcomp>
    file.suffix[1:]: file.read_bytes() if file.suffix[1:] == driver.binary_ext else file.read_text()
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/pathlib.py", line 1134, in read_text
    with self.open(mode='r', encoding=encoding, errors=errors) as f:
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/.triton/cache/8027c157df5f85e0655fe2f49379130c/_fwd_kernel_stage2.json.tmp.pid_4180629_32310'

INFO:     127.0.0.1:42948 - "GET /get_model_info HTTP/1.1" 200 OK

As you see a curl for model info still works, but I think there is problem.

When trying to use, then hit:

Exception in thread Thread-1 (_wait_and_warmup):
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/urllib3/connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/urllib3/connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/requests/adapters.py", line 589, in send
    resp = conn.urlopen(
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/urllib3/connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/urllib3/util/retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/urllib3/util/util.py", line 39, in reraise
    raise value
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/urllib3/connectionpool.py", line 539, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/urllib3/connectionpool.py", line 370, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='0.0.0.0', port=30010): Read timed out. (read timeout=600)
:

and:

requests.exceptions.ReadTimeout: HTTPConnectionPool(host='0.0.0.0', port=30010): Read timed out. (read timeout=600)
Initialization failed. warmup error: HTTPConnectionPool(host='0.0.0.0', port=30010): Read timed out. (read timeout=600)
INFO:     172.16.0.42:34916 - "GET /health HTTP/1.1" 200 OK
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=164, OpType=ALLREDUCE, NumelIn=8192, NumelOut=8192, Timeout(ms)=600000) ran for 600007 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 3] Timeout at NCCL work: 164, last enqueued NCCL work: 309, last completed NCCL work: 163.
[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=164, OpType=ALLREDUCE, NumelIn=8192, NumelOut=8192, Timeout(ms)=600000) ran for 600007 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7e733629e897 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7e72ea27b1b2 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7e72ea27ffd0 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7e72ea28131c in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7e73367f5bf4 in /home/ubuntu/miniconda3/envs/sglang/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7e73db294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7e73db326850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=164, OpType=ALLREDUCE, NumelIn=8192, NumelOut=8192, Timeout(ms)=600000) ran for 600025 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 0] Timeout at NCCL work: 164, last enqueued NCCL work: 309, last completed NCCL work: 163.
[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=164, OpType=ALLREDUCE, NumelIn=8192, NumelOut=8192, Timeout(ms)=600000) ran for 600025 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7e733629e897 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7e72ea27b1b2 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7e72ea27ffd0 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7e72ea28131c in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7e73367f5bf4 in /home/ubuntu/miniconda3/envs/sglang/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7e73db294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7e73db326850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=164, OpType=ALLREDUCE, NumelIn=8192, NumelOut=8192, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 2] Timeout at NCCL work: 164, last enqueued NCCL work: 309, last completed NCCL work: 163.
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=164, OpType=ALLREDUCE, NumelIn=8192, NumelOut=8192, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7e733629e897 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7e72ea27b1b2 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7e72ea27ffd0 in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7e72ea28131c in /home/ubuntu/miniconda3/envs/sglang/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7e73367f5bf4 in /home/ubuntu/miniconda3/envs/sglang/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7e73db294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7e73db326850 in /lib/x86_64-linux-gnu/libc.so.6)

INFO:     172.16.0.42:15438 - "GET /health HTTP/1.1" 200 OK
INFO:     172.16.0.42:36646 - "GET /health HTTP/1.1" 200 OK
INFO:     172.16.0.42:12566 - "GET /health HTTP/1.1" 200 OK

sgl-project / sglang

consistently hitting FileNotFoundError for triton cache kernel stage #472