vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0
26.62k stars 3.9k forks

[Bug]: Docker build for ROCm fails for latest release and main branch #7813

Open Spurthi-Bhat-ScalersAI opened 3 weeks ago

Spurthi-Bhat-ScalersAI commented 3 weeks ago

Your current environment

Server with MI300X GPUs

🐛 Describe the bug

Build the vLLM ROCm image by following this link

The Docker build fails with the following error:

 > [build_triton 1/1] RUN --mount=type=cache,target=/root/.cache/ccache     if [ "1" = "1" ]; then     mkdir -p libs     && cd libs     && git clone https://github.com/OpenAI/triton.git     && cd triton     && git checkout "main"     && cd python     && python3 setup.py bdist_wheel --dist-dir=/install;     else mkdir -p /install;     fi:
0.241 Cloning into 'triton'...
10.80 Already on 'main'
10.80 Your branch is up to date with 'origin/main'.
140.4 downloading and extracting https://anaconda.org/nvidia/cuda-nvcc/12.4.99/download/linux-64/cuda-nvcc-12.4.99-0.tar.bz2 ...
140.4 Traceback (most recent call last):
140.4   File "/opt/conda/envs/py_3.9/lib/python3.9/urllib/request.py", line 1346, in do_open
140.4     h.request(req.get_method(), req.selector, req.data, headers,
140.4   File "/opt/conda/envs/py_3.9/lib/python3.9/http/client.py", line 1285, in request
140.4     self._send_request(method, url, body, headers, encode_chunked)
140.4   File "/opt/conda/envs/py_3.9/lib/python3.9/http/client.py", line 1331, in _send_request
140.4     self.endheaders(body, encode_chunked=encode_chunked)
140.4   File "/opt/conda/envs/py_3.9/lib/python3.9/http/client.py", line 1280, in endheaders
140.4     self._send_output(message_body, encode_chunked=encode_chunked)
140.4   File "/opt/conda/envs/py_3.9/lib/python3.9/http/client.py", line 1040, in _send_output
140.4     self.send(msg)
140.4   File "/opt/conda/envs/py_3.9/lib/python3.9/http/client.py", line 980, in send
140.4     self.connect()
140.4   File "/opt/conda/envs/py_3.9/lib/python3.9/http/client.py", line 1447, in connect
140.4     super().connect()
140.4   File "/opt/conda/envs/py_3.9/lib/python3.9/http/client.py", line 946, in connect
140.4     self.sock = self._create_connection(
140.4   File "/opt/conda/envs/py_3.9/lib/python3.9/socket.py", line 844, in create_connection
140.4     raise err
140.4   File "/opt/conda/envs/py_3.9/lib/python3.9/socket.py", line 832, in create_connection
140.4     sock.connect(sa)
140.4 TimeoutError: [Errno 110] Connection timed out
140.4
140.4 During handling of the above exception, another exception occurred:
140.4
140.4 Traceback (most recent call last):
140.4   File "/vllm-workspace/libs/triton/python/setup.py", line 472, in <module>
140.4     download_and_copy(
140.4   File "/vllm-workspace/libs/triton/python/setup.py", line 293, in download_and_copy
140.4     file = tarfile.open(fileobj=open_url(url), mode="r|*")
140.4   File "/vllm-workspace/libs/triton/python/setup.py", line 216, in open_url
140.4     return urllib.request.urlopen(request, timeout=300)
140.4   File "/opt/conda/envs/py_3.9/lib/python3.9/urllib/request.py", line 214, in urlopen
140.4     return opener.open(url, data, timeout)
140.4   File "/opt/conda/envs/py_3.9/lib/python3.9/urllib/request.py", line 517, in open
140.4     response = self._open(req, data)
140.4   File "/opt/conda/envs/py_3.9/lib/python3.9/urllib/request.py", line 534, in _open
140.4     result = self._call_chain(self.handle_open, protocol, protocol +
140.4   File "/opt/conda/envs/py_3.9/lib/python3.9/urllib/request.py", line 494, in _call_chain
140.4     result = func(*args)
140.4   File "/opt/conda/envs/py_3.9/lib/python3.9/urllib/request.py", line 1389, in https_open
140.4     return self.do_open(http.client.HTTPSConnection, req,
140.4   File "/opt/conda/envs/py_3.9/lib/python3.9/urllib/request.py", line 1349, in do_open
140.4     raise URLError(err)
140.4 urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
------
Dockerfile.rocm:111
--------------------
 110 |     # Build triton wheel if `BUILD_TRITON = 1`
 111 | >>> RUN --mount=type=cache,target=${CCACHE_DIR} \
 112 | >>>     if [ "$BUILD_TRITON" = "1" ]; then \
 113 | >>>     mkdir -p libs \
 114 | >>>     && cd libs \
 115 | >>>     && git clone https://github.com/OpenAI/triton.git \
 116 | >>>     && cd triton \
 117 | >>>     && git checkout "${TRITON_BRANCH}" \
 118 | >>>     && cd python \
 119 | >>>     && python3 setup.py bdist_wheel --dist-dir=/install; \
 120 | >>>     # Create an empty directory otherwise as later build stages expect one
 121 | >>>     else mkdir -p /install; \
 122 | >>>     fi
 123 |
--------------------
ERROR: failed to solve: process "/bin/sh -c if [ \"$BUILD_TRITON\" = \"1\" ]; then     mkdir -p libs     && cd libs     && git clone https://github.com/OpenAI/triton.git     && cd triton     && git checkout \"${TRITON_BRANCH}\"     && cd python     && python3 setup.py bdist_wheel --dist-dir=/install;     else mkdir -p /install;     fi" did not complete successfully: exit code: 1

This is failing for both the latest release and the main branch.

This might be because the OpenAI Triton repository has moved to a different GitHub repository.

hongxiayang commented 3 weeks ago

The error is related to downloading this file: `downloading and extracting https://anaconda.org/nvidia/cuda-nvcc/12.4.99/download/linux-64/cuda-nvcc-12.4.99-0.tar.bz2 ...`

One thing you can check is whether you can download it manually: https://anaconda.org/nvidia/cuda-nvcc/12.4.99/download/linux-64/cuda-nvcc-12.4.99-0.tar.bz2
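The manual check above can be scripted on the build host. A minimal sketch — the `check_url` helper and the short timeout are my own, not part of triton's setup.py (which uses `urllib.request.urlopen` with a 300-second timeout):

```python
import urllib.request
import urllib.error


def check_url(url: str, timeout: float = 10.0) -> str:
    """Try to open `url` and report success or the failure reason."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"OK: HTTP {resp.status}"
    except (urllib.error.URLError, OSError) as err:
        # Covers the [Errno 110] Connection timed out seen in the build log.
        return f"FAILED: {err}"


if __name__ == "__main__":
    # The exact URL from the failing build step; run this from the same
    # network as the Docker build to see whether anaconda.org is reachable.
    print(check_url(
        "https://anaconda.org/nvidia/cuda-nvcc/12.4.99/"
        "download/linux-64/cuda-nvcc-12.4.99-0.tar.bz2"
    ))
```

If this also times out, the problem is network access (proxy/firewall) from the build environment rather than anything in the Dockerfile.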

hongxiayang commented 3 weeks ago

btw, my download is ok when I build it.

downloading and extracting https://anaconda.org/nvidia/cuda-nvcc/12.4.99/download/linux-64/cuda-nvcc-12.4.99-0.tar.bz2 ...
#18 35.86 copy /root/.triton/nvidia/ptxas/bin/ptxas to /vllm-workspace/libs/triton/python/../third_party/nvidia/backend/bin/ptxas ...

Please retry your build to see whether it is a transient issue.
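If the failure really is transient, a bounded retry around the download call is one way to harden the step. A hedged sketch — `retry` and its parameters are illustrative, not something the Dockerfile or triton's setup.py provides:

```python
import time


def retry(fetch, attempts: int = 3, delay: float = 5.0):
    """Call `fetch()` up to `attempts` times, sleeping `delay` seconds
    between tries; re-raise the last error if every attempt fails."""
    last_err = None
    for i in range(attempts):
        try:
            return fetch()
        except OSError as err:  # URLError and TimeoutError are OSError subclasses
            last_err = err
            if i < attempts - 1:
                time.sleep(delay)
    raise last_err
```

For example, `retry(lambda: urllib.request.urlopen(url, timeout=300))` would retry the same download that `open_url` in triton's setup.py performs once.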

Spurthi-Bhat-ScalersAI commented 2 weeks ago

@hongxiayang Thank you for the update. Will test it out and update accordingly.