Thanks for writing this up, @Glavin001. I'm sorry you're hitting this issue. I've got a few theories at the moment: what does cog debug show? And maybe sudo is somehow causing a problem: what happens if you run without sudo?
Thanks for your prompt reply and ideas!
Here is the cog debug output:
$ sudo cog debug
#syntax=docker/dockerfile:1.4
FROM curlimages/curl AS downloader
ARG TINI_VERSION=0.19.0
WORKDIR /tmp
RUN curl -fsSL -O "https://github.com/krallin/tini/releases/download/v${TINI_VERSION}/tini" && chmod +x tini
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64:/usr/local/nvidia/bin
COPY --link --from=downloader /tmp/tini /sbin/tini
ENTRYPOINT ["/sbin/tini", "--"]
ENV PATH="/root/.pyenv/shims:/root/.pyenv/bin:$PATH"
RUN --mount=type=cache,target=/var/cache/apt apt-get update -qq && apt-get install -qqy --no-install-recommends \
make \
build-essential \
libssl-dev \
zlib1g-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
wget \
curl \
llvm \
libncurses5-dev \
libncursesw5-dev \
xz-utils \
tk-dev \
libffi-dev \
liblzma-dev \
git \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
RUN curl -s -S -L https://raw.githubusercontent.com/pyenv/pyenv-installer/master/bin/pyenv-installer | bash && \
git clone https://github.com/momo-lab/pyenv-install-latest.git "$(pyenv root)"/plugins/pyenv-install-latest && \
pyenv install-latest "3.10" && \
pyenv global $(pyenv install-latest --print "3.10") && \
pip install "wheel<1"
COPY .cog/tmp/build4127551442/cog-0.0.1.dev-py3-none-any.whl /tmp/cog-0.0.1.dev-py3-none-any.whl
RUN --mount=type=cache,target=/root/.cache/pip pip install /tmp/cog-0.0.1.dev-py3-none-any.whl
COPY .cog/tmp/build4127551442/requirements.txt /tmp/requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip pip install -r /tmp/requirements.txt
WORKDIR /src
EXPOSE 5000
CMD ["python", "-m", "cog.server.http"]
COPY . /src
Without sudo on Lambdalabs, docker isn't accessible:
$ cog build
Building Docker image from environment in cog.yaml as cog-replicate-startup-intervie...
ERROR: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/_ping": dial unix /var/run/docker.sock: connect: permission denied
ⅹ Failed to build Docker image: exit status 1
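For reference, the usual way to make Docker (and therefore cog) work without sudo is to add your user to the docker group. This is standard Docker setup rather than anything cog-specific, so treat it as a general sketch:
sudo usermod -aG docker $USER
# Log out and back in (or start a new shell with `newgrp docker`)
# so the group membership takes effect, then verify:
docker info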
Also, this exact cog.yaml config was working ~4 days ago when I last built and pushed a model to Replicate.
I need to reinstall Cog each time, so it may be a change between versions: I think I was on Cog 0.7.2 before, and recently 0.8.0 (released 3 days ago)?
I'm seeing a lot of mentions of WSL (Linux on Windows?) in related issues: https://github.com/microsoft/WSL/issues/4760
Maybe LambdaLabs is using Windows in their stack? Not sure how to verify.
There may be a way to check: https://github.com/microsoft/WSL/issues/4071#issuecomment-496715404
I'll try tonight.
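For what it's worth, the check linked above usually amounts to reading /proc/version and looking for a Microsoft kernel string; a generic sketch (not the linked comment verbatim):
grep -i microsoft /proc/version && echo "Looks like WSL" || echo "Not WSL"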
Doesn't look like Windows WSL?
ubuntu@IP:~$ /proc/version
bash: /proc/version: Permission denied
ubuntu@IP:~$ sudo /proc/version
sudo: /proc/version: command not found
ubuntu@IP:~$ uname -a
Linux IP 5.15.0-67-generic #74~20.04.1-Ubuntu SMP Wed Feb 22 14:52:34 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
v0.8.0 is the issue.
Workaround: Downgrading to v0.7.2 fixes the issue! 🎉 ✅
$ sudo curl -o /usr/local/bin/cog -L "https://github.com/replicate/cog/releases/download/v0.7.2/cog_Linux_x86_64"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 9444k 100 9444k 0 0 10.7M 0 --:--:-- --:--:-- --:--:-- 56.5M
$ sudo chmod +x /usr/local/bin/cog
$ cog --version
cog version 0.7.2 (built 2023-05-23T10:20:56Z)
$ sudo cog --version
cog version 0.7.2 (built 2023-05-23T10:20:56Z)
$ sudo cog debug
⚠ Cog doesn't know if CUDA 11.8 is compatible with PyTorch 2.0.0. This might cause CUDA problems.
# syntax = docker/dockerfile:1.2
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64:/usr/local/nvidia/bin
RUN --mount=type=cache,target=/var/cache/apt set -eux; \
apt-get update -qq; \
apt-get install -qqy --no-install-recommends curl; \
rm -rf /var/lib/apt/lists/*; \
TINI_VERSION=v0.19.0; \
TINI_ARCH="$(dpkg --print-architecture)"; \
curl -sSL -o /sbin/tini "https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini-${TINI_ARCH}"; \
chmod +x /sbin/tini
ENTRYPOINT ["/sbin/tini", "--"]
ENV PATH="/root/.pyenv/shims:/root/.pyenv/bin:$PATH"
RUN --mount=type=cache,target=/var/cache/apt apt-get update -qq && apt-get install -qqy --no-install-recommends \
make \
build-essential \
libssl-dev \
zlib1g-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
wget \
curl \
llvm \
libncurses5-dev \
libncursesw5-dev \
xz-utils \
tk-dev \
libffi-dev \
liblzma-dev \
git \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
RUN curl -s -S -L https://raw.githubusercontent.com/pyenv/pyenv-installer/master/bin/pyenv-installer | bash && \
git clone https://github.com/momo-lab/pyenv-install-latest.git "$(pyenv root)"/plugins/pyenv-install-latest && \
pyenv install-latest "3.10" && \
pyenv global $(pyenv install-latest --print "3.10") && \
pip install "wheel<1"
COPY .cog/tmp/build4048584965/cog-0.0.1.dev-py3-none-any.whl /tmp/cog-0.0.1.dev-py3-none-any.whl
RUN --mount=type=cache,target=/root/.cache/pip pip install /tmp/cog-0.0.1.dev-py3-none-any.whl
COPY .cog/tmp/build4048584965/requirements.txt /tmp/requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip pip install -r /tmp/requirements.txt
WORKDIR /src
EXPOSE 5000
CMD ["python", "-m", "cog.server.http"]
COPY . /src
Here's the cog debug diff between v0.7.2 and v0.8.0:
--- v0.7.2.txt 2023-07-11 05:31:08
+++ v0.8.0.txt 2023-07-11 05:31:28
@@ -1,18 +1,14 @@
$ sudo cog debug
-⚠ Cog doesn't know if CUDA 11.8 is compatible with PyTorch 2.0.0. This might cause CUDA problems.
-# syntax = docker/dockerfile:1.2
+#syntax=docker/dockerfile:1.4
+FROM curlimages/curl AS downloader
+ARG TINI_VERSION=0.19.0
+WORKDIR /tmp
+RUN curl -fsSL -O "https://github.com/krallin/tini/releases/download/v${TINI_VERSION}/tini" && chmod +x tini
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64:/usr/local/nvidia/bin
-RUN --mount=type=cache,target=/var/cache/apt set -eux; \
-apt-get update -qq; \
-apt-get install -qqy --no-install-recommends curl; \
-rm -rf /var/lib/apt/lists/*; \
-TINI_VERSION=v0.19.0; \
-TINI_ARCH="$(dpkg --print-architecture)"; \
-curl -sSL -o /sbin/tini "https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini-${TINI_ARCH}"; \
-chmod +x /sbin/tini
+COPY --link --from=downloader /tmp/tini /sbin/tini
ENTRYPOINT ["/sbin/tini", "--"]
ENV PATH="/root/.pyenv/shims:/root/.pyenv/bin:$PATH"
RUN --mount=type=cache,target=/var/cache/apt apt-get update -qq && apt-get install -qqy --no-install-recommends \
@@ -40,9 +36,9 @@
pyenv install-latest "3.10" && \
pyenv global $(pyenv install-latest --print "3.10") && \
pip install "wheel<1"
-COPY .cog/tmp/build4048584965/cog-0.0.1.dev-py3-none-any.whl /tmp/cog-0.0.1.dev-py3-none-any.whl
+COPY .cog/tmp/build4127551442/cog-0.0.1.dev-py3-none-any.whl /tmp/cog-0.0.1.dev-py3-none-any.whl
RUN --mount=type=cache,target=/root/.cache/pip pip install /tmp/cog-0.0.1.dev-py3-none-any.whl
-COPY .cog/tmp/build4048584965/requirements.txt /tmp/requirements.txt
+COPY .cog/tmp/build4127551442/requirements.txt /tmp/requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip pip install -r /tmp/requirements.txt
WORKDIR /src
EXPOSE 5000
Thanks to @Glavin001 for the quick workaround! Here's the quick downgrade code:
sudo curl -o /usr/local/bin/cog -L https://github.com/replicate/cog/releases/download/v0.7.2/cog_`uname -s`_`uname -m`
sudo chmod +x /usr/local/bin/cog
I also hit this on Lambdalabs machines, cog version 0.8.1. Another workaround is to set gpu: false, but then you have to launch your containers manually with docker run --gpus all. At least it points to a problem in the GPU-specific parts.
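For anyone trying that route, a rough sketch of launching the built image manually with GPUs (the image name here is illustrative; the port and endpoint follow the EXPOSE 5000 / cog.server.http setup shown in the Dockerfiles above):
sudo docker run -d -p 5000:5000 --gpus all cog-my-model
# The cog HTTP server should then accept prediction requests:
curl -X POST -H 'Content-Type: application/json' \
  -d '{"input": {}}' \
  http://localhost:5000/predictions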
I think I've got a hold of what caused the issue here. First and foremost, the root cause of this problem lies in the lines that install tini.
To verify it, I created three simplified Dockerfiles.
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64:/usr/local/nvidia/bin
ENV PATH="/root/.pyenv/shims:/root/.pyenv/bin:$PATH"
## Here's the original part that installed tini
RUN --mount=type=cache,target=/var/cache/apt set -eux; \
apt-get update -qq; \
apt-get install -qqy --no-install-recommends curl; \
rm -rf /var/lib/apt/lists/*; \
TINI_VERSION=v0.19.0; \
TINI_ARCH="$(dpkg --print-architecture)"; \
curl -sSL -o /sbin/tini "https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini-${TINI_ARCH}"; \
chmod +x /sbin/tini
ENTRYPOINT ["/sbin/tini", "--"]
RUN --mount=type=cache,target=/var/cache/apt apt-get update -qq && apt-get install -qqy --no-install-recommends \
make \
build-essential \
libssl-dev \
zlib1g-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
wget \
curl \
llvm \
libncurses5-dev \
libncursesw5-dev \
xz-utils \
tk-dev \
libffi-dev \
liblzma-dev \
git \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
CMD ["python", "-m", "cog.server.http"]
Building works fine:
sudo docker build -t cog-stable-diffusion -f .tmp/Dockerfile1 .
This next one fails at apt-get update. This is confusing, since we will see next that removing the tini part makes it work.
## This is the new part that downloads tini in the downloader stage
FROM curlimages/curl AS downloader
ARG TINI_VERSION=0.19.0
WORKDIR /tmp
RUN curl -fsSL -O "https://github.com/krallin/tini/releases/download/v${TINI_VERSION}/tini" && chmod +x tini
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64:/usr/local/nvidia/bin
## This is the new part that installs tini
COPY --link --from=downloader /tmp/tini /sbin/tini
ENTRYPOINT ["/sbin/tini", "--"]
ENV PATH="/root/.pyenv/shims:/root/.pyenv/bin:$PATH"
RUN --mount=type=cache,target=/var/cache/apt apt-get update -qq && apt-get install -qqy --no-install-recommends \
make \
build-essential \
libssl-dev \
zlib1g-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
wget \
curl \
llvm \
libncurses5-dev \
libncursesw5-dev \
xz-utils \
tk-dev \
libffi-dev \
liblzma-dev \
git \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
CMD ["python", "-m", "cog.server.http"]
The following Dockerfile only removes the tini downloader stage and the tini COPY command. Now it builds successfully.
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64:/usr/local/nvidia/bin
ENV PATH="/root/.pyenv/shims:/root/.pyenv/bin:$PATH"
RUN --mount=type=cache,target=/var/cache/apt apt-get update -qq && apt-get install -qqy --no-install-recommends \
make \
build-essential \
libssl-dev \
zlib1g-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
wget \
curl \
llvm \
libncurses5-dev \
libncursesw5-dev \
xz-utils \
tk-dev \
libffi-dev \
liblzma-dev \
git \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
CMD ["python", "-m", "cog.server.http"]
It's obvious that the change to how tini is installed breaks the apt-get system. I'm not sure how it breaks internally, but reverting the tini change might be the correct solution here.
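To narrow it down further, a minimal sketch (untested on Lambdalabs) that keeps only the downloader stage, the COPY --link, and a bare apt-get update would show whether that step alone reproduces the failure:
#syntax=docker/dockerfile:1.4
FROM curlimages/curl AS downloader
ARG TINI_VERSION=0.19.0
WORKDIR /tmp
RUN curl -fsSL -O "https://github.com/krallin/tini/releases/download/v${TINI_VERSION}/tini" && chmod +x tini
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
## Only the new tini install step, followed immediately by apt-get
COPY --link --from=downloader /tmp/tini /sbin/tini
RUN apt-get update -qq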
cc @mattt
Re "Another workaround is to set gpu: false": when you set gpu: false, the base image is python:3.10, compared to nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04 otherwise.
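For context, disabling the GPU is a one-line change in cog.yaml; a minimal sketch (the python_version and predict values are illustrative, not from the original report):
build:
  gpu: false
  python_version: "3.10"
predict: "predict.py:Predictor"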
I found the nuance!
The new one is MISSING the ${TINI_ARCH} suffix:
# Old
TINI_ARCH="$(dpkg --print-architecture)"; \
curl -sSL -o /sbin/tini "https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini-${TINI_ARCH}"; \
vs
# New
RUN curl -fsSL -O "https://github.com/krallin/tini/releases/download/v${TINI_VERSION}/tini" && chmod +x tini
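A sketch of the downloader stage with the architecture suffix restored; since dpkg isn't available in the curlimages/curl image, the uname-to-asset mapping below is my assumption rather than the actual fix in #1208:
FROM curlimages/curl AS downloader
ARG TINI_VERSION=0.19.0
WORKDIR /tmp
## Map uname -m output to tini's release asset names (tini-amd64, tini-arm64)
RUN TINI_ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/aarch64/arm64/')" && \
    curl -fsSL -o tini "https://github.com/krallin/tini/releases/download/v${TINI_VERSION}/tini-${TINI_ARCH}" && \
    chmod +x tini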
I want to share some good news. After applying the fix https://github.com/replicate/cog/pull/1208, cog build and run work again!
Here's the Dockerfile:
Hi everyone, apologies - I pushed this change in hopes of making the image smaller and faster to build.
It seems like you might have an older version of the CUDA base image. The current version of 11.8.0-cudnn8-devel-ubuntu22.04 already has libc-bin installed, and also has /sbin/ldconfig.real. My guess is that maybe the rm -rf /var/lib/apt/lists/* was important. Could you post docker images --no-trunc|grep cuda, please?
Here's the output of sudo docker images --no-trunc|grep cuda:
nvidia/cuda 11.8.0-cudnn8-devel-ubuntu22.04 sha256:422a68abd82ed6f830178fadd24e9144ddc0461e558c90dd147fcc577ddea247 4 weeks ago 9.83GB
Actually, I tried adding rm -rf /var/lib/apt/lists/* first and it still failed.
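A sketch of the modification being described (my reading of it, with an abbreviated package list; not the exact Dockerfile from the report):
RUN --mount=type=cache,target=/var/cache/apt \
    rm -rf /var/lib/apt/lists/* && \
    apt-get update -qq && \
    apt-get install -qqy --no-install-recommends make build-essential && \
    rm -rf /var/lib/apt/lists/*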
Can confirm this is still broken on Lambda Labs with new instances. Given that's the recommended cloud workflow, it's not fun that there's no fix yet! Is there a workaround besides downgrading back to Cog 0.7.2?
@djj0s3, could you paste cog debug and docker images --no-trunc|grep cuda?
cog debug:
#syntax=docker/dockerfile:1.4
FROM curlimages/curl AS downloader
ARG TINI_VERSION=0.19.0
WORKDIR /tmp
RUN curl -fsSL -O "https://github.com/krallin/tini/releases/download/v${TINI_VERSION}/tini" && chmod +x tini
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64:/usr/local/nvidia/bin
COPY --link --from=downloader /tmp/tini /sbin/tini
ENTRYPOINT ["/sbin/tini", "--"]
ENV PATH="/root/.pyenv/shims:/root/.pyenv/bin:$PATH"
RUN --mount=type=cache,target=/var/cache/apt apt-get update -qq && apt-get install -qqy --no-install-recommends \
make \
build-essential \
libssl-dev \
zlib1g-dev \
libbz2-dev \
libreadline-dev \
libsqlite3-dev \
wget \
curl \
llvm \
libncurses5-dev \
libncursesw5-dev \
xz-utils \
tk-dev \
libffi-dev \
liblzma-dev \
git \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
RUN curl -s -S -L https://raw.githubusercontent.com/pyenv/pyenv-installer/master/bin/pyenv-installer | bash && \
git clone https://github.com/momo-lab/pyenv-install-latest.git "$(pyenv root)"/plugins/pyenv-install-latest && \
pyenv install-latest "3.10" && \
pyenv global $(pyenv install-latest --print "3.10") && \
pip install "wheel<1"
COPY .cog/tmp/build366469108/cog-0.0.1.dev-py3-none-any.whl /tmp/cog-0.0.1.dev-py3-none-any.whl
RUN --mount=type=cache,target=/root/.cache/pip pip install /tmp/cog-0.0.1.dev-py3-none-any.whl
RUN --mount=type=cache,target=/var/cache/apt apt-get update -qq && apt-get install -qqy ffmpeg libsm6 libxext6 && rm -rf /var/lib/apt/lists/*
COPY .cog/tmp/build366469108/requirements.txt /tmp/requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip pip install -r /tmp/requirements.txt
WORKDIR /src
EXPOSE 5000
CMD ["python", "-m", "cog.server.http"]
COPY . /src
docker images --no-trunc|grep cuda yields no output. Did you mean something else?
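One possibility, just a guess: docker needed sudo on these instances earlier in the thread, so it may be worth retrying as root:
sudo docker images --no-trunc | grep cuda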
I'm using cog 0.7.2, and I receive another error. Is there a workaround for this as well?
[+] Building 0.6s (7/7) FINISHED
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 1.89kB 0.0s
=> resolve image config for docker.io/docker/dockerfile:1.2 0.3s
=> CACHED docker-image://docker.io/docker/dockerfile:1.2@sha256:e2a8561e419ab1ba6b2fe6cbdf49fd92b95912df1cf7d313c3e2230a333fdbcc 0.0s
=> [internal] load .dockerignore 0.0s
=> [internal] load build definition from Dockerfile 0.0s
=> ERROR [internal] load metadata for docker.io/nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04 0.2s
------
> [internal] load metadata for docker.io/nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04:
------
Dockerfile:1
--------------------
1 | >>> # syntax = docker/dockerfile:1.2
2 | FROM nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04
3 | ENV DEBIAN_FRONTEND=noninteractive
--------------------
ERROR: failed to solve: docker.io/nvidia/cuda:11.2.0-cudnn8-devel-ubuntu20.04: not found
ⅹ Failed to build Docker image: exit status 1
Hi @Glavin001. Thanks for your help and patience as we try to debug this issue. I apologize for the inconvenience this caused.
We just released Cog v0.8.2. This release includes #1231, which reverts #1161, which we believe to be the cause of the regression you're seeing.
Please give that a try when you have a chance and let us know if you're still having this issue. Thanks! 🙏
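Upgrading should follow the same pattern as the downgrade above, assuming the same release asset naming:
sudo curl -o /usr/local/bin/cog -L "https://github.com/replicate/cog/releases/download/v0.8.2/cog_$(uname -s)_$(uname -m)"
sudo chmod +x /usr/local/bin/cog
cog --version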
Impact: I'm unable to build any image using Cog, and therefore can't deploy any models to Replicate.
On both Lambdalabs and TensorDock, with this cog.yaml:
I receive the following error logs: