replicate / cog

Containers for machine learning
https://cog.run
Apache License 2.0

Fail while pushing with --separate-weights #1323

Open wnakano opened 11 months ago

wnakano commented 11 months ago

Today I started hitting the following issue when running cog push --separate-weights, although I was able to push the model without the --separate-weights flag.

In the error output below, I replaced the project and model names with <project-name> and <model-name>, respectively.

$ cog push --separate-weights
⚠ Cog doesn't know if CUDA 11.2.2 is compatible with PyTorch 1.13.1. This might cause CUDA problems.
Building Docker image from environment in cog.yaml as r8.im/<project-name>/<model-name> ...
Weights unchanged, skip rebuilding and use cached image...
[+] Building 4.0s (7/7) FINISHED                                                      docker:default
 => [internal] load .dockerignore                                                               0.0s
 => => transferring context: 22.25kB                                                            0.0s
 => [internal] load build definition from Dockerfile                                            0.0s
 => => transferring dockerfile: 4.41kB                                                          0.0s
 => resolve image config for docker.io/docker/dockerfile:1.4                                    1.6s
 => CACHED docker-image://docker.io/docker/dockerfile:1.4@sha256:9ba7531bd80fb0a858632727cf7a1  0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04          1.2s
 => ERROR [internal] load metadata for r8.im/<project-name>/<model-name>  2.2s
 => [auth] <project-name>/<model-name> -weights:pull token for r8.im    0.0s
------
 > [internal] load metadata for r8.im/<project-name>/<model-name>-weights:latest:
------
Dockerfile:2
--------------------
   1 |     #syntax=docker/dockerfile:1.4
   2 | >>> FROM r8.im/<project-name>/<model-name>-weights AS weights
   3 |     FROM nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04
   4 |     ENV DEBIAN_FRONTEND=noninteractive
--------------------
ERROR: failed to solve: failed to authorize: failed to fetch oauth token: unexpected status from GET request to https://r8.im/_token?scope=repository%3A<project-name>%2F<model-name>-weights%3Apull&service=us-docker.pkg.dev: 404 Not Found
ⅹ Failed to build runner Docker image: Failed to build Docker image: exit status 1

andreemic commented 11 months ago

same here!

usamaehsan commented 10 months ago

facing same issue

a-sane commented 9 months ago

Facing the same issue today, but a week ago it worked well with --separate-weights.

ynie commented 9 months ago

can we get some help on this?

ynie commented 9 months ago

@hongchaodeng I saw you implemented this feature. Do you know what's going on? Thank you so much!!

masahiro-koga-jai commented 7 months ago

I got a similar error, but deleting "path/to/your/cog_project/.dockerignore" and "path/to/your/cog_project/.dockerignore/.cog" files solved it for me.
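A minimal sketch of that workaround as a shell helper, in case it is useful to script. The function name reset_cog_build_state is hypothetical and not part of cog. It also assumes that the second path in the comment above refers to the project's .cog folder, and it moves .dockerignore to a backup instead of deleting it so the change can be undone.

```shell
#!/bin/sh
# Hypothetical helper (not part of cog): clear the local state that commenters
# in this thread reported as the cause of the failed build.
# Argument: path to the cog project root (defaults to the current directory).
reset_cog_build_state() {
  project_dir="${1:-.}"

  # Keep a backup of .dockerignore rather than deleting it outright.
  if [ -f "$project_dir/.dockerignore" ]; then
    mv "$project_dir/.dockerignore" "$project_dir/.dockerignore.bak"
  fi

  # Assumption: ".cog" is cog's local build folder; remove it so the next
  # build or push starts from a clean state.
  if [ -d "$project_dir/.cog" ]; then
    rm -r "$project_dir/.cog"
  fi
}
```

After sourcing the function, something like `reset_cog_build_state . && cog push --separate-weights` would retry the push from a clean state.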

hervenivon commented 4 months ago

I faced a similar issue too. The docker build was failing to find the copied data.

 => ERROR [1/4] COPY checkpoints/canny /src/checkpoints/canny            0.0s
 => ERROR [2/4] COPY checkpoints/ip_adapter /src/checkpoints/ip_adapter  0.0s
 => ERROR [3/4] COPY checkpoints/tile /src/checkpoints/tile              0.0s
 => ERROR [4/4] COPY checkpoints/vae /src/checkpoints/vae
...
Dockerfile:11
--------------------
   9 |     COPY checkpoints/canny /src/checkpoints/canny
  10 |     COPY checkpoints/ip_adapter /src/checkpoints/ip_adapter
  11 | >>> COPY checkpoints/vae /src/checkpoints/vae
--------------------
ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref 46e45d4e-74bc-4316-b8d3-ef813683c1c8::umpry926pu2og534hz3uqwpxt: "checkpoints/vae": not found

even though the files were actually there.

ynie commented 4 months ago

I stopped using replicate due to the poor tech support and framework.

hervenivon commented 4 months ago

What are you using as a replacement?

ynie commented 4 months ago

Runpod is way better with better support.

hervenivon commented 4 months ago

PS: as @masahiro-koga-jai found, deleting the .dockerignore solved it for me. The .dockerignore is updated during cog build, and the updated file evidently conflicts with the build.

emcmanus commented 4 months ago

@ynie @hervenivon This and some other issues lead to a frustrating DX on Replicate, but YMMV building on Runpod. Personally my experience matches the reports here https://www.reddit.com/r/LocalLLaMA/comments/17il9n3/experience_on_runpod/

(I would definitely prefer Runpod's 4090s over A40s for image gen – they're half the price and twice as fast.)

emcmanus commented 4 months ago

You may also need to rm -r .cog/. I believe I got this error after a bad cog push --separate-weights.

My guess is that r8.im/<project-name>/<model-name>-weights only gets created on the first invocation.

Deleting Cog's build folder seems to have forced it to create the missing image.
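One way to check that guess before deleting anything is to ask the registry whether the weights image exists. This is only a sketch: the helper names are hypothetical, the image name follows the r8.im/<project-name>/<model-name>-weights form seen in the error log, and it assumes a Docker CLI that provides `docker manifest inspect` and that you are already authenticated against r8.im (for example via `cog login`).

```shell
#!/bin/sh
# Hypothetical helpers (not part of cog or docker).

# Print the weights image reference in the form shown in the error log.
# Arguments: project (owner) name, model name.
weights_image_ref() {
  printf 'r8.im/%s/%s-weights\n' "$1" "$2"
}

# Return success if the registry serves a manifest for the weights image.
# Assumption: a non-zero exit means the image is missing, but it can also
# mean an authentication or network problem, so treat it as a hint only.
weights_image_exists() {
  docker manifest inspect "$(weights_image_ref "$1" "$2")" > /dev/null 2>&1
}
```

For example, `weights_image_exists my-project my-model || echo "weights image not found"` (with your own names substituted) would suggest that the weights image was never pushed.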

ynie commented 4 months ago

I'm still shocked that this is still an issue after so many months. I remember wasting so many hours trying to fix this. Does anyone working at Replicate care?

emcmanus commented 4 months ago

Based on their Discord, my sense is they're absolutely swamped by end-users who mostly want to use the web frontends for various tools. Ideally Replicate knows this is not their core business, but I'm not so sure. I suspect they're feeling stronger PMF on the front-end than on the infra side of things.

hervenivon commented 4 months ago

Actually, I find cog super convenient for some of the projects I'm working on, but I do agree that the UX has some flaws.

Glad to find support in the community. Thanks! 🙏

narendraadloid commented 3 months ago

Yes, I had added .cog/ to my .dockerignore file; removing that entry solved the problem for me.
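If you would rather keep the rest of your .dockerignore, a small sketch like the following removes only that entry. The function name remove_cog_ignore_entry is hypothetical, and it assumes the entry is written exactly as `.cog` or `.cog/` on its own line; a backup copy is written first.

```shell
#!/bin/sh
# Hypothetical helper (not part of cog): drop a ".cog" or ".cog/" line from
# a .dockerignore file, leaving every other entry untouched.
# Argument: path to the ignore file (defaults to ./.dockerignore).
remove_cog_ignore_entry() {
  ignore_file="${1:-.dockerignore}"

  # Nothing to do if the file does not exist.
  [ -f "$ignore_file" ] || return 0

  # Nothing to do if there is no matching entry.
  grep -q -E '^\.cog/?$' "$ignore_file" || return 0

  # Back up the file, then rewrite it without the matching lines.
  cp "$ignore_file" "$ignore_file.bak"
  # grep -v exits non-zero when every line was filtered out; that is fine here.
  grep -v -E '^\.cog/?$' "$ignore_file.bak" > "$ignore_file" || true
}
```

Running `remove_cog_ignore_entry .dockerignore` from the project root leaves a .dockerignore.bak next to the edited file.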