replicate / cog

Containers for machine learning
https://cog.run
Apache License 2.0

Tensorflow is erased from python_packages and not installed #710

Open andreasjansson opened 2 years ago

andreasjansson commented 2 years ago

The following cog.yaml will not install any version of tensorflow:

build:
  gpu: true
  cuda: "11.6.2"
  python_version: "3.10"
  python_packages:
    - "diffusers==0.2.4"
    - "torch==1.12.1 --extra-index-url=https://download.pytorch.org/whl/cu116"
    - "ftfy==6.1.1"
    - "scipy==1.9.0"
    - "transformers==4.21.1"

    # FILM requirements
    - "tensorflow==2.8.0"
    - "tensorflow-datasets==4.4.0"
    - "tensorflow-addons==0.16.1"
    - "absl-py==0.12.0"
    - "gin-config==0.5.0"
    - "parameterized==0.8.1"
    - "mediapy==1.0.3"
    - "scikit-image==0.19.1"
    - "gdown==4.4.0"

  run:
    - "git clone https://github.com/google-research/frame-interpolation.git /frame-interpolation"

predict: "predict_animate.py:Predictor"
image: "r8.im/andreasjansson/stable-diffusion-animation"

However, when I add pip install tensorflow==2.8.0 to the run: section instead, tensorflow is installed and everything works.
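For anyone hitting the same problem, the workaround described above might look roughly like the fragment below. This is only a sketch: dropping the tensorflow entry from python_packages and placing the pip command before the git clone are assumptions, and the remaining packages are meant to stay as in the config above.

```yaml
build:
  gpu: true
  cuda: "11.6.2"
  python_version: "3.10"
  python_packages:
    - "torch==1.12.1 --extra-index-url=https://download.pytorch.org/whl/cu116"
    # ... remaining packages unchanged from the config above ...
    # "tensorflow==2.8.0" is left out here (assumption), since it gets
    # dropped from python_packages anyway.
  run:
    # Install tensorflow with pip directly, bypassing python_packages.
    - "pip install tensorflow==2.8.0"
    - "git clone https://github.com/google-research/frame-interpolation.git /frame-interpolation"
```

This only sidesteps the problem. It does not explain why the python_packages entry is dropped.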

My guess is that something in the CUDA compatibility logic is breaking.

bfirsh commented 2 years ago

Probably broken in #696 or #697.

Not sure whether it's this issue specifically, but every time we run the scraper something seems to break, and quite often versions are removed because bits of the scraper are broken.

As I think we've talked about before, I wonder whether we should have a manually maintained compatibility matrix, with a script to help us add new entries. That would ensure old entries don't break.
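To make the idea concrete, a hand-maintained matrix could be a small checked-in file along these lines. The layout and every field name here are hypothetical, not cog's actual data format. The CUDA and cuDNN values are the ones I believe the upstream projects publish for these two releases, and should be double-checked before use.

```yaml
# Hypothetical hand-maintained compatibility matrix (not cog's real format).
# A helper script would append new entries and never rewrite existing ones.
tensorflow:
  - version: "2.8.0"
    cuda: "11.2"
    cudnn: "8.1"
    python: ["3.7", "3.8", "3.9", "3.10"]
torch:
  - version: "1.12.1"
    cuda: ["10.2", "11.3", "11.6"]
    python: ["3.7", "3.8", "3.9", "3.10"]
```

Because entries are only ever added, a version that resolved yesterday cannot silently disappear after a scraper run.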

bfirsh commented 2 years ago

I investigated this a bit and I don't think it's related to #696 or #697. I think this is a problem with installing both torch and tensorflow together. I reckon that's just not possible right now, and it fails silently. A quick fix would be to simply throw an error if you try to do that. Or we could just skip CUDA version resolution for Tensorflow, or something similar.

onorabil commented 1 year ago

I can confirm this issue affects the latest cog version (0.6.1) and the tensorflow x pytorch combination.

Works with gpu: false though.

I believe the issue is here: https://github.com/replicate/cog/blob/1f8fec1a52eb407d4be4271726ce29f46f8e543b/pkg/config/compatibility.go#L260

I'm not really familiar with Go, but I could try a PR if anyone is interested. I see no reason for this to fail if compatible versions are available.