replicate / cog

Containers for machine learning
https://cog.run
Apache License 2.0

AssertionError: Torch not compiled with CUDA enabled #1397

Closed: whenmoon closed this issue 11 months ago

whenmoon commented 11 months ago

This is not necessarily a bug in cog, but a problem I'm having running a prediction model with cog. I am running a local version of this model from Replicate: https://replicate.com/schananas/grounded_sam. I have cloned the repo here: https://github.com/schananas/grounded_sam_replicate, but after running sudo cog predict I get AssertionError("Torch not compiled with CUDA enabled"). The full trace is:

Running prediction: ae54a17e-03e6-4f09-b89a-feaf77ddbde6...
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/cog/server/worker.py", line 222, in _predict
for r in result:
File "/usr/local/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 43, in generator_context
response = gen.send(None)
File "/src/predict.py", line 79, in predict
annotated_picture_mask, neg_annotated_picture_mask, mask, inverted_mask = run_grounding_sam(image,
File "/src/grounded_sam.py", line 81, in run_grounding_sam
annotated_frame, detected_boxes = detect(image, image_source, positive_prompt, groundingdino_model)
File "/src/grounded_sam.py", line 37, in detect
boxes, logits, phrases = predict(
File "/src/weights/GroundingDINO/groundingdino/util/inference.py", line 64, in predict
model = model.to(device)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in to
return self._apply(convert)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 639, in _apply
module._apply(fn)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 639, in _apply
module._apply(fn)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 639, in _apply
module._apply(fn)
[Previous line repeated 3 more times]
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 662, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 985, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 221, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

I am using cog 0.8.6, Python 3.10, PyTorch 1.13.0, and torchvision 0.14.0 on macOS 12.6. I have these settings in code:

cog.yaml

build:
  gpu: false

predict.py:

os.environ['BUILD_WITH_CUDA'] = 'false'

Any help much appreciated!
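The usual guard for this assertion is to select the device at runtime instead of hard-coding "cuda" anywhere in the model code. A minimal sketch of that pattern (the helper name is ours, not from the repo); it falls back to CPU when torch is missing or was built without CUDA support:

```python
def pick_device(prefer: str = "cuda") -> str:
    """Return "cuda" only when a CUDA-enabled torch build and a GPU are both present."""
    try:
        import torch
        if prefer == "cuda" and torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"

# Then, instead of model.to("cuda"):
# model = model.to(pick_device())
```

With gpu: false in cog.yaml, any library code that still calls .to("cuda") unconditionally will raise exactly this error, regardless of the environment variables set in predict.py.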

dkhokhlov commented 11 months ago

For https://github.com/schananas/grounded_sam_replicate I am getting a different error with the latest cog from the main branch, on a Linux host without CUDA:

Starting Docker image cog-groundedsamreplicate-base and running setup()...
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
Missing device driver, re-trying without GPU
Error response from daemon: page not found
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/cog/server/http.py", line 403, in <module>
    app = create_app(
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/cog/server/http.py", line 94, in create_app
    predictor = load_predictor_from_ref(predictor_ref)
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/cog/predictor.py", line 192, in load_predictor_from_ref
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/src/predict.py", line 16, in <module>
    os.chdir("/src/weights/GroundingDINO")
FileNotFoundError: [Errno 2] No such file or directory: '/src/weights/GroundingDINO'
ⅹ Failed to get container status: exit status 1

It correctly detected the missing CUDA driver:

Missing device driver, re-trying without GPU

but it failed later on the missing weights. Is that expected?

dkhokhlov commented 11 months ago

After downloading the weights using the script:

pip install huggingface_hub
python script/download_weights.py

Now I am getting a different error; model setup fails:

File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/cog/predictor.py", line 73, in run_setup
predictor.setup()
File "/src/predict.py", line 49, in setup
self.groundingdino_model = load_model_hf(device)
File "/src/predict.py", line 40, in load_model_hf
args = SLConfig.fromfile(cache_config_file)
File "/src/weights/GroundingDINO/groundingdino/util/slconfig.py", line 185, in fromfile
cfg_dict, cfg_text = SLConfig._file2dict(filename)
File "/src/weights/GroundingDINO/groundingdino/util/slconfig.py", line 79, in _file2dict
check_file_exist(filename)
File "/src/weights/GroundingDINO/groundingdino/util/slconfig.py", line 23, in check_file_exist
raise FileNotFoundError(msg_tmpl.format(filename))
FileNotFoundError: file "/home/dmitri/SOURCE/Thirdparty/replicate/grounded_sam_replicate/weights/models--ShilongLiu--GroundingDINO/snapshots/a94c9b567a2a374598f05c584e96798a170c56fb/GroundingDINO_SwinB.cfg.py" does not exist
ⅹ Model setup failed
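For reference, the path in that error follows huggingface_hub's on-disk cache layout, which is why weights downloaded on the host end up under a host-absolute path (/home/dmitri/...) that the container cannot see. A sketch of that layout as plain string logic (illustrative only, not the library's code):

```python
def snapshot_path(cache_dir: str, repo_id: str, revision: str, filename: str) -> str:
    # huggingface_hub caches files as:
    #   <cache_dir>/models--<org>--<name>/snapshots/<revision>/<filename>
    org, name = repo_id.split("/")
    return f"{cache_dir}/models--{org}--{name}/snapshots/{revision}/{filename}"
```

Downloading inside the container instead makes cache_dir resolve to a container path such as /src/weights, which is what predict.py expects.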

dkhokhlov commented 11 months ago

OK, after correctly installing the weights inside the container using:

cog run script/download_weights.py

I reproduced the original _lazy_init error with the latest Cog:

File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/nn/modules/module.py", line 985, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "/root/.pyenv/versions/3.10.13/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

Looking into it.
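A related gotcha when running CUDA-trained models on CPU-only hosts: checkpoints saved on a GPU fail to load unless torch.load is given map_location="cpu". A generic sketch of that guard (not code from GroundingDINO; it degrades gracefully if torch is absent):

```python
import io

try:
    import torch

    buf = io.BytesIO()
    # Stand-in for a checkpoint file; on a GPU box the tensors could be CUDA tensors.
    torch.save({"w": torch.zeros(2)}, buf)
    buf.seek(0)
    # map_location="cpu" remaps any CUDA storages onto the CPU at load time,
    # avoiding a CUDA init attempt on machines with no NVIDIA driver.
    state = torch.load(buf, map_location="cpu")
except ImportError:
    state = None
```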

dkhokhlov commented 11 months ago

The problem is in the GroundingDINO source.

Here is a patch that makes it work: fix_cuda_to_cpu.patch

Steps:

git clone https://github.com/schananas/grounded_sam_replicate
cd grounded_sam_replicate

# download weights
cog run script/download_weights.py

# apply attached patch
git apply fix_cuda_to_cpu.patch

# run
cog predict
...
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Done!
Written output to output.0.jpg
Written output to output.1.jpg
Written output to output.2.jpg
Written output to output.3.jpg

Result: (attached output image)

dkhokhlov commented 11 months ago

Resolved.