snowzach / doods2

API for detecting objects in images and video streams using Tensorflow
MIT License

amd64-gpu image is out of date #97

Closed · brainwater closed this issue 5 months ago

brainwater commented 7 months ago

The image snowzach/doods2:amd64-gpu is out of date and, I believe, isn't compatible with CUDA 12.2.

When running docker run -it -p 8080:8080 --gpus all snowzach/doods2:amd64-gpu, I got the following error:

Traceback (most recent call last):
  File "main.py", line 8, in <module>
    from doods import Doods
  File "/opt/doods/doods.py", line 20, in <module>
    from detectors.pytorch import PyTorch
  File "/opt/doods/detectors/pytorch.py", line 7, in <module>
    import torch
  File "/usr/local/lib/python3.8/dist-packages/torch/__init__.py", line 229, in <module>
    from torch._C import *  # noqa: F403
ImportError: /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so: undefined symbol: cudaGraphDebugDotPrint, version libcudart.so.11.0

The image snowzach/doods2:amd64 worked fine on the same machine. This is a fresh installation of Ubuntu Server 22.04, with Docker Engine installed via the instructions on the Docker website (i.e. not the Ubuntu docker snap, since the snap is not compatible with GPU acceleration of containers).

I ran the following from within a container using the base image snowzach/doods2:amd64:

$ apt update
$ apt upgrade
# At this point, I was still getting the same error when I ran python3 main.py
$ pip install --upgrade pip
$ pip install --upgrade torch torchvision
# The following were to fix the unresolved dependency error that last command gave me
$ pip install --upgrade numpy
$ pip install --upgrade ultralytics
$ python3 main.py

At this point I tested it again, and it ran much faster, presumably because it was now using the GPU successfully.
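For anyone who wants to double-check rather than infer it from speed, a quick (hypothetical) way to confirm inside the container which CUDA torch was built against and whether it actually sees the GPU:

$ python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
$ python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"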

I assume the image would work if the build process were rerun, but I was unable to find any documentation of that process. I'd also appreciate instructions on building doods2 images locally.
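In the meantime, here is a rough sketch of what I'd expect a local build to look like (hypothetical; I haven't confirmed the Dockerfile layout, and the GPU variant may need a different Dockerfile or build argument):

$ git clone https://github.com/snowzach/doods2.git
$ cd doods2
# assumes a Dockerfile at the repo root; the amd64-gpu image may be built differently
$ docker build -t doods2:local .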

nvidia-smi output:

Wed Dec 13 00:57:04 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080        Off | 00000000:01:00.0 Off |                  N/A |
| 28%   35C    P8              11W / 180W |      2MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
keyboarderror commented 6 months ago

@brainwater Just thought I'd say I saw your post. I'd been having the same problem, and seeing your solution made me want to try again. This time, starting with a fresh pull, it worked immediately, something I'd never seen before. I'm running under WSL2, and it only took a bit of fiddling with WSL2 to make the port reachable beyond the local machine; no other changes were necessary. I can't say the issue is closed, since I don't see any updates in the repository, but it's definitely working for me now.
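(For reference, the usual way to expose a WSL2 port beyond the local machine is a port proxy on the Windows side; a rough sketch from an elevated prompt, with the WSL IP as a placeholder:)

C:\> wsl hostname -I
C:\> netsh interface portproxy add v4tov4 listenport=8080 listenaddress=0.0.0.0 connectport=8080 connectaddress=<WSL-IP>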

snowzach commented 6 months ago

I did actually just rebuild the image and pushed it. It may have picked up some new stuff from the base image. I meant to post here but I forgot.

keyboarderror commented 6 months ago

Actually, it seems I was mistaken when I said it was working: I had neglected to add --gpus all to the docker run command, so it was only operating in CPU mode. When I add it, the YOLOv5 startup lists the GPU instead of the CPU, but the process then exits without an error.

YOLOv5 🚀 2024-1-1 Python-3.8.10 torch-2.1.2+cu121 CUDA:0 (NVIDIA GeForce GTX 970, 4096MiB)

So mine is probably a different issue at this point but I'd love to see it working with the current CUDA. Not sure how to proceed with troubleshooting.

keyboarderror commented 6 months ago

Apologies if this is a newbie question, but what is the minimum required compute capability for running this? I didn't see anything listed. My test case is currently 5.2. If it needs significantly higher I may need to rethink my ideas. I'm using it in the context of Home Assistant and I'm not sure what would satisfy the requirements.
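(If it helps anyone compare, a hypothetical way to print a card's compute capability from inside the container:)

$ python3 -c "import torch; print(torch.cuda.get_device_capability(0))"
# prints a tuple such as (5, 2)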

snowzach commented 6 months ago

@keyboarderror Honestly, I don't know what it requires... I don't do much with the Nvidia GPU side of things. I wouldn't think it requires anything very high, since the model it uses is old but good. I really can't say for sure.

snowzach commented 6 months ago

This is the version the container has currently:

root@7ebde0b3c926:/opt/doods# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
snowzach commented 5 months ago

Okay, I tried building with updated TensorFlow quite a few times, but something was wrong with Docker Hub. I just tried again and it finally took. Maybe try now; I believe this build will have an updated CUDA.

keyboarderror commented 5 months ago

OK. It now fails with or without enabling the GPU, but at least there's an error this time. It's the same when running on the CPU, so it doesn't appear to be a CUDA problem.

sudo docker run --gpus all -it -p 8080:8080 snowzach/doods2:amd64-gpu

Traceback (most recent call last):
  File "/opt/doods/main.py", line 5, in <module>
    from api import API
  File "/opt/doods/api.py", line 8, in <module>
    from fastapi import status, FastAPI, WebSocket, WebSocketDisconnect
  File "/usr/local/lib/python3.11/dist-packages/fastapi/__init__.py", line 7, in <module>
    from .applications import FastAPI as FastAPI
  File "/usr/local/lib/python3.11/dist-packages/fastapi/applications.py", line 3, in <module>
    from fastapi import routing
  File "/usr/local/lib/python3.11/dist-packages/fastapi/routing.py", line 22, in <module>
    from fastapi.dependencies.models import Dependant
  File "/usr/local/lib/python3.11/dist-packages/fastapi/dependencies/models.py", line 3, in <module>
    from fastapi.security.base import SecurityBase
  File "/usr/local/lib/python3.11/dist-packages/fastapi/security/__init__.py", line 1, in <module>
    from .api_key import APIKeyCookie as APIKeyCookie
  File "/usr/local/lib/python3.11/dist-packages/fastapi/security/api_key.py", line 3, in <module>
    from fastapi.openapi.models import APIKey, APIKeyIn
  File "/usr/local/lib/python3.11/dist-packages/fastapi/openapi/models.py", line 103, in <module>
    class Schema(BaseModel):
  File "/usr/local/lib/python3.11/dist-packages/pydantic/main.py", line 369, in __new__
    cls.__signature__ = ClassAttribute('__signature__', generate_model_signature(cls.__init__, fields, config))
                                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pydantic/utils.py", line 231, in generate_model_signature
    merged_params[param_name] = Parameter(
                                ^^^^^^^^^^
  File "/usr/lib/python3.11/inspect.py", line 2715, in __init__
    raise ValueError('{!r} is not a valid parameter name'.format(name))
ValueError: 'not' is not a valid parameter name

brainwater commented 5 months ago

I think the fix is to update pydantic within requirements.txt.

I'm getting the same issue, ValueError: 'not' is not a valid parameter name. Here is a comment about it: https://github.com/tiangolo/fastapi/issues/5048#issuecomment-1170204100. The error is raised on line 231 of pydantic/utils.py (https://github.com/pydantic/pydantic/blob/v1.8.2/pydantic/utils.py#L231). Pydantic is pinned at an old version (1.8.2) in requirements.txt. The problem was identified in pydantic in April 2022 and a fix was merged in August 2022; pydantic v1.8.2 is about three years old and doesn't have that fix, so updating pydantic to a recent version should resolve the error.

$ sudo docker run -it -p 8080:8080 --gpus all snowzach/doods2:amd64-gpu
Traceback (most recent call last):
  File "/opt/doods/main.py", line 5, in <module>
    from api import API
  File "/opt/doods/api.py", line 8, in <module>
    from fastapi import status, FastAPI, WebSocket, WebSocketDisconnect
  File "/usr/local/lib/python3.11/dist-packages/fastapi/__init__.py", line 7, in <module>
    from .applications import FastAPI as FastAPI
  File "/usr/local/lib/python3.11/dist-packages/fastapi/applications.py", line 3, in <module>
    from fastapi import routing
  File "/usr/local/lib/python3.11/dist-packages/fastapi/routing.py", line 22, in <module>
    from fastapi.dependencies.models import Dependant
  File "/usr/local/lib/python3.11/dist-packages/fastapi/dependencies/models.py", line 3, in <module>
    from fastapi.security.base import SecurityBase
  File "/usr/local/lib/python3.11/dist-packages/fastapi/security/__init__.py", line 1, in <module>
    from .api_key import APIKeyCookie as APIKeyCookie
  File "/usr/local/lib/python3.11/dist-packages/fastapi/security/api_key.py", line 3, in <module>
    from fastapi.openapi.models import APIKey, APIKeyIn
  File "/usr/local/lib/python3.11/dist-packages/fastapi/openapi/models.py", line 103, in <module>
    class Schema(BaseModel):
  File "/usr/local/lib/python3.11/dist-packages/pydantic/main.py", line 369, in __new__
    cls.__signature__ = ClassAttribute('__signature__', generate_model_signature(cls.__init__, fields, config))
                                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pydantic/utils.py", line 231, in generate_model_signature
    merged_params[param_name] = Parameter(
                                ^^^^^^^^^^
  File "/usr/lib/python3.11/inspect.py", line 2715, in __init__
    raise ValueError('{!r} is not a valid parameter name'.format(name))
ValueError: 'not' is not a valid parameter name
$
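Until the image is rebuilt with a newer pin, a possible workaround inside the container (a sketch, assuming the pin really is pydantic==1.8.2 and the installed FastAPI still works with pydantic v1):

$ pip install --upgrade "pydantic>=1.10,<2"
$ python3 main.py api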
snowzach commented 5 months ago

Okay, I just updated everything to TensorFlow 2.14, which should have updated the CUDA version. Try it now.
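If it helps, a quick sanity check (hypothetical commands, untested here) to confirm what the container ends up with and whether TensorFlow sees the GPU:

$ python3 -c "import tensorflow as tf; print(tf.__version__)"
$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"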

keyboarderror commented 5 months ago

It's back to exiting without any errors. CPU mode works.

snowzach commented 5 months ago

It's back to exiting without any errors. CPU mode works.

But GPU does not?

keyboarderror commented 5 months ago

No. It just returns to the command prompt a couple of moments after the "Fusing layers..." message, with no error message. In CPU mode it goes on to show server messages and service requests.

brainwater commented 5 months ago

I'm getting the same behavior when using the GPU:

blake@srv-docker:~$ sudo docker pull snowzach/doods2:amd64-gpu
<snipped>
Digest: sha256:d439e0c4d43d50d023fae5e8f3056ad20c68e086ccdfd1d61d6201ee8df843fa
Status: Downloaded newer image for snowzach/doods2:amd64-gpu
docker.io/snowzach/doods2:amd64-gpu
blake@srv-docker:~$ sudo docker run -it -p 8080:8080 --gpus all snowzach/doods2:amd64-gpu
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
2024-01-30 16:28:01,935 - doods.doods - INFO - Registered detector type:tflite name:default
2024-01-30 16:28:03,518 - doods.doods - INFO - Registered detector type:tensorflow name:tensorflow
/usr/local/lib/python3.11/dist-packages/torch/hub.py:294: UserWarning: You are about to download and run code from an untrusted repository. In a future release, this won't be allowed. To add the repository to your trusted list, change the command to {calling_fn}(..., trust_repo=False) and a command prompt will appear asking for an explicit confirmation of trust, or load(..., trust_repo=True), which will assume that the prompt is to be answered with 'yes'. You can also use load(..., trust_repo='check') which will only prompt for confirmation if the repo is not already trusted. This will eventually be the default behaviour
  warnings.warn(
Downloading: "https://github.com/ultralytics/yolov5/zipball/master" to /root/.cache/torch/hub/master.zip
YOLOv5 🚀 2024-1-30 Python-3.11.0rc1 torch-2.1.2+cu121 CUDA:0 (NVIDIA GeForce GTX 1080, 8112MiB)

Downloading https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5s.pt to yolov5s.pt...
100%|█████████████████████████████████████████████████████████████████| 14.1M/14.1M [00:00<00:00, 62.9MB/s]

Fusing layers...
blake@srv-docker:~$

Tonight I'll see if I can debug it to get more details on exactly where it had an error.
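As a first step (just a sketch), I'll probably check the container's exit code, which should at least distinguish a crash (e.g. 139 for a segfault) from a clean exit:

$ sudo docker ps -a --last 1
$ sudo docker inspect --format '{{.State.ExitCode}}' <container-id>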

brainwater commented 5 months ago

The problem is due to out-of-date apt packages. I can work around the issue by running apt update && apt upgrade -y within the container before running doods2. There are 58 out-of-date apt packages and 67 out-of-date pip packages.

blake@srv-docker:~$ sudo docker run --entrypoint=bash -it -p 8081:8080 --gpus all snowzach/doods2:amd64-gpu
<snipped>
root@e915b583aeee:/opt/doods# apt update
<snipped>
root@e915b583aeee:/opt/doods# apt upgrade
<snipped>
root@e915b583aeee:/opt/doods# python3 main.py api
<doods2 is now running>
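For anyone who wants that workaround as a non-interactive one-liner (same idea, untested beyond the interactive version above):

$ sudo docker run -it --gpus all -p 8080:8080 --entrypoint bash snowzach/doods2:amd64-gpu \
    -c "apt-get update && apt-get upgrade -y && python3 main.py api"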
keyboarderror commented 5 months ago

Confirmed that fixes it here too. Excellent. And thanks @brainwater for the --entrypoint=bash switch. I'm still pretty new to Docker and couldn't figure out how to get a persistent shell if the container didn't want to run. Now I can poke around.
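(In case it helps other Docker newcomers: --entrypoint=bash gets you a shell in a container that won't start on its own, and for a container that's already running, something like the following should work:)

$ sudo docker exec -it <container-id> bash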

snowzach commented 5 months ago

Awesome! Thanks for tracking that down. I updated the Docker builds and pushed everything out. I even dug out my GTX 970 and verified it runs now. Closing this issue; LMK if there are still problems.

brainwater commented 5 months ago

It's working for me now. Thanks for your work @snowzach !

keyboarderror commented 5 months ago

Yes, I pulled the update and it works immediately. Thank you very much @snowzach!