imagegenius / docker-obico

Monolithic (Single) Docker Container for Obico Server
https://hub.docker.com/r/imagegenius/obico
GNU General Public License v3.0

GPU Acceleration #7

ShadowTime1290 closed this issue 1 year ago

ShadowTime1290 commented 1 year ago

What would be the best way to enable GPU acceleration, if at all possible?

It looks to be supported through native installation.

https://www.obico.io/docs/server-guides/advanced/nvidia-gpu/

hydazz commented 1 year ago

It is not currently possible here, as this image is based on Ubuntu 20.04 rather than 16.04, which the models are compiled for. I don't know whether this matters; I'm investigating it now.

GPU acceleration will also add 2GB to the image size, as the CUDA toolkit is massive.

Edit: indeed, the models need to be rebuilt: OSError: /lib/x86_64-linux-gnu/libcublas.so.9.0: version 'libcublas.so.9.0' not found (required by /app/obico/ml_api/bin/model_gpu_x86_64.so). It's probably possible to install these packages from the bionic source, but that will probably introduce more issues.

autumnwalker commented 1 year ago

+1 for GPU acceleration - would be really nice to get CUDA support.

Is there an option to use an external AI solution like CodeProject.AI or something that already has CUDA support?

hydazz commented 1 year ago

This could possibly be supported in the new Obico release, once I get my head around all the conflicting pip packages.

https://github.com/imagegenius/docker-obico/pull/11

hydazz commented 1 year ago

Would you take a 3.5GB+ image for GPU acceleration over a 1GB image?

autumnwalker commented 1 year ago

Would you take a 3.5GB+ image for GPU acceleration over a 1GB image?

I would!

hydazz commented 1 year ago

Try giving ghcr.io/imagegenius/igpipepr-obico:bfdccef9-pkg-8d778c6a-pr-12 a shot. The image is 7.69GB uncompressed, so yeah, make some room...

As stated in the Obico docs, you'll see

...
obico-server-ml_api-1  | ----- Trying to load weights: /app/lib/../model/model-weights.xxxx - **use_gpu = True** -----
...
Succeeded!
...

if it's using the GPU
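
A quick way to check for that line without reading the whole log (the container name "obico" is just an assumption here):

# print only the model-loading lines from the container log
docker logs obico 2>&1 | grep -E "use_gpu|Succeeded"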

autumnwalker commented 1 year ago

Working! I see it find my GPU in the logs during startup.

I am running this using Unraid so I had to add "--runtime=nvidia" to "Extra Parameters" and I had to add two variables: "NVIDIA_VISIBLE_DEVICES" and "NVIDIA_DRIVER_CAPABILITIES".

hydazz commented 1 year ago

Working! I see it find my GPU in the logs during startup.

I am running this using Unraid so I had to add "--runtime=nvidia" to "Extra Parameters" and I had to add two variables: "NVIDIA_VISIBLE_DEVICES" and "NVIDIA_DRIVER_CAPABILITIES".

You can just set --gpus=all in extra parameters; no need to set the runtime or those variables.
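
For anyone not on Unraid, the two approaches translate roughly to the docker run invocations below; the container name, volume path and published port are placeholders, and the image tag is the PR build mentioned above. Both require the NVIDIA Container Toolkit on the host.

# Option 1: modern syntax, hands the GPU straight to the container
docker run -d --name obico --gpus all \
  -p 3334:3334 \
  -v /path/to/appdata/obico:/config \
  ghcr.io/imagegenius/igpipepr-obico:bfdccef9-pkg-8d778c6a-pr-12

# Option 2: older nvidia runtime plus the two environment variables
docker run -d --name obico --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -p 3334:3334 \
  -v /path/to/appdata/obico:/config \
  ghcr.io/imagegenius/igpipepr-obico:bfdccef9-pkg-8d778c6a-pr-12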

hydazz commented 1 year ago

Working! I see it find my GPU in the logs during startup.

I am running this using Unraid so I had to add "--runtime=nvidia" to "Extra Parameters" and I had to add two variables: "NVIDIA_VISIBLE_DEVICES" and "NVIDIA_DRIVER_CAPABILITIES".

no gripes?

autumnwalker commented 1 year ago

You can just set --gpus=all in extra parameters; no need to set the runtime or those variables.

You are correct! That worked as well. Thanks!

autumnwalker commented 1 year ago

no gripes?

Appears to be working! Thank you.

hydazz commented 1 year ago

I couldn't get it to work with just the CPU after installing the CUDA dependencies

Let's try darknet lib built with GPU support - /darknet/libdarknet_gpu.so
Done! Hooray! Now we have darknet with GPU support.

----- Trying to load weights: /app/obico/ml_api/lib/../model/model-weights.darknet - use_gpu = True -----
 Try to load cfg: /app/obico/ml_api/model/model.cfg, weights: /app/obico/ml_api/lib/../model/model-weights.darknet, clear = 0 
CUDA status Error: file: ./src/dark_cuda.c: func: get_gpu_compute_capability() line: 619

 CUDA Error: no CUDA-capable device is detected
/usr/bin/python3.7: get_gpu_compute_capability: Unknown error 368483375
[2023-06-29 22:48:28 +1000] [2031] [INFO] Booting worker with pid: 2031

Let's try darknet lib built with GPU support - /darknet/libdarknet_gpu.so
Done! Hooray! Now we have darknet with GPU support.

worker: Warm shutdown (MainProcess)
[2023-06-29 22:48:28 +1000] [1278] [INFO] Handling signal: term
[2023-06-29 12:48:28,725: INFO/MainProcess] beat: Shutting down...
----- Trying to load weights: /app/obico/ml_api/lib/../model/model-weights.darknet - use_gpu = True -----
 Try to load cfg: /app/obico/ml_api/model/model.cfg, weights: /app/obico/ml_api/lib/../model/model-weights.darknet, clear = 0 
CUDA status Error: file: ./src/dark_cuda.c: func: get_gpu_compute_capability() line: 619

 CUDA Error: no CUDA-capable device is detected
/usr/bin/python3.7: get_gpu_compute_capability: Unknown error 368483375
[2023-06-29 22:48:28 +1000] [1278] [INFO] Shutting down: Master

(kept repeating this non-stop)

So I'm separating the branches into main and cuda (self-explanatory names). Once :cuda is ready for testing, hopefully someone here will be willing...

autumnwalker commented 1 year ago

Can I just throw the :cuda branch onto mine and pull?

hydazz commented 1 year ago

Yes, once I get it to build

autumnwalker commented 1 year ago

Ok. Standing by.

hydazz commented 1 year ago

ghcr.io/imagegenius/obico:cuda - give it a shot; afterwards you can revert back to ghcr.io/imagegenius/obico:bfdccef9-ig57 (same as what you're probably on now). :latest does not include GPU dependencies anymore.
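
Switching tags on an existing install is just a repull and recreate, roughly (container name assumed):

docker pull ghcr.io/imagegenius/obico:cuda
docker stop obico && docker rm obico
# then recreate the container with the same settings, pointing at the :cuda tag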

autumnwalker commented 1 year ago

Currently on ghcr.io/imagegenius/igpipepr-obico:bfdccef9-pkg-8d778c6a-pr-12

I have switched to ghcr.io/imagegenius/obico:cuda

Here is what I'm seeing in the logs - Obico does boot despite this.

2023-06-29 12:19:16.740658694 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:541 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.
[2023-06-29 15:19:17,543: INFO/MainProcess] Connected to redis://192.168.2.6:6380//
[2023-06-29 15:19:17,551: INFO/MainProcess] mingle: searching for neighbors
[2023-06-29 15:19:17,628: INFO/Beat] beat: Starting...

Let's try darknet lib built with GPU support - /darknet/libdarknet_gpu.so
Nope! Failed to load darknet lib built with GPU support. erors=libcudnn.so.8: cannot open shared object file: No such file or directory
Now let's try darknet lib on CPU - /darknet/libdarknet_cpu.so
Error during importing YoloNet! - /darknet/libdarknet_cpu.so: cannot open shared object file: No such file or directory
----- Trying to load weights: /app/obico/ml_api/lib/../model/model-weights.darknet - use_gpu = True -----
Failed! - Not loading darknet net due to previous import failure. Check earlier log for errors.
----- Trying to load weights: /app/obico/ml_api/lib/../model/model-weights.onnx - use_gpu = True -----
Succeeded!

hydazz commented 1 year ago

Currently on ghcr.io/imagegenius/igpipepr-obico:bfdccef9-pkg-8d778c6a-pr-12

I have switched to ghcr.io/imagegenius/obico:cuda

Here is what I'm seeing in the logs - Obico does boot despite this.

2023-06-29 12:19:16.740658694 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:541 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.
[2023-06-29 15:19:17,543: INFO/MainProcess] Connected to redis://192.168.2.6:6380//
[2023-06-29 15:19:17,551: INFO/MainProcess] mingle: searching for neighbors
[2023-06-29 15:19:17,628: INFO/Beat] beat: Starting...

Let's try darknet lib built with GPU support - /darknet/libdarknet_gpu.so
Nope! Failed to load darknet lib built with GPU support. erors=libcudnn.so.8: cannot open shared object file: No such file or directory
Now let's try darknet lib on CPU - /darknet/libdarknet_cpu.so
Error during importing YoloNet! - /darknet/libdarknet_cpu.so: cannot open shared object file: No such file or directory
----- Trying to load weights: /app/obico/ml_api/lib/../model/model-weights.darknet - use_gpu = True -----
Failed! - Not loading darknet net due to previous import failure. Check earlier log for errors.
----- Trying to load weights: /app/obico/ml_api/lib/../model/model-weights.onnx - use_gpu = True -----
Succeeded!

Try repulling the latest :cuda.

I don't get these errors; however, my printer is not connected, so darknet is probably not initialising without the printer connected.

[2023-06-30 09:23:20 +1000] [775] [INFO] Starting gunicorn 19.9.0
[2023-06-30 09:23:20 +1000] [775] [INFO] Listening at: http://0.0.0.0:3333 (775)
[2023-06-30 09:23:20 +1000] [775] [INFO] Using worker: sync
[2023-06-30 09:23:20 +1000] [808] [INFO] Booting worker with pid: 808
django.db.backends DEBUG    (0.002) 
            SELECT name, type FROM sqlite_master
            WHERE type in ('table', 'view') AND NOT name='sqlite_sequence'
            ORDER BY name; args=None
django.db.backends DEBUG    (0.000) SELECT "django_migrations"."app", "django_migrations"."name" FROM "django_migrations"; args=()
2023-06-30 09:23:21.405498760 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:541 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.
[2023-06-29 23:23:21,620: INFO/MainProcess] Connected to redis://localhost:6379//
[2023-06-29 23:23:21,623: INFO/MainProcess] mingle: searching for neighbors
[2023-06-29 23:23:21,639: INFO/Beat] beat: Starting...
[2023-06-29 23:23:22,630: INFO/MainProcess] mingle: all alone
[2023-06-29 23:23:22,635: WARNING/MainProcess] /usr/lib/python3.7/site-packages/celery/fixups/django.py:206: UserWarning: Using settings.DEBUG leads to a memory
            leak, never use this setting in production environments!
  leak, never use this setting in production environments!''')
[2023-06-29 23:23:22,635: INFO/MainProcess] celery@701544e29c04 ready.

autumnwalker commented 1 year ago

Pulled latest. Getting:

2023-06-30 08:35:44.908443042 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:541 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.
[2023-06-30 11:35:45,754: INFO/Beat] beat: Starting...
[2023-06-30 11:35:45,921: INFO/MainProcess] Connected to redis://192.168.2.6:6380//
[2023-06-30 11:35:45,929: INFO/MainProcess] mingle: searching for neighbors

Let's try darknet lib built with GPU support - /darknet/libdarknet_gpu.so
Nope! Failed to load darknet lib built with GPU support. erors=libcudnn.so.8: cannot open shared object file: No such file or directory
Now let's try darknet lib on CPU - /darknet/libdarknet_cpu.so
Done! Darknet is now running on CPU.

----- Trying to load weights: /app/obico/ml_api/lib/../model/model-weights.darknet - use_gpu = True -----
Failed! - I respectfully decline to load the net as I am asked to use GPU but the loaded darknet module does NOT have GPU support
----- Trying to load weights: /app/obico/ml_api/lib/../model/model-weights.onnx - use_gpu = True -----
Succeeded!

hydazz commented 1 year ago

Pulled latest. Getting:

2023-06-30 08:35:44.908443042 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:541 CreateExecutionProviderInstance] Failed to create CUDAExecutionProvider. Please reference https://onnxruntime.ai/docs/reference/execution-providers/CUDA-ExecutionProvider.html#requirements to ensure all dependencies are met.
[2023-06-30 11:35:45,754: INFO/Beat] beat: Starting...
[2023-06-30 11:35:45,921: INFO/MainProcess] Connected to redis://192.168.2.6:6380//
[2023-06-30 11:35:45,929: INFO/MainProcess] mingle: searching for neighbors

Let's try darknet lib built with GPU support - /darknet/libdarknet_gpu.so
Nope! Failed to load darknet lib built with GPU support. erors=libcudnn.so.8: cannot open shared object file: No such file or directory
Now let's try darknet lib on CPU - /darknet/libdarknet_cpu.so
Done! Darknet is now running on CPU.

----- Trying to load weights: /app/obico/ml_api/lib/../model/model-weights.darknet - use_gpu = True -----
Failed! - I respectfully decline to load the net as I am asked to use GPU but the loaded darknet module does NOT have GPU support
----- Trying to load weights: /app/obico/ml_api/lib/../model/model-weights.onnx - use_gpu = True -----
Succeeded!

Apologies for the delay; this should have been fixed a few weeks ago: https://github.com/imagegenius/docker-obico/commit/d4f608b33a95b8e6236834e0b85826ef95b0296b
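
If the libcudnn complaint comes back after the update, a quick sanity check is to confirm the library is actually visible inside the running container (container name assumed):

# list the shared libraries the container's loader knows about, filtered to cudnn
docker exec obico ldconfig -p | grep libcudnn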

autumnwalker commented 1 year ago

Updated! Now getting the following:

CUDA Error: forward compatibility was attempted on non supported HW
[2023-07-24 11:31:31 -0300] [3179] [INFO] Booting worker with pid: 3179
/usr/bin/python3.7: get_gpu_compute_capability: Unknown error -1645007825
 Try to load cfg: /app/obico/ml_api/model/model.cfg, weights: /app/obico/ml_api/lib/../model/model-weights.darknet, clear = 0 
CUDA status Error: file: ./src/dark_cuda.c: func: get_gpu_compute_capability() line: 619

hydazz commented 1 year ago

Updated! Now getting the following:

CUDA Error: forward compatibility was attempted on non supported HW
[2023-07-24 11:31:31 -0300] [3179] [INFO] Booting worker with pid: 3179
/usr/bin/python3.7: get_gpu_compute_capability: Unknown error -1645007825
 Try to load cfg: /app/obico/ml_api/model/model.cfg, weights: /app/obico/ml_api/lib/../model/model-weights.darknet, clear = 0 
CUDA status Error: file: ./src/dark_cuda.c: func: get_gpu_compute_capability() line: 619

🙄 please stand by

hydazz commented 1 year ago

were you using the :cuda branch? works for me...

root@Discovery:~# nvidia-smi
Sun Jul 30 14:07:10 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1660 ...    Off | 00000000:01:00.0 Off |                  N/A |
|  0%   49C    P8              18W / 125W |    827MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      4002      C   /usr/bin/python3.7                          824MiB |
+---------------------------------------------------------------------------------------+

[2023-07-30 04:06:13,604: INFO/MainProcess] mingle: all alone
[2023-07-30 04:06:13,609: WARNING/MainProcess] /usr/lib/python3.7/site-packages/celery/fixups/django.py:206: UserWarning: Using settings.DEBUG leads to a memory
            leak, never use this setting in production environments!
  leak, never use this setting in production environments!''')
[2023-07-30 04:06:13,609: INFO/MainProcess] celery@a999403393ec ready.
  23 conv   1024       3 x 3/ 1     13 x  13 x1024 ->   13 x  13 x1024 3.190 BF
  24 conv   1024       3 x 3/ 1     13 x  13 x1024 ->   13 x  13 x1024 3.190 BF
  25 route  16                                     ->   26 x  26 x 512 
  26 conv     64       1 x 1/ 1     26 x  26 x 512 ->   26 x  26 x  64 0.044 BF
  27 reorg                    / 2   26 x  26 x  64 ->   13 x  13 x 256
  28 route  27 24                                  ->   13 x  13 x1280 
  29 conv   1024       3 x 3/ 1     13 x  13 x1280 ->   13 x  13 x1024 3.987 BF
  30 conv     30       1 x 1/ 1     13 x  13 x1024 ->   13 x  13 x  30 0.010 BF
  31 detection
mask_scale: Using default '1.000000'
Total BFLOPS 29.338 
avg_outputs = 607364 
 Allocate additional workspace_size = 131.08 MB 
 Try to load cfg: /app/obico/ml_api/model/model.cfg, weights: /app/obico/ml_api/lib/../model/model-weights.darknet, clear = 0 
net.optimized_memory = 0 
mini_batch = 1, batch = 8, time_steps = 1, train = 0 
Create CUDA-stream - 0 
 Create cudnn-handle 0 
 Try to load weights: /app/obico/ml_api/lib/../model/model-weights.darknet 
Loading weights from /app/obico/ml_api/lib/../model/model-weights.darknet...Done! Loaded 32 layers from weights-file

using this compose for reference:

version: "3"

services:
  obico:
    image: ghcr.io/imagegenius/obico:cuda
    container_name: obico
    env_file: stack.env
    volumes:
      - /mnt/user/appdata/obico:/config
    networks:
      br0.2:
        ipv4_address: 192.168.2.3
    ports:
      - 3334:3334
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    # unraid labels
    labels:
      - net.unraid.docker.webui=http://[IP]:[PORT:3334]
      - net.unraid.docker.icon=https://raw.githubusercontent.com/imagegenius/templates/main/unraid/img/obico.png

networks:
  br0.2:
    external: true
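
For reference, the deploy.resources.reservations.devices block is the Compose GPU reservation syntax, so it needs a reasonably recent docker compose and the NVIDIA Container Toolkit on the host. Bringing it up is the usual:

docker compose up -d
docker compose logs -f obico
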
autumnwalker commented 1 year ago

Yup. (screenshot)

Pulled the latest :cuda. Same issue.

hydazz commented 1 year ago

Yup.

(screenshot)

Pulled the latest :cuda. Same issue.

Does nvidia-smi show python 3.7?

autumnwalker commented 1 year ago

Python 3.8.

(screenshot)

hydazz commented 1 year ago

@autumnwalker is there a newer version of the driver available for your GPU? 525.89.02 is old

"CUDA Error: forward compatibility was attempted on non supported HW" is sticking out to me
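
For comparison, the host driver version (and the highest CUDA version it supports) can be read straight off nvidia-smi on the Docker host; the working box above reports 535.54.03 / CUDA 12.2:

# run on the Docker host, not inside the container
nvidia-smi --query-gpu=driver_version,name --format=csv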

autumnwalker commented 1 year ago

Ahh! That appears to have done the trick. Thank you, and apologies for the wild goose chase.