nextcloud / recognize

👁 👂 Smart media tagging for Nextcloud: recognizes faces, objects, landscapes, music genres
https://apps.nextcloud.com/apps/recognize
GNU Affero General Public License v3.0

Movinet fails in GPU mode #1122

Open bugsyb opened 4 months ago

bugsyb commented 4 months ago

Which version of recognize are you using?

6.1.1

Enabled Modes

Object recognition, Face recognition, Video recognition, Music recognition

TensorFlow mode

GPU mode

Downstream App

Memories App

Which Nextcloud version do you have installed?

28.0.4.1

Which Operating system do you have installed?

Ubuntu 20.04.4

Which database are you running Nextcloud on?

Postgres

Which Docker container are you using to run Nextcloud? (if applicable)

28.0.4.1

How much RAM does your server have?

32 GB

What processor Architecture does your CPU have?

x86_64

Describe the Bug

After upgrading to NC 28.0.4.1 and Recognize 6.1.1, it started reporting the error below. Judging by nvidia-smi, the classifier process launches and stays there, but it doesn't seem to put any load on the GPU.

Classifier process output: Error: Session fail to run with error: 2 root error(s) found.
  (0) NOT_FOUND: could not find registered platform with id: 0x7fd379c7fae4
     [[{{node movinet_classifier/movinet/stem/stem/conv3d/StatefulPartitionedCall}}]]
     [[StatefulPartitionedCall/_1555]]
  (1) NOT_FOUND: could not find registered platform with id: 0x7fd379c7fae4
     [[{{node movinet_classifier/movinet/stem/stem/conv3d/StatefulPartitionedCall}}]]
0 successful operations.
0 derived errors ignored.
    at NodeJSKernelBackend.runSavedModel (/var/www/html/custom_apps/recognize/node_modules/@tensorflow/tfjs-node-gpu/dist/nodejs_kernel_backend.js:461:43)
    at TFSavedModel.predict (/var/www/html/custom_apps/recognize/node_modules/@tensorflow/tfjs-node-gpu/dist/saved_model.js:341:43)
    at MovinetModel.predict (/var/www/html/custom_apps/recognize/src/movinet/MovinetModel.js:46:21)
    at /var/www/html/custom_apps/recognize/src/movinet/MovinetModel.js:95:24
    at /var/www/html/custom_apps/recognize/node_modules/@tensorflow/tfjs-core/dist/tf-core.node.js:4559:22
    at Engine.scopedRun (/var/www/html/custom_apps/recognize/node_modules/@tensorflow/tfjs-core/dist/tf-core.node.js:4569:23)
    at Engine.tidy (/var/www/html/custom_apps/recognize/node_modules/@tensorflow/tfjs-core/dist/tf-core.node.js:4558:21)
    at Object.tidy (/var/www/html/custom_apps/recognize/node_modules/@tensorflow/tfjs-core/dist/tf-core.node.js:8291:19)
    at MovinetModel.inference (/var/www/html/custom_apps/recognize/src/movinet/MovinetModel.js:92:21)
    at runMicrotasks (<anonymous>)
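
For what it's worth, a standalone predict against the same SavedModel might show whether the failure is specific to Recognize's pipeline or reproducible with tfjs-node-gpu alone. A minimal sketch, assuming the model path from the du listing further below; the input shape and the repro.js filename are illustrative guesses, not Recognize's actual invocation:

// repro.js, run with ./bin/node repro.js from the recognize app directory
const tf = require('@tensorflow/tfjs-node-gpu');

async function main() {
    // Model path assumed from the models directory listed further below.
    const model = await tf.node.loadSavedModel('./models/movinet-a3');
    // Dummy clip; the [batch, frames, height, width, channels] shape is an assumption.
    const input = tf.zeros([1, 8, 224, 224, 3]);
    console.log(model.predict(input));
}

main().catch(console.error);

If the same NOT_FOUND error shows up here, it points at the tfjs-node-gpu build / CUDA combination rather than at Recognize's code.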

At the same time, everything it needs seems to be in place (by the way, this error did not occur before the upgrade):

./bin/node src/test_gputensorflow.js 
2024-04-07 21:37:06.584377: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-07 21:37:06.593729: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-04-07 21:37:06.636405: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-04-07 21:37:06.636674: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-04-07 21:37:07.109813: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-04-07 21:37:07.110064: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-04-07 21:37:07.110196: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-04-07 21:37:07.110354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3393 MB memory:  -> device: 0, name: Quadro M1200, pci bus id: 0000:01:00.0, compute capability: 5.0
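
Since that test passes, it may also be worth confirming that tfjs-node-gpu actually selects the native TensorFlow backend at runtime rather than silently falling back. A small sketch (the check_backend.js filename is made up for illustration):

// check_backend.js, run with ./bin/node check_backend.js
const tf = require('@tensorflow/tfjs-node-gpu');

tf.ready().then(() => {
    // With tfjs-node-gpu the active backend should report "tensorflow".
    console.log('active backend:', tf.getBackend());
    // A small matmul forces a real kernel execution on the selected backend.
    tf.matMul(tf.ones([2, 2]), tf.ones([2, 2])).print();
});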

And models seem to be there too:

du -shx ./models/*
50M     ./models/efficientnet_lite4
794M    ./models/efficientnetv2
22M     ./models/landmarks_africa
41M     ./models/landmarks_asia
41M     ./models/landmarks_europe
41M     ./models/landmarks_north_america
31M     ./models/landmarks_oceania
41M     ./models/landmarks_south_america
47M     ./models/movinet-a3
31M     ./models/musicnn
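
If a Python TensorFlow installation is available somewhere, the movinet SavedModel can also be inspected directly to confirm its signatures still load outside of tfjs (saved_model_cli ships with the Python tensorflow package):

saved_model_cli show --dir ./models/movinet-a3 --all

If that command lists the serving signatures without errors, the model files themselves are probably intact and the problem more likely sits in the tfjs-node-gpu runtime.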

I'm unsure where to look for the issue. The error occurs after ffmpeg finishes its job.

Thanks!

Expected Behavior

Video classification would just proceed normally.

To Reproduce

Unsure.

Debug log

No response

github-actions[bot] commented 4 months ago

Hello :wave:

Thank you for taking the time to open this issue with recognize. I know it's frustrating when software causes problems. You have made the right choice to come here and open an issue to make sure your problem gets looked at and if possible solved.

I try to answer all issues and if possible fix all bugs here, but it sometimes takes a while until I get to it. Until then, please be patient.

Note also that GitHub is a place where people meet to make software better together. Nobody here is under any obligation to help you, solve your problems or deliver on any expectations or demands you may have, but if enough people come together we can collaborate to make this software better. For everyone. Thus, if you can, you could also look at other issues to see whether you can help other people with your knowledge and experience. If you have coding experience it would also be awesome if you could step up to dive into the code and try to fix the odd bug yourself. Everyone will be thankful for extra helping hands!

One last word: If you feel, at any point, like you need to vent, this is not the place for it; you can go to the forum, to twitter or somewhere else. But this is a technical issue tracker, so please make sure to focus on the tech and keep your opinions to yourself. (Also see our Code of Conduct. Really.)

I look forward to working with you on this issue. Cheers :blue_heart:

bugsyb commented 4 months ago

Could be related to: https://github.com/nextcloud/recognize/issues/1060

In my case it is CUDA 11.6:

# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Thu_Feb_10_18:23:41_PST_2022
Cuda compilation tools, release 11.6, V11.6.112
Build cuda_11.6.r11.6/compiler.30978841_0
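
If this is the same CUDA version mismatch as in #1060, checking which CUDA libraries the tfjs native binding actually resolves at load time could narrow it down. A sketch, assuming the default install layout (the napi-v8 directory name varies with the Node ABI version):

ldd custom_apps/recognize/node_modules/@tensorflow/tfjs-node-gpu/lib/napi-v8/tfjs_binding.node | grep -iE 'cuda|cudnn'

Any "not found" entries, or libraries resolving to a different CUDA toolkit than the one nvcc reports, would point at a toolkit/runtime mismatch rather than at Recognize itself.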