truenas / apps

Troubleshooting: Fail to smart search (server error) with unclear causes. #989

Closed · Fireflaker closed this issue 1 day ago

Fireflaker commented 1 day ago

Environment: Intel + Nvidia system from 2018, 8 GB physical memory allocated (enough), passphrase-encrypted datasets, CUDA Machine Learning image, over 100k media assets, latest Immich on TrueNAS 24.10.

What happened: About two months ago, after an update, smart search stopped working and never recovered, and the People section remained empty. No apparent errors were found in the logs.

Yesterday I worked through the steps from https://github.com/truenas/apps/issues/942: unlock the DB and Redis datasets, enable Nvidia in the app settings, pass the GPU through to the app, select the GPU, reboot the whole TrueNAS system, unlock all datasets, and restart Immich. The GPU has been successfully attached to the app, and I still see no apparent errors in the logs. I let all ML jobs run overnight and they finished at a smart search speed of roughly 200 ms per image.

This morning the issue persists. I get the "Fail to smart search (server error)" red error message whenever I search for something like "sky", and the People section remains empty. When I went to Administration and clicked "Missing", it appeared to start over from the beginning, with all 100k+ media assets remaining.

I have attempted multiple ML settings, with the latest one shown below (screenshots attached). The RAM usage seems to reflect the CLIP model, which leads me to believe it is loaded properly. The load/unload behavior repeats indefinitely, with the active app memory usage oscillating between roughly 800 MB and 4.6 GB (screenshot attached).

It is also worth noting that there are more queued jobs than total assets (screenshots attached).

Asks: What might be wrong, and what should I do to gather additional debug data? For example, is the CUDA Machine Learning image a bad choice for some reason? The GPU is an old Quadro K620.
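One way to gather more debug data is to confirm, from inside the machine-learning container, that the NVIDIA driver actually exposes the GPU and to see what compute capability it reports. Below is a minimal Python sketch, assuming the `nvidia-ml-py` (pynvml) package is installed and the container can reach the host driver; it is only an illustrative check, not anything Immich ships.

```python
# debug_gpu.py -- minimal sketch: report GPUs visible to the NVIDIA driver.
# Assumes `pip install nvidia-ml-py` and that the container has driver access.
import pynvml

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    print(f"GPUs visible to the driver: {count}")
    for i in range(count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older pynvml versions return bytes
            name = name.decode()
        major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"  [{i}] {name}: compute capability {major}.{minor}, "
              f"{mem.used / 2**20:.0f} MiB used / {mem.total / 2**20:.0f} MiB total")
finally:
    pynvml.nvmlShutdown()
```
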

Fireflaker commented 1 day ago

Inconclusive solution/workaround:

  1. Increase the RAM allocation to make sure the ML log no longer shows the "perhaps out of ram" error.
  2. Use the default ML image instead of CUDA (a quick way to check whether the CUDA provider would even be usable is sketched below).
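For point 2, a quick sanity check is whether the onnxruntime build inside the container even advertises the CUDA execution provider (Immich's ML service runs its models through onnxruntime). A minimal sketch, assuming a Python shell is available inside the ML container:

```python
# Minimal sketch: list the execution providers the installed onnxruntime offers.
# Assumes `import onnxruntime` works inside the ML container.
import onnxruntime as ort

print("onnxruntime device:", ort.get_device())                # 'GPU' or 'CPU'
print("available providers:", ort.get_available_providers())  # e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider']

# Note: even if 'CUDAExecutionProvider' is listed, session creation can still
# fall back to CPU (or fail) when the GPU itself is unsupported, which is what
# turned out to be the case here.
```
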
Fireflaker commented 18 hours ago

The GPU is still not being utilized, even though ML now runs correctly with the default ML image (screenshot attached).

Fireflaker commented 44 minutes ago

Cause identified: the K620 is too old, with CUDA compute capability 5.0; the requirement is 5.2.
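For anyone hitting the same wall, here is a minimal sketch that makes the check explicit. It assumes a Python environment with a CUDA-enabled PyTorch build is available; the 5.2 minimum is the figure quoted above.

```python
# Minimal sketch: compare the GPU's compute capability against the 5.2 minimum
# quoted above. Assumes a CUDA-enabled PyTorch build is installed.
import torch

MIN_CAPABILITY = (5, 2)  # minimum quoted for the CUDA ML image

if not torch.cuda.is_available():
    print("No usable CUDA device visible to PyTorch.")
else:
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    verdict = "OK" if (major, minor) >= MIN_CAPABILITY else "too old for the CUDA image"
    print(f"{name}: compute capability {major}.{minor} -> {verdict}")
```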