nextcloud / recognize

πŸ‘ πŸ‘‚ Smart media tagging for Nextcloud: recognizes faces, objects, landscapes, music genres
https://apps.nextcloud.com/apps/recognize
GNU Affero General Public License v3.0
558 stars 46 forks source link

Face recognition blocked for 1 month: Large batch size causes job to be stopped during preview generation #967

Closed EVOTk closed 11 months ago

EVOTk commented 1 year ago

Which version of recognize are you using?

4.3.2

Enabled Modes

Face recognition

TensorFlow mode

WASM mode

Downstream App

Memories App

Which Nextcloud version do you have installed?

27.0.2 27.1.0

Which Operating system do you have installed?

Debian 12

Which database are you running Nextcloud on?

MariaDB 10.11.4 10.11.5

Which Docker container are you using to run Nextcloud? (if applicable)

Linuxserver 27.0.2 27.1.0 PHP 8.2.10

How much RAM does your server have?

32 GB

What processor Architecture does your CPU have?

x86_64

Describe the Bug

For the past 1 month, I've had no new face detection on the photos I've added.

the process appears to be "blocked"

image "Last classification: 06/08/2023"

occ recognize:recrawl or occ recognize:classify not resolve issue

I've also tried "clear the classifier queues and clear all background jobs" and relaunching a complete classification, but it always comes back to the same thing.

I didn't run occ recognize:reset-face-clusters and occ recognize:reset-faces because I don't want to lose the current classification.

Thx for your help

Expected Behavior

Face detection on new photos

To Reproduce

  1. Upload new photo on Nextcloud
  2. Faces are never detected

Debug log

No response

EVOTk commented 1 year ago

image

I've added a few photos. But the process remains at the same stage, even though the jobs in the background are running.

marcelklehr commented 1 year ago

Do you have more than 120 new faces? You can check your database with the following query SELECT cluster_id, COUNT(id) from oc_recognize_face_detections GROUP_BY cluster_id;

EVOTk commented 1 year ago

Thank you for your reply. I use adminer to manage my database.

The command returns an error, should I execute it differently?

image

Reading other issues, I have seen other commands that can help diagnosis, but I do not know what they correspond to:

image

image

image

Thx

EVOTk commented 1 year ago

Hello, I'm add ~100 photos. Same issue.

image


Values for of commands have not changed

SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id IS NULL

COUNT(*)
4122

SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id = -1

COUNT(*)
6196

SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id > 0

COUNT(*)
19202
craiq commented 1 year ago

Hello I've the same problem with different architecture Ubuntu lxc container on proxmox

ok 200 OK 27.0.2.1 yes yes \OC\Memcache\APCu \OC\Memcache\Redis yes \OC\Memcache\Redis no 3634111709184 0.36474609375 0.4296875 0.42724609375 6291456 3680256 6291456 6027264 57 0 6 2356869 23 1 8 14 21 3 0 18 0 0 17 0 0 1 7 2 2 2 7 nginx/1.25.2 8.2.10 1073741824 3600 10737418240 1 136150120 132285336 0 0 67108864 22731392 44377472 92898 2470 4724 130987 79025987 1693939639 0 0 0 0 2519 0 0 99.996812542553 1 1 5 5 6 268435440 267399360 4099 0 12801907 7802 11663 1009 0 1693939639 478816 mmap 1 33554312 33006456 Core date libxml openssl pcre zlib filter hash json random Reflection SPL session standard sodium cgi-fcgi mysqlnd PDO xml apcu bcmath bz2 calendar ctype curl dom mbstring FFI fileinfo ftp gd gettext gmp iconv igbinary imagick intl exif mysqli pdo_mysql Phar posix readline redis shmop SimpleXML sockets sysvmsg sysvsem sysvshm tokenizer xmlreader xmlwriter xsl zip Zend OPcache mysql 10.6.14 1499897856 2 2 4
This XML file does not appear to have any style information associated with it. The document tree is shown below.
<ocs>
<meta>
<status>ok</status>
<statuscode>200</statuscode>
<message>OK</message>
...
</meta>
<data>
<nextcloud>
<system>
<version>27.0.2.1</version>
<theme/>
<enable_avatars>yes</enable_avatars>
<enable_previews>yes</enable_previews>
<memcache.local>\OC\Memcache\APCu</memcache.local>
<memcache.distributed>\OC\Memcache\Redis</memcache.distributed>
<filelocking.enabled>yes</filelocking.enabled>
<memcache.locking>\OC\Memcache\Redis</memcache.locking>
<debug>no</debug>
<freespace>3634111709184</freespace>
<cpuload>
<element>0.36474609375</element>
<element>0.4296875</element>
<element>0.42724609375</element>
...
</cpuload>
<mem_total>6291456</mem_total>
<mem_free>3680256</mem_free>
<swap_total>6291456</swap_total>
<swap_free>6027264</swap_free>
<apps>
<num_installed>57</num_installed>
<num_updates_available>0</num_updates_available>
<app_updates/>
...
</apps>
...
</system>
<storage>
<num_users>6</num_users>
<num_files>2356869</num_files>
<num_storages>23</num_storages>
<num_storages_local>1</num_storages_local>
<num_storages_home>8</num_storages_home>
<num_storages_other>14</num_storages_other>
...
</storage>
<shares>
<num_shares>21</num_shares>
<num_shares_user>3</num_shares_user>
<num_shares_groups>0</num_shares_groups>
<num_shares_link>18</num_shares_link>
<num_shares_mail>0</num_shares_mail>
<num_shares_room>0</num_shares_room>
<num_shares_link_no_password>17</num_shares_link_no_password>
<num_fed_shares_sent>0</num_fed_shares_sent>
<num_fed_shares_received>0</num_fed_shares_received>
<permissions_0_15>1</permissions_0_15>
<permissions_3_17>7</permissions_3_17>
<permissions_3_19>2</permissions_3_19>
<permissions_3_21>2</permissions_3_21>
<permissions_0_31>2</permissions_0_31>
<permissions_3_31>7</permissions_3_31>
...
</shares>
...
</nextcloud>
<server>
<webserver>nginx/1.25.2</webserver>
<php>
<version>8.2.10</version>
<memory_limit>1073741824</memory_limit>
<max_execution_time>3600</max_execution_time>
<upload_max_filesize>10737418240</upload_max_filesize>
<opcache>
<opcache_enabled>1</opcache_enabled>
<cache_full/>
<restart_pending/>
<restart_in_progress/>
<memory_usage>
<used_memory>136150120</used_memory>
<free_memory>132285336</free_memory>
<wasted_memory>0</wasted_memory>
<current_wasted_percentage>0</current_wasted_percentage>
...
</memory_usage>
<interned_strings_usage>
<buffer_size>67108864</buffer_size>
<used_memory>22731392</used_memory>
<free_memory>44377472</free_memory>
<number_of_strings>92898</number_of_strings>
...
</interned_strings_usage>
<opcache_statistics>
<num_cached_scripts>2470</num_cached_scripts>
<num_cached_keys>4724</num_cached_keys>
<max_cached_keys>130987</max_cached_keys>
<hits>79025987</hits>
<start_time>1693939639</start_time>
<last_restart_time>0</last_restart_time>
<oom_restarts>0</oom_restarts>
<hash_restarts>0</hash_restarts>
<manual_restarts>0</manual_restarts>
<misses>2519</misses>
<blacklist_misses>0</blacklist_misses>
<blacklist_miss_ratio>0</blacklist_miss_ratio>
<opcache_hit_rate>99.996812542553</opcache_hit_rate>
...
</opcache_statistics>
<jit>
<enabled>1</enabled>
<on>1</on>
<kind>5</kind>
<opt_level>5</opt_level>
<opt_flags>6</opt_flags>
<buffer_size>268435440</buffer_size>
<buffer_free>267399360</buffer_free>
...
</jit>
...
</opcache>
<apcu>
<cache>
<num_slots>4099</num_slots>
<ttl>0</ttl>
<num_hits>12801907</num_hits>
<num_misses>7802</num_misses>
<num_inserts>11663</num_inserts>
<num_entries>1009</num_entries>
<expunges>0</expunges>
<start_time>1693939639</start_time>
<mem_size>478816</mem_size>
<memory_type>mmap</memory_type>
...
</cache>
<sma>
<num_seg>1</num_seg>
<seg_size>33554312</seg_size>
<avail_mem>33006456</avail_mem>
...
</sma>
...
</apcu>
<extensions>
<element>Core</element>
<element>date</element>
<element>libxml</element>
<element>openssl</element>
<element>pcre</element>
<element>zlib</element>
<element>filter</element>
<element>hash</element>
<element>json</element>
<element>random</element>
<element>Reflection</element>
<element>SPL</element>
<element>session</element>
<element>standard</element>
<element>sodium</element>
<element>cgi-fcgi</element>
<element>mysqlnd</element>
<element>PDO</element>
<element>xml</element>
<element>apcu</element>
<element>bcmath</element>
<element>bz2</element>
<element>calendar</element>
<element>ctype</element>
<element>curl</element>
<element>dom</element>
<element>mbstring</element>
<element>FFI</element>
<element>fileinfo</element>
<element>ftp</element>
<element>gd</element>
<element>gettext</element>
<element>gmp</element>
<element>iconv</element>
<element>igbinary</element>
<element>imagick</element>
<element>intl</element>
<element>exif</element>
<element>mysqli</element>
<element>pdo_mysql</element>
<element>Phar</element>
<element>posix</element>
<element>readline</element>
<element>redis</element>
<element>shmop</element>
<element>SimpleXML</element>
<element>sockets</element>
<element>sysvmsg</element>
<element>sysvsem</element>
<element>sysvshm</element>
<element>tokenizer</element>
<element>xmlreader</element>
<element>xmlwriter</element>
<element>xsl</element>
<element>zip</element>
<element>Zend OPcache</element>
...
</extensions>
...
</php>
<database>
<type>mysql</type>
<version>10.6.14</version>
<size>1499897856</size>
...
</database>
...
</server>
<activeUsers>
<last5minutes>2</last5minutes>
<last1hour>2</last1hour>
<last24hours>4</last24hours>
...
</activeUsers>
...
</data>
...
</ocs>

I did the complete face reset and there be sure that new faces are recognized

Screenshot_20230907_144703_Chrome

The newly detected faces are not shown in the memories app only in Fotos.

illnesse commented 1 year ago

And here i thought, hey maybe i should install recognize again? Let's check github issues if this app is usable now!

Nope, same problems like a year ago

EVOTk commented 1 year ago

And here i thought, hey maybe i should install recognize again? Let's check github issues if this app is usable now!

Nope, same problems like a year ago

Could you please delete your comment? I don't see the need to post it here, if you also have a problem, I think it would be better to detail it, either in this issue if it's similar, or in a new ticket.

illnesse commented 1 year ago

Could you please delete your comment? I don't see the need to post it here, if you also have a problem, I think it would be better to detail it, either in this issue if it's similar, or in a new ticket.

Naw thanks, i opened like 20 issues here already

EVOTk commented 1 year ago

image

SQL command values are always identical to the last time. https://github.com/nextcloud/recognize/issues/967#issuecomment-1707022983

marcelklehr commented 1 year ago

So, the issue is that face recognition is stuck. Can you check if there are jobs in the oc_jobs table for recognize and maybe post the list? (something like SELECT * from oc_jobs where app = 'recognize';)

EVOTk commented 1 year ago

So, the issue is that face recognition is stuck. Can you check if there are jobs in the oc_jobs table for recognize and maybe post the list? (something like SELECT * from oc_jobs where app = 'recognize';)

Thx for your reply

image image image

marcelklehr commented 1 year ago

Ah, sorry about the wrong query. It looks like the job is intact and should be running, last_run indicates 'Sun Sep 17 04:58:21 2023 UTC'. Can you find any errors in the nextcloud log related to recognize (try something like cat data/nextcloud.log | grep recognize)?

EVOTk commented 1 year ago

Here are the attached logs, thank you

log_recognize.log


I see errors related to "Welcome App" . An update of Welcome is available since 1h ( 1.0.10 ) , I have just made the update. https://apps.nextcloud.com/apps/welcome https://github.com/nextcloud/welcome/blob/main/CHANGELOG.md image Are you thinking of a conflict with Welcome App?

marcelklehr commented 1 year ago

Are you thinking of a conflict with Welcome App?

Nah, I think that's unrelated. It's also just warnings. We can ignore those.

I think the reason for this is that we limit the run time of the job to 5 minutes, but by the time the 5minutes are over it still hasn't finished preparing the previews for classification. :/

EVOTk commented 1 year ago

okay, do you think I should decrease the value of "The number of files to process per job run"? I'm currently at 50, because I'm in WASM mode.

marcelklehr commented 1 year ago

If anything, try reducing it, but I will publish a new release soon with a proper fix that will hopefully solve this for you

EVOTk commented 1 year ago

okay !, I've just reduced it to "5" to see if there's any change there.

Thanks, I'll be waiting with impatience for this new version :)

marcelklehr commented 1 year ago

Do let me know if the reduced value changes things, to confirm that I've indentified the bug :pray:

EVOTk commented 1 year ago

Yes yes, of course, I'll come back here to tell you if it works or not.

craig0990 commented 1 year ago

Chiming in to say that I've had the same issue, and this seems to have fixed it πŸŽ‰ (I think?)

Lowered (roughly halving the defaults) the number of files to process and it appears to be working well over the past 24h with "Schedule jobs" now having values some of the time and "Last background job execution" showing values for some πŸ™ƒ

The oc_jobs table had several entries, but the last_run was 0 for all but the top one, and last_checked for the same were "old" timestamps (was just poking around, didn't keep screenshots/logs, might be mis-remembering the exact column).

Whereas now select * from oc_jobs where class like '%Recog%'; seems to show turnover with the jobs clearing themselves and setting up new ones?

I ended up with a large backlog (don't have original numbers, but "Object recognition" has 22,740 queued files right now) because I disabled the app for a few months while I worked out better hardware.

Cheers πŸ™πŸ»

(Edit: Had to drop object/landmark to ~25 since that seems to re-occur once the face recognition queue clears; and this is in native Tensorflow mode, not WASM mode like the parent)

marcelklehr commented 1 year ago

Great! Thanks for letting me know. I'll soon release v5 of recognize which will fix this permanently, so you don't have to reduce the batch size anymore

EVOTk commented 1 year ago

image

Reducing the number of "files to process per job run" solved my problem. New faces are identified! Thanks @marcelklehr

EVOTk commented 1 year ago

Hello @marcelklehr

After lowering the value, and doing a full indexation. New faces have appeared.

however, it's blocked again :( I lowered the value again, but it doesn't change anything.

image

marcelklehr commented 1 year ago

Mmmh. Did it resolve in the meantime? Sometimes a 12h wait fixes things, if not, can you share the jobs table again?

EVOTk commented 1 year ago

Hello

Unfortunately, it's still blocked.

image

image

SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id IS NULL

COUNT(*)
4134

SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id = -1

COUNT(*)
6020

SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id > 0

COUNT(*)
19440
marcelklehr commented 1 year ago

Thank you for providing the information! At least it seems like the job still exists and runs every now and then for ~23s. Can you check the nextcloud log for errors?

EVOTk commented 1 year ago

I'm edit sorry @marcelklehr , wrong command

see attached file for recognize log recognize.txt

marcelklehr commented 1 year ago

Mmh, seems like it's still hanging because of the preview generation taking too long :/ Can you try reducing the batch size to 2 (I know this is suboptimal :D )

EVOTk commented 1 year ago

Thx, i'm try

EVOTk commented 1 year ago

Mmh, seems like it's still hanging because of the preview generation taking too long :/ Can you try reducing the batch size to 2 (I know this is suboptimal :D )

image

That doesn't seem to have solved the problem :(

SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id IS NULL

COUNT(*)
4134

SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id = -1

COUNT(*)
6020

SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id > 0

COUNT(*)
19440
marcelklehr commented 1 year ago

Can you try the latest release (v5). It should hopefully finally fix this :)

ahurlbatt commented 1 year ago

My setup has the same symptom of not clearing queues for classifiers, even with v5.0.1 installed. Turning on debug logging shows each of the background jobs running, but then being "stalled", with a log entry like this (one for each background job):

{
    "reqId": "Cm4um9yQKwppeCKr7gFX",
    "level": 0,
    "time": "2023-10-11T19:15:41+00:00",
    "remoteAddr": "",
    "user": "--",
    "app": "recognize",
    "method": "",
    "url": "--",
    "message": "Stalling job OCA\\Recognize\\BackgroundJobs\\ClassifyFacesJob with argument array (\n  'storageId' => 1,\n  'rootId' => 1,\n) because other classifiers are already reserved",
    "userAgent": "--",
    "version": "27.1.1.0",
    "data": {
        "app": "recognize"
    }
}
EVOTk commented 1 year ago

Can you try the latest release (v5). It should hopefully finally fix this :)

Thanks, I've updated and restarted a full indexing. It will take some time, I'll come back to update then!

ahurlbatt commented 1 year ago

My setup has ... each of the background jobs running, but then being "stalled" ...

After seeing where this log statement comes from in the code, I turned on "unlimited concurrency" and now it's chugging away through the queue! It also isn't choking on the jobs, which is the image that "unlimited concurrency" gives me.

aaronm120 commented 1 year ago

Just wanted to check if there were any other updates on the fix. Still encountering the issue on 5.0.1

image

marcelklehr commented 1 year ago

@aaronm120 Can you send your nextcloug log ideally with loglevel set to debug?

aaronm120 commented 1 year ago

@marcelklehr

Not sure what that "stalling" is, looks like the same issue that others had above

image

marcelklehr commented 1 year ago

@aaronm120 Aah, I think I know what's going on... :see_no_evil:

marcelklehr commented 1 year ago

@aaronm120 Can you try updating to see if that fixes it?

aaronm120 commented 1 year ago

@marcelklehr Looks like it is now resolved with 5.0.3. Thanks for the fix

marcelklehr commented 1 year ago

Closing for now. Please shout here if this is still happening for you :v:

EVOTk commented 1 year ago

For my side, I've restarted indexing, but the system is still blocked.

marcelklehr commented 1 year ago

@EVOTk Which version are you on now?

EVOTk commented 1 year ago

recognize.txt Hello, NC 27.1.2 Recognize 5.0.3

I cleaned up the jobs several days ago, and restarted a complete indexing ( over 40k files ). It takes a long time. Now it seems to be blocked again, unfortunately.

image

-- SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id IS NULL

COUNT(*)
4134

-- SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id = -1

COUNT(*)
6009

-- SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id > 0

COUNT(*)
19389
marcelklehr commented 1 year ago

Can you up the number of files to process per run?

EVOTk commented 1 year ago

Would you like set to 50 as originally proposed?

marcelklehr commented 1 year ago

We can do 500 as recommended in the settings, now

EVOTk commented 1 year ago

Okay, i'm use WASM mode

marcelklehr commented 1 year ago

Ah, right. then 50

EVOTk commented 1 year ago

image

I'm still stuck :( What can I do to help me? Thx