Open luxzg opened 1 year ago
Running occ recognize:classify again on the whole server, but that takes hours and hours just to go through the photos that are already indexed and classified.
You should not need to rerun the classify command. Run it once to classify all files, then you can rely on the background jobs to classify changes and new files.
That depends on circumstances. I have added more than 100.000 files in 2 days, most of them directly by copying from external drives to the server (otherwise I'd be uploading for days).
I still have several drives to merge to Nextcloud, so I'm certainly going to have a few more batches with 10.000+ files. Waiting for the background job is a poor option. It runs 100 images every 5 minutes, at best.
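To put a rough number on that, here is a back-of-the-envelope calculation. It assumes the job really does process a fixed 100 images every 5 minutes with no failures or pauses; the function name and its parameters are made up for illustration and are not part of Recognize.

```python
def backlog_hours(files: int, batch_size: int = 100, interval_minutes: float = 5.0) -> float:
    """Hours needed to drain a backlog at one batch per interval (idealized)."""
    batches = -(-files // batch_size)  # ceiling division: a partial batch still costs one run
    return batches * interval_minutes / 60.0


hours = backlog_hours(100_000)
print(f"{hours:.1f} hours, about {hours / 24:.1f} days")
```

For 100.000 files that works out to roughly 83 hours, or about three and a half days, and that is the best case.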
Likewise, as mentioned before, the original "initial scan" would fail every few hours; errors ranged from file locks to corrupted file errors. Then I'd restart it, it would start at the beginning, run a few hours again, then fail, basically making me keep restarting with little benefit because it would always start with "Gorilla.jpg".
I managed to finish the initial batch last night, but it took 10 days to finally finish successfully. And the background job was set up on the first day, but I still had unclassified images in that last scan.
Yes, alternatives would also be not to fail abruptly on file locks, corrupted files, image files with no content (0-byte) and whatever else I've encountered, but I still don't want to wait for days for background jobs to slowly crawl through next terabyte. Btw, I've tried setting job size to 500 photos in background job settings, but it always picks 100. Tried with 250, still only scheduled 100.
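To illustrate what "not failing abruptly" could look like, here is a minimal sketch of a loop that records a problem file and moves on instead of aborting the whole run. This is not how Recognize is implemented; `classify_all` and the `classify_one` callback are hypothetical names used only for this example.

```python
import os
import tempfile
from typing import Callable, Iterable, List, Tuple


def classify_all(paths: Iterable[str], classify_one: Callable[[str], None]) -> List[Tuple[str, str]]:
    """Run classify_one on every path and collect failures instead of aborting."""
    failures: List[Tuple[str, str]] = []
    for path in paths:
        try:
            if os.path.getsize(path) == 0:
                # 0-byte image: nothing to classify, note it and continue
                failures.append((path, "empty file"))
                continue
            classify_one(path)
        except Exception as exc:  # lock errors, corrupt images, missing files, ...
            failures.append((path, f"{type(exc).__name__}: {exc}"))
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.jpg")
        empty = os.path.join(tmp, "empty.jpg")
        missing = os.path.join(tmp, "missing.jpg")  # never created
        with open(good, "wb") as fh:
            fh.write(b"placeholder bytes, not a real JPEG")
        open(empty, "wb").close()
        # The lambda stands in for the real classifier.
        for path, reason in classify_all([good, empty, missing], lambda p: None):
            print(os.path.basename(path), "->", reason)
```

The failures list can be printed or logged at the end, so one bad file costs a log line rather than a restart.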
So it's either fixing 10 different things in several different places, or adding a --path option. I'd be ok with --path as I can (and do) always put new photos in a single new directory, get them sorted, indexed, generate thumbnails, remove dupes, and so on, and all those tasks support "--path" (or are fast tasks that don't matter if they run 2 minutes or 15 minutes). Only recognize:classify has no such option, is slow to re-run, and is even slower to wait for its "natural progression" to go through in the background.
If you'd prefer, I can go through my putty logs; some of the errors that crashed classify are probably in there somewhere (though I've also run it directly on the physical server's console later, as I didn't want my SSH to keep running for days, so I don't have all of the crashes logged). If there's another log where those might still show up in the NC instance or app logs, I'd be happy to forward them. I don't mind waiting a whole night for re-classification to go through, if I know it won't crash again (and again, and again, for several nights and days).
If you'd prefer, I can go through my putty logs, some of the errors that crashed classify are probably in there somewhere
That would be helpful. Please open new issues (or comment on existing issues) for these. The classify command is supposed to be a stopgap for the automatic classification jobs.
Btw, I've tried setting job size to 500 photos in background job settings, but it always picks 100. Tried with 250, still only scheduled 100.
Can you open an issue for this?
I was going to file for a similar option, but is there a way (when troubleshooting a stuck background job) to force the background process to pick up untagged files after clearing the background jobs? If there isn't, I would like the recrawl option to enable us to ignore already tagged files as well.
@phirestalker is there a way (when troubleshooting a stuck background job) to force the background process to pick up untagged files after clearing the background jobs?
There may be a case to be made that files that have been tagged with recognize's admin-level "Tagged by recognize" tag should not be reprocessed by subsequent re-crawl or classify runs. We did have this at some point... I can't make up my mind right now if this is worth having or not :D Could you open a new issue for this, please? :)
"Tagged by recognize" tag should not be reprocessed by subsequent re-crawl or classify runs.
I agree, this would be very nice. For me there was an error with the background task and it did not restart. Currently I'm trying to run the classify command and it would be nice to know that even if something fails I can just start it again without redoing everything.
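A small sketch of the behaviour being asked for here: skip anything that already carries the tag, unless a re-run is explicitly forced. It assumes the set of already tagged file ids is available somehow; `files_to_process` and the `force` flag are hypothetical and do not exist in Recognize.

```python
from typing import Iterable, Iterator, Set


def files_to_process(all_ids: Iterable[int], already_tagged: Set[int], force: bool = False) -> Iterator[int]:
    """Yield file ids that still need classification; with force=True yield every id."""
    for file_id in all_ids:
        if force or file_id not in already_tagged:
            yield file_id


tagged = {1, 2}  # ids that were classified before the run was interrupted
for file_id in files_to_process([1, 2, 3, 4], tagged):
    # classification would happen here; mark the file as done afterwards
    tagged.add(file_id)
print(sorted(tagged))
```

With something like this, restarting after a crash only costs the files that were not finished yet.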
I agree that adding some kind of --path would be nice.
I have finally finished scanning all my ~100 000 images (it took several days), and am mostly happy with the result. However, for some images it is missing some faces. E.g. there are images with 3 distinct faces, but only two were found on the first run.
It would be nice to try a re-run of the images in question to try to get Recognize to discover the missing faces.
This could probably be done quite easily by just adding the file_id to oc_recognize_queue_faces if I understand this correctly?
It probably needs to check what kind of tagging/recognition is enabled and add the file to the appropriate queues, but that seems like a doable task.
If so, I can have a look at creating a PR.
Looking at this again, the process might be something like this to make it work for all kinds of objects, not just faces:
- occ recognize:reset-tags command. https://github.com/nextcloud/recognize/blob/fb268266f9fb932a7f4ed8d3ddb16ca8a1651ffd/lib/Command/ResetTags.php#L40
- resetClassifications() https://github.com/nextcloud/recognize/blob/fb268266f9fb932a7f4ed8d3ddb16ca8a1651ffd/lib/Service/TagManager.php#L122
- fileIds as an array with only items matching the path (a single file or a subdir) instead of the full list of classified files. https://github.com/nextcloud/recognize/blob/fb268266f9fb932a7f4ed8d3ddb16ca8a1651ffd/lib/Service/TagManager.php#L136
- getFilesFromPath similar to getFilesInMount() that gets the metadata for a list of files https://github.com/nextcloud/recognize/blob/fb268266f9fb932a7f4ed8d3ddb16ca8a1651ffd/lib/Service/StorageService.php#L101
With this approach this will work both to reclassify already classified files, and you can explicitly add a single path to the queue.
It would also be helpful to have this as a UI option. I uploaded my photo library to my Raspberry Pi, but I think recognize is going through all my filesystem's photos, and not having this will make it take way longer than I want it to.
Either way, thank you for developing and releasing this app! I'm enjoying looking back at the memories for the photos that have been scanned so far!
Describe the feature you'd like to request
Ability to limit occ recognize:classify to a single path.
Describe the solution you'd like
Something like other apps have :
occ recognize:classify --path="/username/files/MyFolder/"
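One way to read the requested --path semantics is "the given file itself, or anything below the given directory". A minimal sketch of that filter follows, assuming file paths are plain strings with "/" separators; `under_path` is a hypothetical helper for illustration, not an existing Recognize or Nextcloud function.

```python
from typing import Iterable, List


def under_path(file_paths: Iterable[str], root: str) -> List[str]:
    """Return the paths that are equal to root or located somewhere below it."""
    root = root.rstrip("/")
    prefix = root + "/"  # trailing slash keeps "MyFolder2" from matching "MyFolder"
    return [p for p in file_paths if p == root or p.startswith(prefix)]


paths = [
    "/username/files/MyFolder/a.jpg",
    "/username/files/MyFolder/sub/b.jpg",
    "/username/files/MyFolder2/c.jpg",
    "/username/files/Other/d.jpg",
]
print(under_path(paths, "/username/files/MyFolder/"))
```

Only the two files under MyFolder are kept; MyFolder2 is excluded even though it shares the same string prefix.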
Describe alternatives you've considered
Running occ recognize:classify again on the whole server, but that takes hours and hours just to go through the photos that are already indexed and classified.
Alternative could be something like
occ recognize:classify --new
which would skip all the files that have already been classified. (or making this default behavior for occ recognize:classify and adding something like occ recognize:classify --force to force re-scan/re-index/re-classification of whole server)
I do realize this app is now on the "limited effort" maintenance, but hopefully someone picks up the task eventually...