[Feature] Join "Scan" and "Identify" tasks, preferably schedulable.

SlorkMan commented 2 years ago

Is your feature request related to a problem? Please describe.

It would help (in my use case) if the "Scan" and "Identify" tasks could be joined into one "set it and forget it" button (after setting my preferred defaults, of course).

I use RSS feeds / PowerShell automation to add to my Stash. It would be very handy if I could trigger or schedule a Stash "Scan" and "Identify" job.

Describe the solution you'd like

A way to trigger / schedule a "Scan" and "Identify" task.

SmallCoccinelle commented 2 years ago

https://github.com/stashapp/stash/discussions/1965 Discussion here with one particular proposal for this.

Hasn't really been worked on in earnest, but we are slowly getting the scaffold in place to do something that would support a request such as this. The Discussion also links to other relevant issues.

bnkai commented 2 years ago

For scripting purposes, isn't the GraphQL API enough? https://github.com/stashapp/stash/wiki/API has an example of how to trigger the scan using curl. Identify is similar; you would just need to use the metadataIdentify mutation instead. The whole API documentation is also available in the playground at http://localhost:9999/playground (assuming localhost:9999 is where Stash is running).
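As a rough illustration of the scripting route, here is a minimal Python sketch that builds the two requests. The mutation names come from this thread and the wiki; the input type names, the `paths` and `sources` fields, the `stash_box_endpoint` key, and the `ApiKey` header are assumptions that should be checked against the playground for your Stash version.

```python
import json
import urllib.request

# Assumption: Stash is reachable here, as in the comment above.
GRAPHQL_URL = "http://localhost:9999/graphql"

# Assumption: input type names and fields follow the schema shown in the
# playground; verify them there before relying on this.
SCAN_MUTATION = "mutation($input: ScanMetadataInput!) { metadataScan(input: $input) }"
IDENTIFY_MUTATION = (
    "mutation($input: IdentifyMetadataInput!) { metadataIdentify(input: $input) }"
)


def build_request(query, variables, api_key=None):
    """Build (headers, body) for a GraphQL POST without sending anything."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["ApiKey"] = api_key  # assumed header name for API-key auth
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return headers, body


def scan_request(paths, api_key=None):
    """Request a scan limited to the given paths."""
    return build_request(SCAN_MUTATION, {"input": {"paths": paths}}, api_key)


def identify_request(paths, stash_box_endpoint, api_key=None):
    """Request an identify run against one stash-box source (assumed shape)."""
    source = {"source": {"stash_box_endpoint": stash_box_endpoint}}
    variables = {"input": {"paths": paths, "sources": [source]}}
    return build_request(IDENTIFY_MUTATION, variables, api_key)


def send(headers, body, url=GRAPHQL_URL):
    """POST a prepared request and return the decoded JSON response."""
    req = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `send(*scan_request(["/data/new"]))` would queue the scan; building and sending are kept separate so the payload can be inspected first.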

kermieisinthehouse commented 2 years ago

Maybe a checkbox for "identify new scenes?" Would we show a new modal for it or just use default settings?

SlorkMan commented 2 years ago

> Maybe a checkbox for "identify new scenes?" Would we show a new modal for it or just use default settings?

That would be a huge help. I now have to make a sort of queue of scan, autotag, and identify, selecting the same folder three times.

Bazzu85 commented 2 years ago

Hi, any news about this? It would be a huge help to be able to scan a file and identify it in StashDB immediately. It would also be a great feature for long initial scans.

DogmaDragon commented 1 year ago

Already possible with GraphQL API and using cron/task scheduler for scheduling. Maybe better documentation on how to take advantage of it would be useful.

Baking it into the UI would amplify the issues with Identify task. Automation is useful, but solely relying on it leads to inaccurate results. I see it as more of an advanced feature.

KyleSanderson commented 1 year ago

> Already possible with GraphQL API and using cron/task scheduler for scheduling. Maybe better documentation on how to take advantage of it would be useful.

Maybe, but it sounds like that's still not native to the application.

> Baking it into the UI would amplify the issues with Identify task.

What? How is that relevant to the issue? It's a bit crazy that you need to "scan", "import", and "tag" as separate tasks to get anything usable.

> Automation is useful, but solely relying on it leads to inaccurate results. I see it as more of an advanced feature.

Certainly not advanced at all, and is how every other library application has worked since the 90s... 😸. If anything you're suggesting bucking the trend by requiring manual intervention on every single usage of the application to review new media. That's completely wild.

A ton of effort has clearly gone into this application - inotify is typically how this is done to be successful. It looks like there's a library here that would be helpful: https://github.com/fsnotify/fsnotify

SmallCoccinelle commented 1 year ago

> A ton of effort has clearly gone into this application - inotify is typically how this is done to be successful. It looks like there's a library here that would be helpful: https://github.com/fsnotify/fsnotify

I looked into this (in 2021). File system notification is full of traps. First, operating systems have vastly different capabilities on what kind of notify-support they provide. Second, you are often limited in the amount of things you can watch for changes. Stash users regularly store way more data than what notifiers were initially designed to support. It's fine for watching 100 to 1000 files/directories. But once you get past that, the OS will just throw errors your way.

This leads to the idea that if you want to support this, you want a different design of stash. You want a rather small area of "intake" which is watched and then some kind of archival process moves files from there into an area which isn't automatically watched. You won't run into the limitations of notify-systems this way, but it also requires some major surgery in the way stash works.

DingDongSoLong4 commented 1 year ago

Automated filesystem watching (inotify) is a separate issue: #191. As @SmallCoccinelle mentions, there are a few issues with implementing this.

The reason why we consider Identify to be "advanced" is because it tends to produce inaccurate results, and its effects are irreversible. StashDB is community-sourced, and there are many scenes with incorrect hashes. And a stash-box is probably going to be the only useful scraper source when scanning, because the majority of other scrapers need a URL which won't be present after a scan.

Many users (me included) prefer using the Tagger interface over a fully automated solution (i.e. Identify). You're still using an automated scraper to get the metadata, but it is much easier to filter out the incorrect matches by just not clicking Save.

All that being said, I do think there is value in being able to automatically scrape content as it is scanned, similar to how you can do a generate as part of a scan. It should absolutely not be the default, and should not be a prominent feature, but I think it would be useful for those who simply want everything to be automatic and not have to tag anything manually.

KyleSanderson commented 1 year ago

> A ton of effort has clearly gone into this application - inotify is typically how this is done to be successful. It looks like there's a library here that would be helpful: https://github.com/fsnotify/fsnotify
>
> I looked into this (in 2021). File system notification is full of traps. First, operating systems have vastly different capabilities on what kind of notify-support they provide. Second, you are often limited in the amount of things you can watch for changes. Stash users regularly store way more data than what notifiers were initially designed to support. It's fine for watching 100 to 1000 files/directories. But once you get past that, the OS will just throw errors your way.

So, these need to be considered hints. You're simply watching directories for close events, then looking for an unknown entry and scanning it (or adding a new watch plus a full scan for a dir). If your db is fast enough, you can kick them into a back queue and bulk stat them on a schedule to see if a file has been damaged. You can do this live, but chances are you're not going to hit a cool CPU moment.
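A minimal Python sketch of the "hints" idea, using only the standard library: an event on a directory just prompts a listing that is diffed against what is already known, and a separate pass bulk-stats known files. The function names and the shape of `known_stats` are hypothetical; the event source (inotify or otherwise) is left out.

```python
import os


def unknown_entries(directory, known_paths):
    """Treat an event on `directory` as a hint and return entries not yet known."""
    with os.scandir(directory) as entries:
        found = {entry.path for entry in entries}
    return sorted(found - set(known_paths))


def changed_since_scan(known_stats):
    """Bulk-stat known files and report missing or modified ones.

    `known_stats` maps path -> (size, mtime_ns) as recorded at scan time.
    """
    changed = []
    for path, recorded in known_stats.items():
        try:
            st = os.stat(path)
        except FileNotFoundError:
            changed.append(path)
            continue
        if (st.st_size, st.st_mtime_ns) != recorded:
            changed.append(path)
    return sorted(changed)
```

The second function is the scheduled "back queue" pass: it only compares size and modification time, so it flags candidates for a rescan rather than proving damage.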

If you're watching files, indeed you're going to fail. There's also a sysctl that can be adjusted to crank the watches into the hundreds of thousands should that be desirable.
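To get a feel for whether a library fits, here is a small sketch that counts directories (one watch each when watching directories only) and compares the total with the Linux limit exposed at `/proc/sys/fs/inotify/max_user_watches`. It assumes Linux; the helper names are hypothetical, and the limit is per user, so other applications share it.

```python
import os

# Linux-only: the sysctl fs.inotify.max_user_watches is exposed at this path.
WATCH_LIMIT_PATH = "/proc/sys/fs/inotify/max_user_watches"


def read_watch_limit(path=WATCH_LIMIT_PATH):
    """Return the per-user watch limit, or None if it cannot be read."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


def count_directories(root):
    """Count root plus every directory beneath it (symlinks are not followed)."""
    return sum(1 for _dirpath, _dirnames, _filenames in os.walk(root))


def fits_watch_budget(root, limit):
    """True if watching every directory under root stays within the limit."""
    return limit is not None and count_directories(root) <= limit
```

For example, `fits_watch_budget("/data", read_watch_limit())` gives a quick yes/no before attempting to set up watches, with `/data` standing in for a library root.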

> This leads to the idea that if you want to support this, you want a different design of stash. You want a rather small area of "intake" which is watched and then some kind of archival process moves files from there into an area which isn't automatically watched. You won't run into the limitations of notify-systems this way, but it also requires some major surgery in the way stash works.

Not really. It sounds like the entirety of the pain here comes from the fact that the classification engine is not correct, and even outright wrong at times. These all sound like schedulable features that haven't landed or been prioritized. If there's no future for the feature in the product, that's another discussion. Not adding features because of other product bugs is devastating to morale and limits your creative freedom going forward. Imagine how much better the product will be when the bugs are fixed and you don't need to click 12 buttons then wait 2 hours for the scan to finish.

> Many users (me included) prefer using the Tagger interface over a fully automated solution (i.e. Identify). You're still using an automated scraper to get the metadata, but it is much easier to filter out the incorrect matches by just not clicking Save.

Remember that (for this reason, I suppose) you currently have all these different stages on the path to adding a single file to the library. If even the scan were automated, the time to classification and review would lessen by your entire library size.

I don't want to necessarily go back to Nullsoft, but Winamp had watch folders 20 years ago for this very purpose.

DingDongSoLong4 commented 1 year ago

Stash is not a self-contained, fully-automated system - it is meant to be an extensible and flexible way to add metadata to your porn in the way you want to. You can add your own scrapers to automate adding metadata, use community-sourced data from a stash-box, or do everything manually yourself.

So there is no "classification engine" - external scrapers or stash-boxes can "classify" content, but manual "classifying" is still very much part of Stash. Stash-boxes are generally crowd-sourced, and thus incorrect matches are common. That is just how it is. External scrapers also vary in quality, and the sites that they scrape data off of vary wildly in what data they provide. Many straight sites, for example, don't show male performers at all.

For many users, their Stash library is quite personalised - much more so than a music library or movie library. You tag your content in the way that you want to browse/watch it. Two different people, but with identical media (videos/images), can have their Stash metadata be completely different.

I don't think scheduling a scan is functionality that desperately needs to be in Stash itself, at least for the moment - the GraphQL API lets you launch a scan/generate/identify programmatically, so you can use cron or similar to accomplish that externally. The API also lets you run tasks after copying files, or whenever else you want really. I am aware that other organizers have schedulable tasks, and I agree that having such a feature would be useful; it's just not a priority at the moment.
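As a sketch of what such an external script could look like, the following chains a scan and an identify and waits for the scan job in between. Everything about the schema here is an assumption to verify in the playground: the input type names, the `findJob` query, and the status values. `run_query` is a hypothetical callable that posts a query plus variables to Stash and returns the decoded JSON.

```python
import time

# Assumptions: operation names, input types, and job status values are
# recalled from the schema shown in the playground; verify them there.
SCAN = "mutation($input: ScanMetadataInput!) { metadataScan(input: $input) }"
IDENTIFY = "mutation($input: IdentifyMetadataInput!) { metadataIdentify(input: $input) }"
FIND_JOB = "query($id: ID!) { findJob(input: {id: $id}) { status } }"
FAILED_STATES = {"CANCELLED", "FAILED"}


def wait_for_job(run_query, job_id, poll_seconds=5.0, sleep=time.sleep):
    """Poll a job until it finishes; return True on success, False otherwise."""
    while True:
        job = run_query(FIND_JOB, {"id": job_id})["data"]["findJob"]
        # Assumption: a job that is no longer listed has already completed.
        if job is None or job["status"] == "FINISHED":
            return True
        if job["status"] in FAILED_STATES:
            return False
        sleep(poll_seconds)


def scan_then_identify(run_query, paths, sources, **wait_kwargs):
    """Queue a scan, wait for it, then queue identify; return its job id or None."""
    scan_id = run_query(SCAN, {"input": {"paths": paths}})["data"]["metadataScan"]
    if not wait_for_job(run_query, scan_id, **wait_kwargs):
        return None
    variables = {"input": {"paths": paths, "sources": sources}}
    return run_query(IDENTIFY, variables)["data"]["metadataIdentify"]
```

A cron entry or a Windows scheduled task could then run a script that calls `scan_then_identify` at whatever interval suits the library.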

Regarding filesystem watching, I haven't looked into it myself, but I'm sure that there are ways to have it work without major design changes. However, Stash supports images in addition to videos, and that presents a problem with directories. If you're using folder-based galleries (i.e. every gallery is in its own folder), then you can easily end up with thousands of directories which then cannot be watched reliably.

I don't think having to change a sysctl is a good solution, so to get around the folder count issue, we need to use some sort of intake system (a "watch folder") like what @SmallCoccinelle mentions. As it is right now, Stash does not rename or move any of your content on its own, and this is what would need to change for an intake system to work reliably with large volumes of content. This also isn't necessarily a bug - it's useful to know that Stash will never move or delete your content unless you explicitly tell it to.

For now, you can very easily write a plugin or external script to accomplish all of this - Python of course has many libraries for this kind of thing. You can easily watch a folder, and when a new file appears, optionally move it somewhere else, and then trigger stash to run a scan on it.
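A minimal sketch of such a script, using polling and only the standard library rather than a watcher library. A file counts as ready once its size is unchanged between two polls (a crude guard against moving files that are still being copied). `trigger_scan` is a hypothetical callback that would ask Stash to scan the moved paths, and the function names are made up for illustration.

```python
import os
import shutil


def stable_files(folder, previous_sizes):
    """Return (ready, sizes): names whose size is unchanged since the last poll."""
    sizes = {}
    ready = []
    with os.scandir(folder) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            size = entry.stat().st_size
            sizes[entry.name] = size
            if previous_sizes.get(entry.name) == size:
                ready.append(entry.name)
    return sorted(ready), sizes


def process_intake(intake, library, previous_sizes, trigger_scan):
    """Move ready files from intake to library, then request one scan for them.

    Returns the size snapshot to pass in as previous_sizes on the next poll.
    """
    ready, sizes = stable_files(intake, previous_sizes)
    moved = []
    for name in ready:
        dest = os.path.join(library, name)
        if os.path.exists(dest):
            continue  # never overwrite existing library content
        shutil.move(os.path.join(intake, name), dest)
        moved.append(dest)
        sizes.pop(name, None)
    if moved:
        trigger_scan(moved)
    return sizes
```

Calling `process_intake` in a loop with a sleep between polls, feeding each returned snapshot into the next call, gives a simple watch folder that only ever looks at the small intake directory.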

nod44 commented 5 months ago

I think a lot of people shared great perspectives in this thread, although the points of view here might be slanted more towards the power-user persona than the mass market.

Stash would be more popular if the application worked more like the ARR apps in supporting scheduling and automatic identification of files that users might be downloading en masse with tools like Prowlarr. It seems like there is already an API for this feature -- so implementing cron-style scheduling functionality is likely a fairly low lift (it needs a scheduling engine / periodic check).

This leads to the real issue -- capacity... Should we add this to the roadmap? https://github.com/orgs/stashapp/projects/5

stashapp / stash

[Feature] Join "Scan" and "Identify" tasks, preferably schedulable. #2084