[RFC] Scenes with 100's of bad phashes keep growing faster without stopping

stashapp / stash-box

Stash App's own OpenSource video indexing and Perceptual Hashing MetaData API

MIT License


[RFC] Scenes with 100's of bad phashes keep growing faster without stopping #814

Open YonderboyLupus opened 2 months ago

YonderboyLupus commented 2 months ago

Scope

This scene is an example. There are over 300 fingerprints including over 100 phashes. The durations range from several seconds to over an hour, so they clearly are mostly just wrong and not legitimate phashes from abridgments, re-edits, etc. Apparently, close-enough-to-match, but wrong, phashes are being submitted, and as more accumulate, the probability of such misidentification increases. There were 25 new phashes in 2022, 61 in 2023, and 59 in 2024 year-to-date, showing that the phash count keeps increasing, and faster over time. I found another similar scene while running the Identify task, so there are likely many others.

Long Form

In brief, I'm suggesting: 1) Some exploratory queries are done to determine the scope of the problem: how many scenes have an egregious number of phashes, what are the worst cases, and how fast are they growing? 2) Remediate the worst cases found above by manually removing all or most of the fingerprints. 3) Long term, find a way to address these cases before they become too large, maybe through email alerts to an admin with this job. More details on these suggestions are below.

In the Discord discussion of this, the concern I'm raising here became somewhat conflated with the more general problem that many scenes have some bad fingerprints, or what the procedure should be to remove bad fingerprints in these ordinary cases, or how this might be automated. I'm not arguing for any change in the automatic behavior of StashDB or the Stash client, but that these especially egregious cases should be looked into and remediated, even if just manually for now. These cases are very different from the usual case of a scene having some incorrect phashes. The usual case hasn't achieved the critical mass that makes it continue to grow forever. Both cause incorrect IDs for users, but the egregious case will cause more and more incorrect IDs for them as time goes on. (Not a duplicate of 304).

My suggestions in further detail: 1) I think at the very least, someone should take a few minutes to estimate how bad the problem is by running a few exploratory queries directly on the db or on a backup. If there are only a few cases like this, it may not be a big deal. If there are many or they are growing very fast, it may be a higher priority.

For instance:

select scene_id, count(hash)
from scene_fingerprints
where algorithm = 'PHASH'
group by scene_id
having count(hash) > 50
limit 10;

I put in a limit clause so it doesn't run too long, but if doing this on a backup db you could find all such cases and rank them by hash count. Then individual scenes could be looked at for growth rate.
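As a rough sketch of what that ranking and growth-rate follow-up could look like, here is a self-contained version run against a throwaway in-memory SQLite table. The `scene_fingerprints` table and its `scene_id`, `hash`, and `algorithm` columns come from the query above; the `created_at` column, the sample rows, and the helper names are assumptions made up for illustration, and SQLite is used only so the example runs standalone.

```python
import sqlite3

# Hypothetical miniature of the table named in the query above.
# The created_at column is an assumption added to illustrate growth rate.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scene_fingerprints ("
    "scene_id TEXT, hash TEXT, algorithm TEXT, created_at TEXT)"
)
rows = [
    ("scene-a", "h1", "PHASH", "2022-03-01"),
    ("scene-a", "h2", "PHASH", "2023-05-01"),
    ("scene-a", "h3", "PHASH", "2023-09-01"),
    ("scene-a", "h4", "MD5", "2023-09-01"),
    ("scene-b", "h5", "PHASH", "2024-01-01"),
]
conn.executemany("INSERT INTO scene_fingerprints VALUES (?, ?, ?, ?)", rows)

MIN_PHASHES = 2  # stand-in for the threshold of 50 used in the query above


def worst_scenes(conn, min_phashes):
    """Scenes with more than min_phashes phashes, largest count first."""
    return conn.execute(
        "SELECT scene_id, COUNT(hash) AS n FROM scene_fingerprints "
        "WHERE algorithm = 'PHASH' GROUP BY scene_id "
        "HAVING COUNT(hash) > ? ORDER BY n DESC",
        (min_phashes,),
    ).fetchall()


def growth_by_year(conn, scene_id):
    """Number of new phashes per calendar year for one scene."""
    return conn.execute(
        "SELECT substr(created_at, 1, 4) AS year, COUNT(*) "
        "FROM scene_fingerprints "
        "WHERE algorithm = 'PHASH' AND scene_id = ? "
        "GROUP BY year ORDER BY year",
        (scene_id,),
    ).fetchall()


print(worst_scenes(conn, MIN_PHASHES))
print(growth_by_year(conn, "scene-a"))
```

Dropping the limit and ordering by count only makes sense on a backup, since it has to scan every phash row.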

2) Based on findings from above, make a remediation plan. Choose some threshold number of phashes that is too large, even in the case of many legit re-edited and re-released versions of the scene. If there are many bad cases, there may be more of an argument for automation. I would argue it would be better to err on the side of deleting all the fingerprints and starting over for these scenes, but if a canonical phash can be determined, just that one could be kept, or all that are relatively close to it in duration.

Before any rows are deleted, they could be backed up somewhere in case there is desire to analyze the source of bad fingerprints later. If some users are especially bad repeat offenders, they might be asked to stop.

3) Long term, find a way to address these cases before they become too large, maybe through email alerts to an admin with this job. I know email alerts are still being worked on as a feature (284), but in the shorter term, this could just be a report that is run occasionally and then reviewed.

Examples

Example scene fingerprint list: https://stashdb.org/scenes/24a9fa74-1bde-4406-946f-1524f5784b32#fingerprints

Gykes commented 2 months ago

I agree with pretty much everything you said here. I believe there is a Google Doc out there somewhere with a list of known bad hashes. The issue is just finding people that would want to sift through it and delete bad stuff.

Per your example, if it was me, I would just wipe all the phashes out and then ask someone with the scene to resubmit so there is at least one for reference. It becomes problematic on ones that aren't so obvious and would require looking at each one.

If I'm being honest I don't fully trust phash match for this issue. It only takes 1 person to submit a bad fingerprint for it to cause these types of problems down the road.

BonerFide commented 2 months ago

Some scenes do have a high % of bad PHashes. Unfortunately what you've shown isn't actually a scene with lots of very bad PHashes but an example of how the PHash algorithm is inaccurate for very short duration scenes. Most PHashes in the example are actually fairly close to other PHashes, and so the scene has been 'walked' through a bigger range. For most of the bad examples if you only had the 'real' Phash this scene would STILL be suggested to people with those shorter scenes anyway, and you'd be right back to square one within a week. It is absolutely possible to have a scene with 50 different PHashes that are all 'correct'.

The scene itself also doesn't include a duration, so algorithmically it's impossible to tell which is the canonical duration. This is really only a problem with scenes that were added to stashdb during the initial population (you can see by the fact that the edit history doesn't include an initial create edit). This is why the problem has grown since changes this year made it more obvious in tagger which is the 'correct' scene.

My suggested fix here is to add the real scene duration to that scene. This will stop the other durations from being aggressively suggested in tagger.

I wholesale disagree with your proposed solution as both ineffective and unworkable from an effort point of view, as well as having other negative side effects such as NOT suggesting the correct scene due to things like added intros / and making it much harder to remediate scenes by over-relying on admin intervention which I guarantee will not happen. There's an enormous list of potentially bad PHashes, and nothing has, or in my view, ever will happen with these.

A better suggestion IMO is to

a) Work through all scenes without a duration and add a duration, or just bulk delete them all because they will almost all suffer from this issue. b) Change the PHash algorithm for shorter scenes to capture more variation c) Never suggest scenes of dramatically different duration from the canonical duration via identify, and/or disallow identify to function based on PHash for short scenes d) Add an extra confirmation step in tagger if matching a scene of a very different duration from the canonical duration, or if no canonical duration exists.

Making these 4 changes would both fix any issues caused by the fingerprints you highlighted, making removing them irrelevant permanently without ongoing manual effort required to monitor. It also wouldn't have the negative side effect of preventing people from matching abridged versions.
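To make suggestions c) and d) a bit more concrete, here is a minimal sketch of a duration check a client could apply before accepting a phash match. The function name, the return values, and the 1.5x ratio are all hypothetical; nothing here reflects how Identify or Tagger currently behave.

```python
from typing import Optional


def match_action(local_seconds: float,
                 canonical_seconds: Optional[float],
                 max_ratio: float = 1.5) -> str:
    """Decide how a phash match should be treated based on duration alone.

    Returns "ok" when the durations are close enough to accept, and
    "needs_confirmation" when a human should look at the match first.
    The max_ratio default is an illustrative assumption, not a tuned value.
    """
    # No canonical duration on the scene: nothing to compare against.
    if canonical_seconds is None or canonical_seconds <= 0:
        return "needs_confirmation"
    if local_seconds <= 0:
        return "needs_confirmation"
    longer = max(local_seconds, canonical_seconds)
    shorter = min(local_seconds, canonical_seconds)
    if longer / shorter > max_ratio:
        return "needs_confirmation"
    return "ok"


# A 16 second clip offered against a 30 minute scene is flagged.
print(match_action(16, 1800))
# A file within a few seconds of the canonical duration passes.
print(match_action(1790, 1800))
```

Under this sketch, Identify would only act on "ok", while Tagger would show the extra confirmation step for everything else, so abridged versions could still be matched by hand.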

BonerFide commented 2 months ago

Basically, stopping 'bad' matches is a way way better solution than continually monitoring them and deleting them. People aren't putting them in there on purpose. They're actually being suggested to them as 'good' matches without sufficient warning that they are not good matches. Some of this is a data problem in StashDB (missing canonical duration) and some is a problem with tagger (insufficient warning). The tagger view has significantly improved in the last year, so some of the mis-matches are legacy due to a similar PHash being added prior to these changes. But it still has problems with scenes without a canonical duration and providing sufficient warning should it not exist.

Identify is also just far too generous in what it considers a match, with no configurability. Even if you only had correct fingerprints on scenes, you can have collisions (e.g. multiple scenes with the same or very similar PHash). So you can't fix identify to be more accurate without the ability for it to be set to be more cautious.

echo6ix commented 2 months ago

I am not claiming to know the optimal approach here, nor am I questioning the validity of the solutions proposed. However, I want to address this, and forgive me if I'm taking the quote a bit out of context and veering into a tangent. I just want to emphasize a point that seems to get overlooked often:

People aren't putting them in there on purpose.

Maybe, maybe not, but the ability to vandalize the database through erroneous Tagger submissions is wide open.

StashDB has been fortunate to have mostly good faith actors over the past four years, but that pool of contributors has been relatively small. As it grows, so do the odds of bad faith actors.

The threat of vandalism is real, and I think when proposing solutions the calculus should include an assumption that there is a threat of bad faith actors willing to vandalize the database.

Anecdotally, over the years I've seen bad faith actors on StashDB spring from trivial disputes into buttmad resentment and then varying degrees of vandalism or threats of it. We now have recurring trolls, one of whom went so far as to show intent to dox a contributor. If we already have that level of unhinged, who's to say a disgruntled user won't stoop to polluting the db of fingerprints with erroneous Tagger submissions?

TLDR when implementing something to mitigate bad fingerprint submissions, do include in the decision making process the assumption that bad faith actors do exist.

BonerFide commented 2 months ago

People aren't putting them in there on purpose.

Maybe, maybe not, but the ability to vandalize the database through erroneous Tagger submissions is wide open.

I think the beauty of my suggested solution is that erroneous fingerprints simply have minimal to no impact on anyone. It requires no intervention by admins, and it doesn't matter how many people try to vandalize it, intentionally or otherwise: they can't cause something like identify to malfunction because you can filter out such entries at the client. The users, having potentially got the scene in question, are in the best position to determine if a 'maybe' match is a real match.

Meanwhile anything that requires admin intervention requires you to know for certain that the fingerprint is wrong. As I've found scenes in the past where 100% of the few dozen fingerprints are definitely wrong (they were for another scene) this means you can't even get a 'good' idea that the fingerprint is wrong unless you have both the correct and incorrect scene.

This moves the biggest opportunity for the scene to be incorrect to the initial entry.

Something that can be actioned by an admin, creates the potential for more abuse/vandalism, since the will of the crowd can really be overridden by someone suggesting that a bunch of fingerprints are wrong to an admin, while the admin has no capability to check this even if they have the scene.

From a solution point of view it makes no difference if the actions are intentional or accidental. That fingerprints will be mis-matched is an absolute certainty, the solution needs to mitigate that reality rather than attempt to manipulate it through manual action, something that just won't happen.

Basically the stash client makes the assumption the fingerprints are correct far too easily. It doesn't use all the information potentially available to it to decide if this is a pretty certain bet, or something that requires human review. That's the weakness in the system: even if all the fingerprints are perfect, there will still be collisions.

The motivation to vandalize intentionally goes away when it no longer impacts anyone as well, so those comparatively rare cases also go away.

YonderboyLupus commented 2 months ago

the PHash algorithm is inaccurate for very short duration scenes

This does seem to be the crux of the issue. Of the 148 phashes, 126 are less than 5 minutes in duration. According to the studio link, the real duration is "30 min". In this second example I found, 132 of 162 phashes are less than 5 minutes, and 80 are less than 10 seconds. It seems that phashes of short clips aren't sufficiently different from each other to be used for identification. I had assumed there was a critical number of phashes after which fast-growth would start, but from graphing phash count over time, it looks like getting even just one of these pathologically over-matchable phashes (which was a close match for the initial, probably-correct 30 minute phashes) is enough to start the fast growth immediately.

I suspect a lot of these short clips are from fansites. Fansites aren't part of StashDB's scope, but people are still running Identify and Tagger on them. In my case, the false matches were on fansite clips of 0:16 and 7:38 durations.

Given this, I'm now more inclined to think stopping bad phash matches from happening in the first place needs to be a part of the solution. Though if that stopped new bad cases from cropping up, I don't see why the existing cases shouldn't be addressed in one final clean-up. Otherwise, they would still cause many false IDs for users. I agree with some of Bonerfide's suggestions towards this goal, but not others. But before that:

Does anyone have thoughts on part 1 of my suggestions (running exploratory queries directly on the db or a backup)? This couldn't negatively affect anything, and then even if the decision was made not to address these cases, it would be made knowing the magnitude of the problem, and roughly how much bad data is being added to the db in a given time period. Apparently there is already a Google Doc tracking these? Are bad stashID/phashes added as they are found, or is it populated by some query? May I see the Doc if it's widely sharable, or if not, could someone summarize how many entries there are and about how often they appear? Also, Bonerfide says this is largely a problem with scenes missing canonical durations. Can this be verified? How many scenes are missing canonical durations? What fraction of duration-less scenes are experiencing fast-growth? The validity and practicality of solutions being discussed are affected by these answers.

c) Never suggest scenes of dramatically different duration from the canonical duration via identify, and/or disallow identify to function based on PHash for short scenes

I think something like this is probably needed to fix the problem. Under 5 minutes or so, phashes seem to have the problem of incorrectly matching each other too often. This threshold could be determined more precisely, and StashDB could stop accepting phashes shorter than that duration. I imagine almost all studio scenes are longer than this minimum duration, but the few that are shorter would just not be identifiable by phash, only by query, fragment, or MD5/OHASH exact match. If someone links their shorter scene to a StashID in Tagger (whether rightly or wrongly) that phash shouldn't be flagged to be sent to StashDB, maybe just the MD5/OHASHes. There might also be a range of durations, above the minimum, but which are still short enough to cause problems. Maybe there should be a stricter requirement on distance between phashes if at least one of the phashes' durations is in this range, and/or the two durations are very different from each other. I wonder if FansDB has a strategy for dealing with these shorter clips?
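A sketch of the duration-dependent distance requirement described above, assuming phashes are compared as hex strings by Hamming distance. Every threshold here (the 300 second minimum, the 600 second "short" range, the 2x duration ratio, and the allowed distances of 8 and 2) is a hypothetical placeholder that would need to be determined from real data.

```python
from typing import Optional


def hamming_distance(phash_a: str, phash_b: str) -> int:
    """Number of differing bits between two hex-encoded hashes."""
    return bin(int(phash_a, 16) ^ int(phash_b, 16)).count("1")


def allowed_distance(duration_a: float, duration_b: float,
                     min_duration: float = 300.0,
                     short_limit: float = 600.0,
                     normal: int = 8,
                     strict: int = 2) -> Optional[int]:
    """Maximum phash distance to accept, or None if phash matching is off.

    All defaults are illustrative assumptions, not values used by stash-box.
    """
    shorter = min(duration_a, duration_b)
    longer = max(duration_a, duration_b)
    if shorter < min_duration:
        return None  # too short: fall back to exact hashes or manual search
    if shorter < short_limit or longer / shorter > 2:
        return strict  # short or very different durations: be stricter
    return normal


def phash_match(phash_a: str, phash_b: str,
                duration_a: float, duration_b: float) -> bool:
    """True if the two hashes are close enough for their durations."""
    limit = allowed_distance(duration_a, duration_b)
    if limit is None:
        return False
    return hamming_distance(phash_a, phash_b) <= limit


# Two hashes that differ in 4 bits match for full-length scenes...
print(phash_match("ffffffffffffffff", "fffffffffffffff0", 1800, 1790))
# ...but not for scenes in the stricter short-duration range.
print(phash_match("ffffffffffffffff", "fffffffffffffff0", 400, 420))
```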

YonderboyLupus commented 2 months ago

I think suggestions "a)" and "d)" are based on incorrect premises, and either would have negative effects or just not solve the problem. The incorrect premises are 1) that duration-less scenes inevitably have this problem, and 2) that Tagger UI improvements would stop the majority of the bad submissions.

Incorrect Premise 1): Duration-less Scenes inevitably experience fast-growth

For most of the bad examples if you only had the 'real' Phash this scene would STILL be suggested to people with those shorter scenes anyway, and you'd be right back to square one within a week.

I looked at 20 of the oldest scenes from this studio which don't have durations and didn't find any others with a large phash count, fast-growth of phashes, or a large variance of fingerprint durations. Their fingerprint history dates back to 2021, so they've been working fine for years. There's no reason the example case wouldn't also stay fixed if all the fingerprints were removed, since the user who submitted the initial over-matchable phash probably wouldn't do so again. If someone did a more comprehensive query or report showing most duration-less scenes have problems, I'd change my mind on this. Premise 1) is the basis of:

a) Work through all scenes without a duration and add a duration, or just bulk delete them all because they will almost all suffer from this issue.

I'm not against a project to add missing durations, but this would be a lengthy effort of many editors. Why consider that realistic while stating categorically that nothing will ever happen with the bad phashes? There are likely many more scenes with missing durations than there are scenes with many bad phashes. Some scenes may not have an authoritative source for their duration anywhere. Would these scenes just have to be banned from the StashDB forever? The second version of this suggestion, bulk deleting all duration-less scenes, would be a huge waste of editors' work to provide their metadata, especially when just deleting their fingerprints would have worked instead. You are (rightly) concerned about preserving matching of abridged scenes through phash matching, but bulk deleting these would make them unavailable for phash matching, and exact hash matching, and matching by query, while my suggestion would only temporarily make a much smaller subset of scenes unavailable for fingerprint matching, until new ones are submitted.

Incorrect Premise 2): Tagger UI improvements would stop the majority of the bad submissions

d) Add an extra confirmation step in tagger if matching a scene of a very different duration from the canonical duration, or if no canonical duration exists.

I'm not against these UI changes per se, and I'd personally find them helpful. I just don't think they'd solve this problem. In example 2, there were 80 cases of people running Tagger on 1-10 second clips, then being shown a scene cover image that is very different from their clip's generated cover, with a scene name that doesn't match their filename, but still clicking "Save". I don't think these peoples' problem was that they just needed to know the duration was off, or that they wouldn't just blindly click through a confirmation screen as well. We don't need to assume malice, but they're likely either just not careful with matching at all, or they're using a plugin like this one to get a "Save All" button, and accepting matches without seeing them first:

Scenes tagger

Maybe the plugin could be rewritten so fingerprints from matches made this way aren't flagged to be sent to the stash-box? If people want to be careless with matches on their own data, that's their prerogative, but they shouldn't then submit matches that have had no human review at all. In any case, I think an automatic block of matches on too-short scenes along the lines of "c)", or the stash-box not accepting fingerprints from these matches, would be a more effective solution. The check needs to happen on the stash-box side, since you can't assume the user isn't running a plugin that bypasses a check client-side.

Other thoughts:

b) Change the PHash algorithm for shorter scenes to capture more variation

Wouldn't this invalidate all the current phashes in the db with shorter durations? That might be acceptable if they are all being thrown out anyway to implement the "disallow identify to function based on PHash for short scenes" part of "c)". I imagine it would also require changing the Stash client to use this new algorithm and then a change to the API version so the stash box knows which version of the phash it is getting from different client versions. I don't know enough to have an opinion on how likely this is.

DogmaDragon commented 2 months ago

I wonder if FansDB has a strategy for dealing with these shorter clips?

We haven't run into any duration related pHash issues, even allowing scenes as short as 1 second (though the hash could be improved for shorter files).

I suspect the bad hashes StashDB is dealing with is mostly the issue of the past before pHash was implemented. Where they bulk imported a bunch of scenes without any hashes. Which is the case for both your examples (you can check that by seeing they have no edit history for the scene being created).

<...> they're using a plugin like this one to get a "Save All" button, and accepting matches without seeing them first:

The plugin by default clears the fingerprints from the queue when using Save All feature, they are not getting submitted back to stash-box instances.

BonerFide commented 1 month ago

I looked at 20 of the oldest scenes from this studio which don't have durations and didn't find any others with a large phash count

I've looked at thousands of mismatched scenes and missing duration is the prime cause (along with no original fingerprint, so any matches that did exist required someone to search by performer name / studio, probably picking the wrong scene: because the real scene for that combination didn't exist, they picked the one that did). The reverse doesn't work and is an incorrect assumption. Just because a missing original duration causes mismatches doesn't mean mismatches are inevitable. Also there will be a 'soaking' effect from looking at the same studio, e.g. if the studio uses a common set, common watermark etc, then having one very heavily fingerprinted scene from that studio will decrease the chance that others from that studio have the problem. Basically you are looking for mis-matches, whereas I have found the mis-matches when tagging and determined the cause/s. There are a lot of scenes with no mis-matches because other mis-matches have 'soaked them up'.

I'm not against a project to add missing durations, but this would be a lengthy effort of many editors.

In theory you could just run a query that set the most common duration as the canonical duration. It is unlikely that this is incorrect, though in some rarer cases I have seen it is. It's way, way less effort than removing fingerprints that you don't really know are correct or incorrect, because you only need one correct file to know it's correct.
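A minimal sketch of that idea: group a scene's fingerprint durations into small buckets, take the most common one, and report what share of fingerprints agree with it so a reviewer can spot the rarer cases where the majority is wrong. The function name, the 5 second bucket size, and the sample durations are assumptions for illustration only.

```python
from collections import Counter
from typing import List, Optional, Tuple


def suggest_canonical_duration(
    durations: List[float], bucket_seconds: int = 5
) -> Optional[Tuple[int, float]]:
    """Return (suggested duration in seconds, share of fingerprints agreeing).

    Durations are rounded to the nearest bucket so that re-encodes differing
    by a second or two count as the same duration. Returns None if the scene
    has no fingerprint durations at all.
    """
    if not durations:
        return None
    buckets = Counter(round(d / bucket_seconds) for d in durations)
    bucket, count = buckets.most_common(1)[0]
    return bucket * bucket_seconds, count / len(durations)


# Hypothetical fingerprint durations for one scene, in seconds.
sample = [1800, 1801, 1799, 16, 5, 458]
print(suggest_canonical_duration(sample))
```

A low agreement share would be a reason to leave the scene for manual review rather than set the duration automatically.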

Some scenes may not have an authoritative source for their duration anywhere. Would these scenes just have to be banned from the StashDB forever? The second version of this suggestion, bulk deleting all duration-less scenes, would be a huge waste of editors' work to provide their metadata, especially when just deleting their fingerprints would have worked instead.

If no one has the file how can you possibly determine you are deleting the right fingerprints instead of just deleting most fingerprints and leaving a bunch that are wrong? As I've discussed, many scenes originally had incorrect fingerprints from their first submission so the originally submitted fingerprint may not be correct. I'm not seriously suggesting deleting the scenes in most cases, but the same applies to fingerprints: these are 'work' for people who have gone through and matched scenes. Deleting them without any knowledge of whether they are actually an abridged version, or actually the only one in a sea of incorrect fingerprints that is right, decreases the utility of StashDB as a whole. Yes, you might not want to auto-identify a scene based on a rare fingerprint, even if that one is right, but people are often looking at a file with a random filename they downloaded and using the fact that it has a PHash someone else has to identify it.

Incorrect Premise 2): Tagger UI improvements would stop the majority of the bad submissions

The fact that newer sites like FansDB don't have them, and have been around since some fairly minor tagger improvements kind of proves this point.

then being shown a scene cover image that is very different from their clip's generated cover, with a scene name that doesn't match their filename, but still clicking "Save".

Until relatively recently, there was no visual difference AT ALL displayed in the tagger between a scene where you matched 99/100 fingerprints and 1/100 fingerprints.

It's quite possible the image was very similar (maybe they are the same scene just a clip from it), also that's a problem with the fact that we use covers, and that local users may not have generated a full preview / sprites. One of the things I would like is sprites or at least one image ~15% into the scene to be used in the tagger from BOTH ends. At the moment the one thing we know at stashDB with almost certainty is that the cover will be different in some capacity, so people are being trained to ignore differences in the image in favor of same performer/same logo (often not there in a cover)/same set. The same way the duplicate finder compares scenes, not using the cover. This is a bigger problem in say PMV Stash where the cover doesn't even have to have the performer/s on it.

The 'right' scene also quite often has a very different cover from the frame of video.

Filename is a good 'does this make sense' but filenames are still sometimes random nonsense.

People need big contrasty warnings with minimal false negatives or positives, not what they previously got which were all ticks on something that was obviously wrong, or even now a fairly subtle indication which is probably 1/4 of the time giving a false negative and training people to ignore it.

Wouldn't this invalidate all the current phashes in the db with shorter durations?

Yes, you would have to make it a new type like 'PHash2' if you wanted to keep them as separate indicators. It's mainly a problem for talking head type scenes, eg a performer giving an interview against the same background or worse, an all white background. The PHash algorithm sees that as a dark spot in a light background and there's minimal variation due to the fact we're compressing the detail in each frame by using more frames.

The biggest reason to handle this on the client is that any such rules handled on the client work regardless of what people do maliciously, they don't need admin action so they won't 'return' to being a problem after some big clean up.

Anything you can script server side, the client can decide, and you can leave it up to the end user how 'sure' they want to be for a match. Some people may want to configure identify to work only when it gets no negative indicators, others may be ok with it being right 99% of the time, others 90% of the time if it saves them manual effort. It's also impervious to people being malicious, if you're making it obvious that these things are different, then it's on the person doing the matching that they accepted the incorrect scene.
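As a sketch of what a user-configurable strictness setting might look like on the client, the snippet below counts negative indicators for a candidate match and lets the user choose how many they are willing to tolerate before a match is auto-accepted. The `MatchEvidence` fields, the indicator names, and the 10% share threshold are hypothetical and not taken from the Stash client.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MatchEvidence:
    """Hypothetical summary of what the client knows about one candidate."""
    matching_fingerprints: int   # scene fingerprints that match the local file
    total_fingerprints: int      # all fingerprints recorded on the scene
    duration_ok: Optional[bool]  # None when the scene has no canonical duration


def negative_indicators(e: MatchEvidence, min_share: float = 0.1) -> List[str]:
    """List the reasons to be suspicious of this match."""
    reasons = []
    if e.total_fingerprints == 0 or (
        e.matching_fingerprints / e.total_fingerprints < min_share
    ):
        reasons.append("few matching fingerprints")
    if e.duration_ok is None:
        reasons.append("no canonical duration")
    elif not e.duration_ok:
        reasons.append("duration mismatch")
    return reasons


def auto_accept(e: MatchEvidence, max_negatives: int = 0) -> bool:
    """Accept automatically only if the user's tolerance is not exceeded."""
    return len(negative_indicators(e)) <= max_negatives


weak = MatchEvidence(matching_fingerprints=1, total_fingerprints=100,
                     duration_ok=None)
strong = MatchEvidence(matching_fingerprints=99, total_fingerprints=100,
                       duration_ok=True)
print(negative_indicators(weak))
print(auto_accept(weak), auto_accept(strong))
```

A cautious user would leave `max_negatives` at 0, while someone who prefers less manual effort could raise it.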

BonerFide commented 1 month ago

We haven't run into any duration related pHash issues, even allowing scenes as short as 1 second (though the hash could be improved for shorter files).

I suspect the bad hashes StashDB is dealing with is mostly the issue of the past before pHash was implemented. Where they bulk imported a bunch of scenes without any hashes. Which is the case for both your examples (you can check that by seeing they have no edit history for the scene being created).

Yeah I think that is a number of reasons.

Places like FansDB have always required hashes (there's still the occasional collision on PMVStash given the nature of the work). Also lots of FansDB has very few matches at all, it's still quite small/young in comparison to StashDB users. And it's had the better tagger interface in use at the client for most of its existence (I think this is a big one).
I do actually get the reverse, say, when I run tagger against a bunch of scenes on Fansdb it does start suggesting Fansdb scenes to my StashDB type scenes mixed into the list. They are very obviously wrong and thus easy to identify/ignore when they happen. In the case where we know the original is 2 minutes and we have a 20 minute scene then perhaps that case filtering out on the client permanently would be good. As you can make a scene shorter, but it's unlikely to be made exponentially longer.
With fewer fingerprints and re-encodes OSHash / MD5 makes more difference. Lots and lots of the studio scenes are re-encoded, thrown on tube sites, cut down etc. While it seems largely people get their FansDB clips mostly unadulterated bar a few resolutions and a few rarer exceptions which FansDB at least in theory is supposed to handle as different studios for each legitimate distribution.
Minor, but people probably expect the scene not to exist, while I've frequently seen people match in stashDB based on 'same performer' + 'same studio' text search. They assume that the one result returned will be correct because they're assuming that the scene exists. While the issue is that the performer with that studio has multiple scenes, and so the fingerprints end up consolidated to the one scene, in many cases, in an unfavorable ratio to the 'real' fingerprints that may get added later.

Sounds like if Save All doesn't send fingerprints, this is purely about making sure tagger makes it really obvious when fingerprints have anything known suspicious about them, and being able to realistically grade how suspicious based on as many factors as are available.