stashapp / stash

An organizer for your porn, written in Go. Documentation: https://docs.stashapp.cc
https://stashapp.cc/
GNU Affero General Public License v3.0

[RFC] Phash Improvement for Short Durations #3722

Open Flashy78 opened 1 year ago

Flashy78 commented 1 year ago

Problem

The current phash implementation builds a collage image of 25 frames taken at regular intervals, ignoring the first and last 5% of the video. The phash of this image is then calculated and stored as the phash of the video.
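A minimal sketch of the frame sampling described above, not the actual Stash code. It assumes the 25 frames are spaced evenly across the middle 90% of the video; the function name `sampleTimes` and the exact spacing rule are illustrative assumptions.

```go
package main

import "fmt"

// sampleTimes returns n timestamps (in seconds) spread evenly over the
// middle 90% of a video, skipping the first and last 5%.
// Assumption: the exact spacing rule is illustrative, not taken from Stash.
func sampleTimes(durationSec float64, n int) []float64 {
	if n <= 0 || durationSec <= 0 {
		return nil
	}
	offset := 0.05 * durationSec            // skip the first 5%
	step := 0.90 * durationSec / float64(n) // n frames across the middle 90%
	times := make([]float64, n)
	for i := range times {
		times[i] = offset + float64(i)*step
	}
	return times
}

func main() {
	// For a 20-second clip, 25 samples land well under a second apart,
	// which is why the collage ends up so repetitive.
	ts := sampleTimes(20, 25)
	fmt.Printf("first=%.2fs last=%.2fs step=%.2fs\n", ts[0], ts[len(ts)-1], ts[1]-ts[0])
}
```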

This has been a successful algorithm and has allowed matching of similar videos that are not exactly the same.

However, over time it has become apparent that this algorithm is not well suited to shorter duration videos, as the frames tend to be very similar without much change between them, resulting in a collage with lots of repetition. Images with repetition like the sky or the ocean do not hash very well with phash algorithms, and end up with similar hashes. This means many short videos end up with similar hashes, even though they are different.

Recommendation

I've run several tests generating more/fewer frames per collage, larger/smaller/portrait/landscape frames and collages, as well as switching to a wavelet hash function. After testing, I recommend that the phash implementation be modified to take fewer frames for shorter videos, to ensure less repetition in the collage image. I suggest the following for video length x:

Time                      Frames
x > 2.5 min               25 (current)
1.5 min < x <= 2.5 min    16
45 s < x <= 1.5 min       9
x <= 45 s                 4

This aims for roughly 7 seconds between frames, until the shortest videos all just get 4 frames. In testing, similar 1 second clips from the same video produced hashes more than 10 apart in Hamming distance.
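The proposed thresholds can be sketched as a small lookup. This is only an illustration of the proposal: the function name `frameCount` and the use of seconds as the unit are assumptions, and the boundaries follow the reading that 25 frames applies above 2.5 minutes and 4 frames at 45 seconds or less.

```go
package main

import "fmt"

// frameCount returns how many frames to place in the collage for a video
// of the given length, following the thresholds proposed in this RFC.
// Hypothetical helper, not an existing Stash function.
func frameCount(durationSec float64) int {
	switch {
	case durationSec > 150: // longer than 2.5 min: current behaviour
		return 25
	case durationSec > 90: // 1.5 min < x <= 2.5 min
		return 16
	case durationSec > 45: // 45 s < x <= 1.5 min
		return 9
	default: // 45 s or shorter
		return 4
	}
}

func main() {
	for _, d := range []float64{10, 60, 120, 600} {
		fmt.Printf("%4.0fs -> %d frames\n", d, frameCount(d))
	}
}
```

Each count is a perfect square (25, 16, 9, 4), so the collage can stay a square grid at every tier.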

Result

Reducing the number of frames taken means it's harder to match slightly different videos, but it's probably a worthwhile tradeoff, and it will only affect shorter videos. This does not eliminate similar hashes of different videos, but should provide a large reduction in false positives for short videos. Videos of any length with a fixed camera, a wide shot and only a minimal amount of movement in one section of the frame will still produce repetitive collages that produce similar hashes.

Open Question

Modifying the phash function to implement this is trivial. The larger question is what should be done about all of the existing videos that are 2.5 min or shorter? A migration task could be written, but then nobody's library would match up with StashDB anymore. There are definitely plenty of valid short videos already on StashDB. A new hash type could be created in Stash (shash?), but it would need to be simulated as a phash when interacting with StashDB.

scruffynerf commented 1 year ago

My take on Open Question(s)

A migration task could be written, but then nobody's library would match up with StashDB anymore. There are definitely plenty of valid short videos already on StashDB.

The problem of how to obsolete the older method is interesting: we know the duration of any given hashed video on StashDB, so we could flag those as obsolete after date x, and matches would be encouraged against the new phash method for durations under 2.5 minutes. Keeping in mind that vast majority of videos on StashDB are greater than 2.5 minutes, this is not a huge issue. For PMVstash (where sub 2.5 minute videos are common), we only have 1700 videos in total, so it's an easy migration: someone with a stashid-ed match can rehash, resubmit the match (based on oshash/duration/title), and repopulate the db. The same applies to StashDB, but it is less likely to occur. Unsure about TPDB, but I suspect it is closer to the StashDB numbers, despite being 10x the size.

If you phash the new way and get no match, we could potentially have a 'phash check' that uses the old way and sees whether that phash is listed. This could be a Python-based scraper.
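A rough sketch of that fallback idea, assuming 64-bit hashes and a hypothetical `known` callback standing in for whatever lookup the stash-box instance offers. None of these names exist in Stash; the legacy hash is computed lazily because generating a collage is the expensive step.

```go
package main

import "fmt"

// matchWithFallback tries the new-style hash first and only computes the
// legacy hash when the new one is not listed.
// Hypothetical: known and legacyHash are placeholders, not real Stash APIs.
func matchWithFallback(newHash uint64, legacyHash func() uint64, known func(uint64) bool) (uint64, bool) {
	if known(newHash) {
		return newHash, true
	}
	// No hit with the new method: fall back to the old 25-frame hash.
	old := legacyHash()
	if known(old) {
		return old, true
	}
	return 0, false
}

func main() {
	// Toy stand-in for a remote fingerprint database.
	db := map[uint64]bool{0xABCD: true}
	known := func(h uint64) bool { return db[h] }

	h, ok := matchWithFallback(0x1234, func() uint64 { return 0xABCD }, known)
	fmt.Printf("matched=%v hash=%#x\n", ok, h)
}
```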

I don't think we need a new hash type... like shash. Namer and other tools that generate phash can be easily upgraded.

Flashy78 commented 1 year ago

Or maybe we just leave existing phashes as they are and let things migrate organically over time: as people use the new method, they can match to existing scenes and add their new hashes.

DogmaDragon commented 1 year ago

My shortest content is PMVs, so I don't really have sub 1 minute content, but I never had problems with current pHash implementation. I think the underlying problem is not so much the duration, but the static content from fans-type sites where they show closeup of the person without any substantial changes.

Flashy78 commented 1 year ago

My shortest content is PMVs, so I don't really have sub 1 minute content, but I never had problems with current pHash implementation. I think the underlying problem is not so much the duration, but the static content from fans-type sites where they show closeup of the person without any substantial changes.

It's really the combination of both. If the video doesn't change much, the collage is very repetitive. Fans-type content has this issue, but it happens in all kinds of videos that have only a small percentage of the screen changing. Fans-type content compounds the issue simply because it is often also short duration.

If every video had lots of movement, there wouldn't be such an issue. If every video was very long, there wouldn't be such an issue (assuming there was movement at some point because it's a long video). Combine the two though and 25 frames causes an issue.

https://stashdb.org/scenes/e91e7f73-afbc-4176-af36-1dee4c5fec42#fingerprints - Many of these collisions are short videos, but not all
https://stashdb.org/scenes/4e19dff7-ec90-4a25-818b-f5594accd76d#fingerprints - Lots of these are longer videos

So my proposed change will not eliminate the issue. A blanket change lowering the default to 16 frames would help differentiate those videos, but it would also make it harder to match minor variations on all videos, and totally break existing phashes.

BonerFide commented 11 months ago

I think the number of invalid hashes is going to rather quickly make the valid ones frustrating if there isn't at least a tweak to the bulk identify feature to be able to ignore short duration hashes, both in terms of source and the target hash. I've run a bulk identify and had dozens of short clips decide they match a couple of much longer clips, which now look very polluted in terms of their matched hashes.

I would argue given the very very different nature of the clips I've seen match (eg matching a vertical clip to a horizontal clip), PHASH is very close to completely useless for small clips as it stands, and may never be fully useful, even with this change. For very short clips (eg < 1min) I imagine most are exact matches and so may do well to be matched only if the files are absolutely identical, or via a manually selected identify at most.

Keeping in mind that vast majority of videos on StashDB are greater than 2.5 minutes, this is not a huge issue.

I think this misunderstands the issue. If people have a large number of small clips, it's a huge issue for them, and their mis-matched PHASHes will start polluting longer clips, over time making smaller clips less and less likely to match.

Personally if it was an option I would just set any matching on PHASH to ignore both target and source for clips < 1min (maybe user specified, but not a bad default) when auto identifying. It may be also an option to configure default bulk identify behavior when the duration is outside a certain % difference or time from the most common PHASH/OSHASH duration.
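As a sketch of what such a filter might look like: the type and field names below are hypothetical, the 1 minute minimum comes from the suggestion above, and the 10% tolerance in the example is an arbitrary placeholder.

```go
package main

import (
	"fmt"
	"math"
)

// matchPolicy holds user-configurable limits for trusting a PHASH match.
// Hypothetical type; nothing like it is confirmed to exist in Stash.
type matchPolicy struct {
	minDurationSec float64 // ignore PHASH when either clip is shorter than this
	maxDiffFrac    float64 // allowed duration difference as a fraction of the reference
}

// acceptPhashMatch reports whether a PHASH match should be trusted.
// referenceSec is meant to be the most common duration among the
// fingerprints already attached to the target scene.
func acceptPhashMatch(p matchPolicy, sourceSec, referenceSec float64) bool {
	if sourceSec < p.minDurationSec || referenceSec < p.minDurationSec {
		return false
	}
	if referenceSec <= 0 {
		return false // avoid dividing by zero on missing durations
	}
	return math.Abs(sourceSec-referenceSec)/referenceSec <= p.maxDiffFrac
}

func main() {
	policy := matchPolicy{minDurationSec: 60, maxDiffFrac: 0.10}
	fmt.Println(acceptPhashMatch(policy, 30, 1800))   // short source clip
	fmt.Println(acceptPhashMatch(policy, 1790, 1800)) // durations agree
	fmt.Println(acceptPhashMatch(policy, 600, 1800))  // durations too far apart
}
```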

For now even the 'valid' PHASH's on short durations seem to be not much more than data pollution.

Flashy78 commented 10 months ago

I would argue given the very very different nature of the clips I've seen match (eg matching a vertical clip to a horizontal clip), PHASH is very close to completely useless for small clips as it stands, and may never be fully useful, even with this change. For very short clips (eg < 1min) I imagine most are exact matches and so may do well to be matched only if the files are absolutely identical, or via a manually selected identify at most.

Perceptual hash was developed for images, not videos. The Stash algo takes frames from the video to create an image and that image is hashed. So the duration of the video by itself is not the deciding factor on how different a phash is. It depends on the resulting image that Stash creates. So provided the image of video frames being hashed is not super repetitive, you can get perfectly unique hashes for 1 second videos.

Keeping in mind that vast majority of videos on StashDB are greater than 2.5 minutes, this is not a huge issue.

I think this misunderstands the issue. If people have a large number of small clips, it's a huge issue for them, and their mis-matched PHASHes will start polluting longer clips, over time making smaller clips less and less likely to match.

There is no pollution happening when users run Identify. Identify does not let you submit your fingerprints back to StashDB.

Personally if it was an option I would just set any matching on PHASH to ignore both target and source for clips < 1min (maybe user specified, but not a bad default) when auto identifying. It may be also an option to configure default bulk identify behavior when the duration is outside a certain % difference or time from the most common PHASH/OSHASH duration.

For now even the 'valid' PHASH's on short durations seem to be not much more than data pollution.

That is a bandaid on top of the underlying problem, which is the algo Stash uses for calculating phash, and hence the reason for my suggested improvement. Those may be valid requests for new features, but they do nothing to try and stop the problem where it starts.

BonerFide commented 9 months ago

So provided the image of video frames being hashed is not super repetitive, you can get perfectly unique hashes for 1 second videos.

The problem is that many of these short videos ARE super repetitive. They're short enough that they're all a person in the center of the frame talking to the camera against a plain white background. One person can have dozens of these that a human would have trouble distinguishing from a full resolution thumbnail, much less from what PHASH has to work with. And yes, the PHASH probably very much WILL be unique, but for obvious reasons we aren't matching PHASHes on a 100% exact match. There's some distance by which a hash is allowed to differ and still match, which 99.9% of the time is the whole point of PHASH vs just a checksum. Less often, but still very often, you will upload an image to something like Google to find similar images and end up with Google being completely sure that a totally different image is exactly the same, because there is fairly significant collision potential.
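To make the "some distance" point concrete, here is a minimal sketch of distance-based matching. It assumes the phash is a 64-bit value and compares two hashes by Hamming distance (the number of differing bits); the function names and the thresholds in the example are illustrative, not Stash's actual settings.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hammingDistance counts the bits that differ between two 64-bit hashes.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// isMatch treats two hashes as the same video when they differ in at most
// maxDistance bits. The threshold is an assumption chosen by the caller.
func isMatch(a, b uint64, maxDistance int) bool {
	return hammingDistance(a, b) <= maxDistance
}

func main() {
	a := uint64(0xF0F0F0F0F0F0F0F0)
	b := a ^ 0x7 // flip the three lowest bits
	fmt.Println(hammingDistance(a, b)) // 3
	fmt.Println(isMatch(a, b, 4))      // true
	fmt.Println(isMatch(a, b, 2))      // false
}
```

A larger allowed distance catches more re-encodes of the same video, but it also lets more unrelated, repetitive collages collide.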

Yes, it was developed for images, not videos, and that's why with videos we get MORE, not fewer, collisions than with image matches: to start with, there's some repetition because we're using one PHASH for multiple frames merged together. Because longer videos often have multiple framings, they're less likely to be repetitive, but short videos, unless they use quick cuts, are going to be repetitive by their very nature. People aren't repositioning the camera 8 times for a 15 second video.

It's great identify doesn't submit PHASHes. There are enough ones where dozens of humans have decided 'close enough' for completely different scenes.

I know certain things sound like bandaids, but the reality is PHASH is not, and never was, designed to give 100% unique PHASHes. It is itself a very very good bandaid: the less data you run through it, and the smaller your database, the better it seems.

nod44 commented 1 month ago

Hey all -- I too have noticed that short videos (especially from fan sites) are frequently mis-identified. I suggest the following approach:

Thoughts?