stashapp / metadata-api-discuss

This repo is the laziest possible way we can have threaded conversations about metadata collection and curation for StashApp
MIT License

Scene hashing and identification #6

Open WithoutPants opened 4 years ago

WithoutPants commented 4 years ago

There has been a bit of discussion on Discord about scene hashing, and I'd like to get my head around how people would expect it to work in the central db.

Stash currently uses MD5 hashing (via crypto/md5). It hashes the entire file.

I've little to no experience with file hashing, so I'm not sure how the various alternatives compare in terms of collisions and performance. There is also the topic of perceptual hashing which may possibly need its own issue to tease out.

OpenSubtitles has its own hashing algorithm that might be worth investigating.

The initial model I was targeting was for a scene to have a number of MD5 hashes associated with it, so that a scene can be identified by any of its hashes. It sounded like there may have been talk of handling hashes of different types?

Anyway, it'd be nice to get some decisions on this for an initial prototype.

Leopere commented 4 years ago

I would vote for Open Subtitles; it apparently hashes the first 64k, the last 64k, and the filesize to identify files. This could be problematic for releases with identical intros and outros, leaving the filesize as effectively the only unique data, but it might still be better than a bare MD5, or preferably at least a SHA.
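For reference, here's a minimal Go sketch of that algorithm as it's commonly described (the file size plus the wrapping little-endian uint64 sum of the first and last 64k; the special case for files smaller than 64k is left out):

package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

const chunkSize = 65536 // 64k, per the Open Subtitles description

// oshash sums the file size and the little-endian uint64 words of the
// first and last 64k of the file, letting the addition wrap naturally.
func oshash(path string) (uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		return 0, err
	}
	if fi.Size() < chunkSize {
		return 0, fmt.Errorf("file smaller than %d bytes", chunkSize)
	}

	hash := uint64(fi.Size())
	buf := make([]byte, chunkSize)

	sumChunk := func() error {
		if _, err := io.ReadFull(f, buf); err != nil {
			return err
		}
		for i := 0; i < chunkSize; i += 8 {
			hash += binary.LittleEndian.Uint64(buf[i:])
		}
		return nil
	}

	if err := sumChunk(); err != nil { // first 64k
		return 0, err
	}
	if _, err := f.Seek(-chunkSize, io.SeekEnd); err != nil {
		return 0, err
	}
	if err := sumChunk(); err != nil { // last 64k
		return 0, err
	}
	return hash, nil
}

func main() {
	h, err := oshash(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%016x\n", h)
}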

Leopere commented 4 years ago

As for the other discussion regarding similar scene data, we were looking at various perceptual hashing options: options for comparing content across transcodes and potentially even identifying content in compilations, which would be incredible for people trying to identify performers in compilations with no cited sources.

However, perceptual hashing is still a fairly new research field, so the options are limited. pHash was one that was mentioned, along with TMK by Facebook, which they use to identify offensive or illegal content for their ThreatDB. TMK might be the ideal option, but it seems suited to confirmation rather than lookup. I am not an expert in perceptual hashing, so I don't know whether a pHash covers the whole video efficiently, or whether there's a good open-source lookup methodology for scenes.

WithoutPants commented 4 years ago

I have another proposal:

The issue with hashing the entire file, per stash's current implementation, and with OpenSubtitles' approach, is that merely changing the file's metadata is enough to change the resulting hash under both algorithms.

I propose that we use ffmpeg's md5 muxer, feeding it only the video stream. This means that we are only concerned with the video content, not the audio, subtitles, data or metadata. We may also choose to only hash a selection of the video like the Open Subtitles algorithm: select the first, last and middle x seconds and hash the combination.

I found that running the md5 muxer was reasonably slow when running for the whole video. I scaled the video stream to 32x32 (from here) and got a result a lot quicker.

ffmpeg -i "input" -an -sn -dn -vf scale=32:32 -f md5 -
MD5=337ac1dfefed834b814c27fe7690477e
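
For anyone wanting to script the comparison, a small Go wrapper around that command might look like this (assuming ffmpeg is on PATH; the helper name is mine):

package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// videoStreamMD5 shells out to ffmpeg, dropping audio (-an), subtitle
// (-sn) and data (-dn) streams so only the decoded, 32x32-scaled video
// frames feed the md5 muxer, then parses the "MD5=..." line it prints
// to stdout (ffmpeg's log output goes to stderr and is ignored here).
func videoStreamMD5(path string) (string, error) {
	out, err := exec.Command("ffmpeg",
		"-i", path,
		"-an", "-sn", "-dn",
		"-vf", "scale=32:32",
		"-f", "md5", "-",
	).Output()
	if err != nil {
		return "", err
	}
	line := strings.TrimSpace(string(out))
	if !strings.HasPrefix(line, "MD5=") {
		return "", fmt.Errorf("unexpected ffmpeg output: %q", line)
	}
	return strings.TrimPrefix(line, "MD5="), nil
}

func main() {
	sum, err := videoStreamMD5(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(sum)
}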

I also propose that we request a parity token for submitted hashes, to reduce the possibility of garbage data. This will require extra work for the client to calculate, but is a bigger barrier for submitting garbage checksums.

Ch00nassid commented 4 years ago

I would love to help test this. Could we all test a common/popular file?

bnkai commented 4 years ago

I've tried the Open Subtitles hash function (C and Go versions) and I have to say it's very fast and doesn't seem to produce duplicates, even with identical intro/ending scenes. Apart from reading the file's binary data at the start and end, it also adds the filesize to the hash, so that might be why. @WithoutPants the muxer looks interesting, but can we test a common file as @Ch00nassid proposed, to make sure different versions of ffmpeg produce the same result? From what I understand it decodes to raw and then calculates the MD5 of that stream.

I see a list of videos here, for example: https://gist.github.com/jsturgis/3b19447b304616f18657

ghost commented 4 years ago

Even if two scenes have the exact same intro/outro, I find it highly unlikely that the OSO hash will produce a collision. The video header contains all kinds of info: frame count, bitrate, size, encoder version, streams, metadata, creation date, etc. That should be plenty to guarantee uniqueness.

Another advantage of OSO is that it's an established algorithm with implementations in any language you can imagine, and it doesn't require ffmpeg. It is somewhat resistant to corruption, since it simply ignores the majority of the data. Most importantly, it's also lightning fast.

Regarding hashing the video stream itself, I've had the same thought, and I think it's a good idea. The only thing I would want to test is how well it works with modifying metadata and remuxing the video stream.

There's also the question of whether we'd want to store straight SHA-1 hashes of the entire file. This would be useful for validating file integrity, which neither of the other two alternatives can do.
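
For what it's worth, a whole-file SHA-1 in Go is just crypto/sha1 plus io.Copy; a minimal sketch:

package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// fileSHA1 streams the entire file through SHA-1, which covers the
// file-integrity use case described above.
func fileSHA1(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha1.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	sum, err := fileSHA1(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(sum)
}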

bnkai commented 4 years ago

IMHO the more hashes we support the better; it doesn't hurt to have a few more fields and to match on whatever is available. MD5 and SHA-1 are well tested and could be useful, the only disadvantage being the time needed to read the whole file. If the user is enticed with a feature like file-integrity validation, for example, they could opt into either of those without minding the extra time.

Leopere commented 4 years ago

I came across https://privatezero.github.io/amiapresentation2017/ , which gives a quick 101 on perceptual hashing with FFMPEG.
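
One readily available option in that space is FFmpeg's MPEG-7 video signature filter; something like this (filenames are just examples) writes a fingerprint to XML without producing any output video:

ffmpeg -i input.mp4 -map 0:v -vf signature=format=xml:filename=signature.xml -f null -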

cooperdk commented 4 years ago

FFMPEG hashing is far too slow; it simply takes too long. The same goes for perceptual hashing, which is even slower, at least if you hash the full video: it takes 5-10 times the length of the video to calculate. You can select frames from the video and hash those, but the result would differ anyway if, e.g., the intro is cut. The Opensubtitles hash is as good as any visual hash system for that exact reason, plus it hashes 2-4 files per second.

ghost commented 4 years ago

The challenge with oshash, or any other hash, is that it doesn't recognize reencodes, so you'll end up with dozens of hashes for almost everything. WithoutPants has implemented a dupe detector based on a perceptual hash of the sprite, which should recognize reencodes and different resolutions without issue. It might not work if the length is different, but it would still be immensely useful. I'm very keen on trying it out in stash-box once time allows.
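
Not the actual stash-box implementation, but to illustrate the idea: a perceptual hash of a sprite can be as simple as an average hash, where you downsample to a tiny grayscale grid, set a bit per pixel brighter than the mean, and compare fingerprints by Hamming distance. A rough Go sketch, helper names mine:

package main

import (
	"fmt"
	"image"
	"image/color"
	"math/bits"
)

// aHash reduces an image to an 8x8 grayscale grid (nearest-neighbour
// sampling here; a real implementation would resize properly) and sets
// one bit per pixel brighter than the mean, giving a 64-bit fingerprint.
func aHash(img image.Image) uint64 {
	const n = 8
	b := img.Bounds()
	var px [n * n]uint64
	var mean uint64
	for y := 0; y < n; y++ {
		for x := 0; x < n; x++ {
			sx := b.Min.X + x*b.Dx()/n
			sy := b.Min.Y + y*b.Dy()/n
			g := color.GrayModel.Convert(img.At(sx, sy)).(color.Gray)
			px[y*n+x] = uint64(g.Y)
			mean += uint64(g.Y)
		}
	}
	mean /= n * n
	var h uint64
	for i, v := range px {
		if v > mean {
			h |= 1 << uint(i)
		}
	}
	return h
}

// distance is the Hamming distance between two hashes; small values
// suggest the same content across re-encodes or resolutions.
func distance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	// Smoke test on a synthetic horizontal gradient.
	img := image.NewGray(image.Rect(0, 0, 64, 64))
	for y := 0; y < 64; y++ {
		for x := 0; x < 64; x++ {
			img.SetGray(x, y, color.Gray{Y: uint8(4 * x)})
		}
	}
	h := aHash(img)
	fmt.Printf("hash=%016x self-distance=%d\n", h, distance(h, h))
}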