Perceptual hash extensions

pozieuto commented 4 years ago

I end up with a lot of false negatives for some alternate types when using the existing perceptual hash system, so I've been considering how to expand upon it.

My understanding of the current system, based on resources linked by HyDev in the past, is that it does these three things in some order:

Value reduction: reduce all pixels to a single value (ie. make the image opaque greyscale)
Size reduction: scale the image down to 8x8 (64 pixels), disregarding the aspect ratio, by averaging the pixel values together.
Criterion determination: determine a value to compare the 64 sections against, by averaging either all the pixels in the image or just the 64 values (not sure which Hydrus does).

The final step is comparing the 64 values against the criterion to assign them a binary value depending on if they're smaller or larger.

Presently, I believe Hydrus is limited to averaging for the size reduction, greyscaling for the value reduction, and averaging again for the criterion determination. But there are other operations that could be done for these steps instead, which might prove more useful in detecting certain types of alternates:

Value reduction
- Reduce RGB channels to one value, discarding the alpha channel (current?)
- Use one of the RGB channels by itself (most useful in a search that combines these three hash searches with different hamming distances to "weigh" the components differently)
- Use the alpha channel
- Use the saturation or chroma level
- Use the value or lightness level
Size reduction: calculation
- Average the values (current?)
- Determine the median of the values
- Use the maximum value
- Use the minimum value
Size reduction: dimensions
- Stretch to 8x8 (current?)
- Reduce to whichever size of 2ⁿ x 2⁶⁻ⁿ (n=0-6) is closest to the image's aspect ratio
Criterion determination: data
- Use all the pixels in the image
- Use the 64 values
Criterion determination: calculation
- Average the values (current?)
- Determine the median of the values

Not all of these combinations would necessarily be valuable, but I believe at least some of these would be effective in practice for certain types of alternates, so having them available as options for advanced searches would be useful. Of course, something very important to consider is that all combinations of these variants would produce a lot of different hashes - far more than is reasonable to store for every image. The searching functionality would need to be expanded upon to make this useful.

Expanded UI: Ideally, users would be able to determine what perceptual hash methods work best for certain data by taking a set of known duplicates or alternates and trying out multiple different hash searches on it (either separately or in combination), with Hydrus showing a table or bar chart of how many matches are being made at each hamming distance, as well as showing visual representations of the hashes the images generate for reference.
Match inversions: Users should also be able to search such that every search takes the lowest hamming distance between both a normal search and a search where one of the hashes has its bits flipped (equivalent to considering both minimal and maximal hamming distances for a normal search), as for some of these search types, hashes that are as different as possible may simply represent an inversion.
Match transformations: Being able to compare rotated and flipped hashes for finding transformed versions would also be useful, though this wouldn't work when the size reduction dimensions don't match. Maybe those dimensions could just be stored with the hash for these cases so transformation searches are only done where it makes sense (the seven possibilities fit nicely into 3 bits). The filter would need added functionality to transform the previews so they line up.
Bitmasked search: Very specialized searches could benefit from being able to only consider certain bits of the hash as important, or set different hamming distances for different areas.

Other hash types may also be feasible. My two ideas here are for hue and "features".

For a hue hash: Since the hue is a "circular" value space rather than a line with a discrete start and end, any usual calculations like an "average" would be worthless. Instead, the most "dominant" hue needs to be determined somehow, such that two alternates are more likely to have similar hashes (assuming their dominant hues closely match). Having every pixel in the image add to the score of its hue relative to how saturated or transparent it is (so fully unsaturated or transparent pixels contribute nothing) would produce an approximate frequency distribution of perceived hues, on which some statistical analysis needs to be done to determine a dominant value. What seems most intuitive to me would be to smooth the dataset using kernel density estimation and use the hue at highest point of the resulting curve. Then, imagining the hue spectrum as a "colour wheel", slice it with a straight line through the midpoint with an angle matching that of the tangent at the dominant hue's position on the circle. We now have two equal sections of the hue spectrum that we can use to assign binary values to different segments of the image as before. This division could also be done such that the two points dividing the hue spectrum remain equidistant to the dominant hue but are adjusted such that the two sections of the hue spectrum capture approximately half of the hue frequency distribution (making the two sections most likely represent different amounts of the hue spectrum, by making a "V" shaped slice in the "wheel"). Then for each of the 64 sections of the image, determine the dominant hue (like before), and assign a binary value for that section based on which of the two sections determined earlier of the hue spectrum that its dominant hue falls into. This step would also benefit from some kind of threshold to account for transparency and chroma noise such that if there's hardly any or no colour to the section at all, then it is assigned the binary value corresponding to the section of the hue spectrum that does not include the image's dominant hue.

For a "feature" hash: Before performing any reductions, run an edge detection or phase stretch transformation algorithm on the image to produce an image of its "features", and then do a specialized perceptual hash of that.

Currently all of this is theoretical, and I haven't done any implementations or tests to determine the applicability of these different ideas in practice.

bbappserver commented 4 years ago

I would instead recommend that an API endpoint be exposed to create.remove a potential pair and that third parties be permitted to write their own phashing.

False negatives should be exceptionally rare, but if they aren't hydru's last step should be a straight compare after the broad phase pHash.

pozieuto commented 4 years ago

The idea of the perceptual hashing system is to be performant while also being reasonably accurate. Full-image comparisons are prohibitively intensive for large sets of media by comparison. Pumping image data through an API is also not as performant as natively doing hashes and searches along with other maintenance, I would think. The hashes also need to be stored in Hydrus' database for them to be useful in practice.

DonaldTsang commented 4 years ago

Existing research: https://github.com/thorn-oss/perception @pozieuto take a look at some of the new algorithms they have.

hydrusnetwork / hydrus

Perceptual hash extensions #267