nitotm opened 4 years ago

The matrix is 32x32, are the first 8x8 ordered in any relevant way?
Also, which hash is performing better for you?
I don't think only the first 8x8 is taken. The perceptual hash one is not fully ready yet. The others work pretty well.
What I understand is: the final hash is the first 8x8 of the matrix, with each value compared against the median or average (of the same 8x8), which gives a 0 or 1 per position. But the matrix is 32x32.
The whole 32x32 matrix has to be calculated *(NOT TRUE, findings in the next post) to produce good results, and the first 8x8 also gives the best results, according to my empirical testing, not because I understand how the matrix is built (I'm still reading about the Discrete Cosine Transform to figure it out).
I was trying to figure out whether it's worth storing more of the "hash" from the matrix, i.e. whether the values beyond the 8x8 could be relevant, since they are already calculated. From what I understand now, it is not worth it, unless you want a second hash with less relevant, more specific data points to further order by similarity the images matched by the first hash.
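To make that concrete, this is a minimal sketch of the comparison as I understand it (my reading, not the library's exact code; $matrix is assumed to hold the 32x32 DCT coefficients, and median() is just a helper for illustration):

// Sketch: derive the 64 hash bits from the top-left 8x8 block of the
// 32x32 DCT $matrix, thresholding against the block's median.
function median(array $values)
{
    sort($values);
    $mid = (int) floor(count($values) / 2);
    return (count($values) % 2) ? $values[$mid]
                                : ($values[$mid - 1] + $values[$mid]) / 2;
}

$block = [];
for ($y = 0; $y < 8; $y++) {
    for ($x = 0; $x < 8; $x++) {
        $block[] = $matrix[$y][$x];
    }
}
$threshold = median($block); // or array_sum($block) / 64 for the average variant
$bits = '';
foreach ($block as $coefficient) {
    $bits .= ($coefficient > $threshold) ? '1' : '0'; // one bit per coefficient
}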
What is the perceptual hash lacking to be ready? According to my tests it is the best performer. I'm probably going to use it; I might actually integrate it into my app within a couple of days.
I didn't see why the whole 32x32 matrix was being calculated. Earlier I had checked (incorrectly) that it was necessary, at least for the code to produce the same results, so I tried again.
In short, I'm now producing only an 8x8 matrix, which gives the exact same final hash, AND I reduced the total execution time by more than 75%: from 210ms to 50ms for an image, or from 12s to 2.8s for my entire benchmark test.
I'm just doing:

$matrix_size = 8;
...
$rows[$y] = $this->calculateDCT($row, $matrix_size);
...
for ($x = 0; $x < $matrix_size; $x++) {
...
$matrix[$x] = $this->calculateDCT($col, $matrix_size);
...
function calculateDCT(array $matrix, $this_size = false) {
...
$size_i = ( $this_size ?: $size );
for ($i = 0; $i < $size_i; $i++) {
...
$pixels = array_merge(...$matrix); // Removing the pixels 'for' loop, this is faster
After these findings I'm not confident about how the DCT matrix is built.
UPDATE: I can confirm the DCT is correct. I tested it with an example on Wikipedia and got the correct results; for example, for column A I got (you have to add the raw r+g+b colors): https://en.wikipedia.org/wiki/File:Dct-table.png
[0] => 6.1917162
[1] => 0.2205060937849
[2] => 1.0422963188867
[3] => -0.23402753571438
[4] => 0.2750022
[5] => 0.065252207972911
[6] => 0.31692732314466
[7] => -0.29696260903528

https://en.wikipedia.org/wiki/Discrete_cosine_transform
Well, I made some heat maps of the matrix and they showed the expected result, so I'm guessing the DCT is working okay. It's supposed to give a top-left to bottom-right diagonal progression.
Also note that the first bit seems to always be 1; I don't know if that holds theoretically, but it does in practice. So it can be discarded, or changed to a 0 (making it easier to handle in software without unsigned 64-bit integers), or we can add a 65th bit and shift everything one place. At phash.org they start the matrix at 2x2 (1x1 if we start at 0) instead of 1x1, discarding the first row and column entirely, including entries like 1x8 which are not really low frequency. The idea was to discard the very low frequency data, since it's less useful, but as said, the frequency is supposed to advance diagonally. I have to say that in some heat maps I did observe lower frequencies on row and column 1, but not enough to justify starting the matrix at 2x2. My view on this: discard 1x1 for sure, and maybe 1x2 and 2x1, but not more.
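For the "change it to 0" option, a minimal sketch (assuming $bits is the 64-character binary hash string with the DC bit first):

// Force the first bit to 0 so the value fits PHP's signed 64-bit integers;
// bindec() then stays exact instead of overflowing to float.
$bits[0] = '0';
$hash = bindec($bits);
// Alternatively, discard $bits[0] entirely and take one extra coefficient
// (a "65th" bit) to stay at 64 bits.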
As I said, the order of the matrix is relevant; it is not optimal to just pick the first 8x8 from the 32x32 matrix. The reason for the 8x8 in the first place is to get the lower frequencies of the DCT matrix (maybe discarding the super low ones), because the higher ones are not good for our intentions. With that in mind, if the matrix frequency advances diagonally, it only makes sense to order it diagonally. This provides:
More information in our 8 bytes of data: not only do the bits carry information, the order does too. More bang for our buck.
The possibility to partition a long hash, knowing the first bits are the most likely to match other similar images. Not something everybody will apply in some form, but now it's a possibility.
Finally, and most importantly, a difference in the first bits is more relevant, so you could apply a formula that adds more weight to earlier differences. You might not be able to do that in an SQL query, but it can be done on a result set of hashes (see the sketch below).
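A hypothetical sketch of that weighting idea (the formula is just an example, assuming the bits are in diagonal order, lowest frequency first):

// Weighted Hamming distance: bit differences earlier in the diagonal
// order count more. $a and $b are 64-character binary hash strings.
function weightedDistance($a, $b)
{
    $distance = 0.0;
    for ($i = 0; $i < 64; $i++) {
        if ($a[$i] !== $b[$i]) {
            // Linear decay: a difference in the first bit counts double
            // compared to a difference in the last bit.
            $distance += 1.0 + (63 - $i) / 63;
        }
    }
    return $distance;
}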
The green and yellow 32x32 heatmap image is the one I made.
// Walks the matrix in zigzag (JPEG-style) diagonal order, lowest
// frequencies first. With $half=true it stops after the upper-left
// triangle, i.e. ceil(size^2/2 + size/2) elements.
function diagonalMatrix($mat, $size = 11, $half = true)
{
    $mode = 0;
    $lower = 0;
    $result = [];
    $max = ( $half ? ceil((($size * $size) / 2) + ($size * 0.5)) : 0 );

    for ($t = 0; $t < (2 * $size - 1); $t++) {
        $t1 = $t;
        if ($t1 >= $size) {
            // Past the main anti-diagonal: clamp to the last row/column
            // and start each diagonal one step further in.
            $mode++;
            $t1 = $size - 1;
            $lower++;
        } else {
            $lower = 0;
        }
        for ($i = $t1; $i >= $lower; $i--) {
            if ($half && count($result) >= $max) {
                return $result;
            }
            if (($t1 + $mode) % 2 == 0) {
                // Even diagonals run bottom-left to top-right...
                $result[] = $mat[$i][$t1 + $lower - $i];
            } else {
                // ...odd diagonals run top-right to bottom-left.
                $result[] = $mat[$t1 + $lower - $i][$i];
            }
        }
    }
    return $result;
}
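To illustrate the traversal order, a quick usage example on a 3x3 matrix:

$m = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
];
print_r(diagonalMatrix($m, 3, false));
// 1, 2, 4, 7, 5, 3, 6, 8, 9 — the classic JPEG-style zigzag
print_r(diagonalMatrix($m, 3));
// 1, 2, 4, 7, 5, 3 — with $half, only the upper-left triangle
// (ceil(9/2 + 1.5) = 6 elements) is returned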
My initial version of the phash was a port of some existing code I found. Looks like you've got quite a good understanding of how the algorithm works! Do you see code improvements you could make?
Apart from what was already discussed (the unnecessary calculations, the diagonal order of the final hash, and discarding the first bit), I'm now looking at the median/average method. I was ready to use the median and almost finished the job here, because it gives me better results than the average and it's what's used at phash.org. But I realized that with the median we are not really getting all 2^64 possibilities; it's inherent in the way the algorithm works. It happens with the average too, but much more with the median, because 50% of the bits will always be 0 and the other 50% will be 1. So we are not getting the full range that 64 bits can offer (not that I'm aiming for 100%, but I think it could be better), which increases the possibility of unwanted collisions.
I'm trying to think of a solution that would provide more entropy without damaging the efficacy of finding similar images, but I don't know if it's possible. I tried calculating against a transposed matrix (multiply and change sign), but what I expected happened: I gained entropy but lost too much accuracy, dropping from 82% to 77%, while both the standard average and median are above 82%. So I'm not sure what to try next, or if there's even a way.
That, or use the average; but for that I will need to increase the size of my benchmark to make a more solid decision.
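For reference, the two thresholds being compared are simply the following (a sketch; $pixels holds the 64 selected DCT coefficients):

// The two thresholding variants under comparison.
$avgThreshold = array_sum($pixels) / count($pixels);

$sorted = $pixels; // sort a copy, keep $pixels in diagonal order
sort($sorted);
$mid = (int) floor(count($sorted) / 2);
$medianThreshold = (count($sorted) % 2)
    ? $sorted[$mid]
    : ($sorted[$mid - 1] + $sorted[$mid]) / 2;

// With the median threshold, half the bits come out 0 and half 1 in every
// hash, which is the entropy limitation described above; the average does
// not pin the bit counts that way.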
Well, I decided to just use the average. I believe that in the long run it will be the better choice, and its accuracy results are very similar to the median's. Regarding the further changes I made: I'm now partially calculating an 11x11 matrix (the 11 rows), but for the final matrix I only calculate a little more than half of it. To do this, before the loop where the matrix is created, add:
$row_matrix_size = $matrix_size;
...
$matrix[$x] = $this->calculateDCT($col, $row_matrix_size);
$row_matrix_size--;
...
$pixels = diagonalMatrix($matrix, $matrix_size);
$pixels = array_slice($pixels, 1, 64); // I discard the first one and cut to size, since we have 66 points here
By the way, the second bit is '1' in 60% of cases, in case anyone wants to discard it too.
And that's it. Here is a heat map of the 11x11 matrix and then the only portion I calculate. To clarify: we are really making a 32x32 DCT matrix, but I only calculate what's necessary to get the data we want from it.
Well, once I integrated it I realized it was slower than I thought; compared with the other image manipulations my application does, it was a lot. After the >75% improvement that took me from 210ms to 50ms, the diagonal 11x11 work brought me up to 58ms. So I pre-calculated everything in the DCT function, and even built an array for the cosines; that took me from 58ms to 28ms. I still wanted to go lower, and wanted to work with a 22-26px image, but I realized that some cosines are very near 0 (basically 0), and when the DCT is calculated they make the information multiplied by them almost worthless. It still works, and it's only a few values, but it's not the ideal matrix. Matrices of size 16, 32 and 64 don't have any zero cosines; every other matrix size does. As it happens, 16x16 performed better than 26px and very, very close to 32px, so I went with the evidence: I'm using 16x16, I'm at 14ms, and with this 93% improvement from the start, I'm done.
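Anyone can verify the zero-cosine observation with a quick loop (a sketch; the sizes tested are just examples):

// For which sizes N does cos(i * pi * (j + 0.5) / N) land
// (numerically) on zero?
foreach ([16, 22, 26, 32, 64] as $n) {
    $zeros = 0;
    for ($i = 1; $i < $n; $i++) {
        for ($j = 0; $j < $n; $j++) {
            if (abs(cos($i * M_PI * ($j + 0.5) / $n)) < 1e-9) {
                $zeros++;
            }
        }
    }
    print "N=$n: $zeros zero cosines\n"; // 16, 32 and 64 report none; 22 and 26 report some
}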
PS (finally discarded, see below): I did one last change. I applied an "inverse" multiplication table I built from the heat tables, and got a small but consistent improvement: up to +0.8% similarity accuracy, and down 1% (0.5% absolute) on different images. I wasn't sure how that would turn out; I was afraid it would "damage" the data the DCT calculates, but apparently it turned out okay. Important to note: I do this before calculating the average.
@smithtek, could you please share the code and explain how to use it?
Hi, which part?
I finally discarded the "inverse multiplication table"; I was not sure about it, and in new benchmarks it performed worse.
If you are talking about the "pre-calculated" part, yes, I could share it, but it's just calculating things in advance to gain efficiency.
Other than that, with the fragments I shared you can build the same system. I guess I could paste it all together, but keep in mind I changed some things from your code.
Here is a compilation of everything; you may need to make some small changes for it to work with your existing code.
In the __construct() function, add:
$this->size_sqrt = sqrt(2 / $size);
$this->dct_11[32]=[ // Important to use key 32, so that if somebody changes the size without precalculating a matching table, it will show an error
[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
[0.99879546,0.98917651,0.97003125,0.94154407,0.90398929,0.85772861,0.80320753,0.74095113,0.67155895,0.5956993,0.51410274,0.42755509,0.33688985,0.24298018,0.14673047,0.04906767,-0.04906767,-0.14673047,-0.24298018,-0.33688985,-0.42755509,-0.51410274,-0.5956993,-0.67155895,-0.74095113,-0.80320753,-0.85772861,-0.90398929,-0.94154407,-0.97003125,-0.98917651,-0.99879546],
[0.99518473,0.95694034,0.88192126,0.77301045,0.63439328,0.47139674,0.29028468,0.09801714,-0.09801714,-0.29028468,-0.47139674,-0.63439328,-0.77301045,-0.88192126,-0.95694034,-0.99518473,-0.99518473,-0.95694034,-0.88192126,-0.77301045,-0.63439328,-0.47139674,-0.29028468,-0.09801714,0.09801714,0.29028468,0.47139674,0.63439328,0.77301045,0.88192126,0.95694034,0.99518473],
[0.98917651,0.90398929,0.74095113,0.51410274,0.24298018,-0.04906767,-0.33688985,-0.5956993,-0.80320753,-0.94154407,-0.99879546,-0.97003125,-0.85772861,-0.67155895,-0.42755509,-0.14673047,0.14673047,0.42755509,0.67155895,0.85772861,0.97003125,0.99879546,0.94154407,0.80320753,0.5956993,0.33688985,0.04906767,-0.24298018,-0.51410274,-0.74095113,-0.90398929,-0.98917651],
[0.98078528,0.83146961,0.55557023,0.19509032,-0.19509032,-0.55557023,-0.83146961,-0.98078528,-0.98078528,-0.83146961,-0.55557023,-0.19509032,0.19509032,0.55557023,0.83146961,0.98078528,0.98078528,0.83146961,0.55557023,0.19509032,-0.19509032,-0.55557023,-0.83146961,-0.98078528,-0.98078528,-0.83146961,-0.55557023,-0.19509032,0.19509032,0.55557023,0.83146961,0.98078528],
[0.97003125,0.74095113,0.33688985,-0.14673047,-0.5956993,-0.90398929,-0.99879546,-0.85772861,-0.51410274,-0.04906767,0.42755509,0.80320753,0.98917651,0.94154407,0.67155895,0.24298018,-0.24298018,-0.67155895,-0.94154407,-0.98917651,-0.80320753,-0.42755509,0.04906767,0.51410274,0.85772861,0.99879546,0.90398929,0.5956993,0.14673047,-0.33688985,-0.74095113,-0.97003125],
[0.95694034,0.63439328,0.09801714,-0.47139674,-0.88192126,-0.99518473,-0.77301045,-0.29028468,0.29028468,0.77301045,0.99518473,0.88192126,0.47139674,-0.09801714,-0.63439328,-0.95694034,-0.95694034,-0.63439328,-0.09801714,0.47139674,0.88192126,0.99518473,0.77301045,0.29028468,-0.29028468,-0.77301045,-0.99518473,-0.88192126,-0.47139674,0.09801714,0.63439328,0.95694034],
[0.94154407,0.51410274,-0.14673047,-0.74095113,-0.99879546,-0.80320753,-0.24298018,0.42755509,0.90398929,0.97003125,0.5956993,-0.04906767,-0.67155895,-0.98917651,-0.85772861,-0.33688985,0.33688985,0.85772861,0.98917651,0.67155895,0.04906767,-0.5956993,-0.97003125,-0.90398929,-0.42755509,0.24298018,0.80320753,0.99879546,0.74095113,0.14673047,-0.51410274,-0.94154407],
[0.92387953,0.38268343,-0.38268343,-0.92387953,-0.92387953,-0.38268343,0.38268343,0.92387953,0.92387953,0.38268343,-0.38268343,-0.92387953,-0.92387953,-0.38268343,0.38268343,0.92387953,0.92387953,0.38268343,-0.38268343,-0.92387953,-0.92387953,-0.38268343,0.38268343,0.92387953,0.92387953,0.38268343,-0.38268343,-0.92387953,-0.92387953,-0.38268343,0.38268343,0.92387953],
[0.90398929,0.24298018,-0.5956993,-0.99879546,-0.67155895,0.14673047,0.85772861,0.94154407,0.33688985,-0.51410274,-0.98917651,-0.74095113,0.04906767,0.80320753,0.97003125,0.42755509,-0.42755509,-0.97003125,-0.80320753,-0.04906767,0.74095113,0.98917651,0.51410274,-0.33688985,-0.94154407,-0.85772861,-0.14673047,0.67155895,0.99879546,0.5956993,-0.24298018,-0.90398929],
[0.88192126,0.09801714,-0.77301045,-0.95694034,-0.29028468,0.63439328,0.99518473,0.47139674,-0.47139674,-0.99518473,-0.63439328,0.29028468,0.95694034,0.77301045,-0.09801714,-0.88192126,-0.88192126,-0.09801714,0.77301045,0.95694034,0.29028468,-0.63439328,-0.99518473,-0.47139674,0.47139674,0.99518473,0.63439328,-0.29028468,-0.95694034,-0.77301045,0.09801714,0.88192126]
];
You precalculate this array "manually" with the following snippet; you don't need it after that. I would keep it in a comment, in case somebody wants to calculate the 16 or 64 tables:
print '$dct_11[32]=['."\n";
for ($i = 0; $i < 11; $i++) {
    print ($i ? ",\n" : '').'[';
    for ($j = 0; $j < 32; $j++) {
        print ($j ? ',' : '').round(cos($i * M_PI * ($j + 0.5) / 32), 8);
    }
    print "]";
}
print "\n];";
Then, in the for loops (keep in mind I'm using Imagick directly):
$matrix_size = 11;

// Grayscale the image row by row, then DCT each row: all 32 rows,
// but only the first 11 coefficients of each.
for ($y = 0; $y < $this->size; $y++) {
    for ($x = 0; $x < $this->size; $x++) {
        $ipixel = $imagick->getImagePixelColor($x, $y);
        $rgb = $ipixel->getColor();
        // Luminance from the RGB channels
        $row[$x] = (int) floor(($rgb['r'] * 0.299) + ($rgb['g'] * 0.587) + ($rgb['b'] * 0.114));
    }
    $rows[$y] = $this->calculateDCT($row, $matrix_size);
}

// DCT the columns, computing one coefficient fewer per column:
// only the upper-left triangle of the 11x11 matrix is needed.
$row_matrix_size = $matrix_size;
for ($x = 0; $x < $matrix_size; $x++) {
    for ($y = 0; $y < $this->size; $y++) {
        $col[$y] = $rows[$y][$x];
    }
    $matrix[$x] = $this->calculateDCT($col, $row_matrix_size);
    $row_matrix_size--;
}

$pixels = diagonalMatrix($matrix, $matrix_size);
$pixels = array_slice($pixels, 1, 64); // discard the first and cut to size
$compare = $this->average($pixels);
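From there, the hash bits come out the usual way (a sketch; I'm assuming average() just returns the mean of $pixels):

// Threshold each coefficient against $compare to get the 64 hash bits,
// in diagonal (lowest-frequency-first) order.
$bits = '';
foreach ($pixels as $pixel) {
    $bits .= ($pixel > $compare) ? '1' : '0';
}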
Then the two functions you need: diagonalMatrix() is already pasted in a previous comment, and here is calculateDCT():
protected function calculateDCT($matrix, $partial_size)
{
    // Precalculated cosine table, keyed by size.
    $dct_cos = $this->dct_11[$this->size];
    $transformed = [];
    // Only the first $partial_size coefficients are computed.
    for ($i = 0; $i < $partial_size; $i++) {
        $sum = 0;
        for ($j = 0; $j < $this->size; $j++) {
            $sum += $matrix[$j] * $dct_cos[$i][$j];
        }
        $sum *= $this->size_sqrt; // sqrt(2 / size) normalization
        if ($i === 0) {
            $sum *= 0.70710678118655; // 1/sqrt(2) for the DC term
        }
        $transformed[$i] = $sum;
    }
    return $transformed;
}
@smithtek please review #58 if you have time. I applied all your suggestions and made a separate implementation with them. Also, I generated DCTs for 16x and 64x. It would be great if you could validate them, because the generator code was for 32x and I just replaced all the 32s in it.
The generated DCTs look good, I checked. I probably won't have time to test your implementation.
@smithtek Sorry, but why didn't you just create a pull request with all the changes, instead of describing all your steps here?
I'm currently testing the PR made by @b1rdex using the code from @smithtek, I will report any progress here.
FYI, we have used #58 in production since I created the PR, on 2-5M images; the exact number depends on how many goods are currently available to buy. We haven't detected any issues with it.
Also, I see far fewer WTF similarities, even at higher distances.
Take this pair for example: the two images are actually very similar.
Still, here's a funny duplicate:
At higher distances it still finds some similar pictures, but consistency tends to drop around 11~12. For this example, I don't understand how the two images on the top can be separated by the same distance as the two images on the bottom...
I think #58 is really good as-is, no need for additional improvements.
There are additional thoughts concerning collisions in the discussion at #27.