pwlmaciejewski / imghash

Perceptual image hashing for Node.js
MIT License

Two different pictures, with the same hash? #46

Closed snoopy83101 closed 4 years ago

snoopy83101 commented 4 years ago

http://file.geeknt.com/upload/20200612/4a5a7f5c-5417-4c53-bf1c-23dcf2b12c24.jpg http://file.geeknt.com/upload/20200612/3512a09d-4f81-4518-9391-a253e631657f.jpg

imghash
  .hash(o.path, 4, "binary")
  .then((hash) => {
    resolve(hash); // 1100110011001100
  })
  .catch((e) => {
    reject(e);
  });

why?

I want the hashes to be unique, and I don't want hash generation to use too many system resources. How can I do that?

pwlmaciejewski commented 4 years ago

Hi @snoopy83101

Increasing the "bits" parameter should do the trick. Please try this:

imghash
  .hash(o.path, 8, "binary")
  .then((hash) => {
    resolve(hash);
  })
  .catch((e) => {
    reject(e);
  });

It will return a longer hash with a higher resolution. Hashes for the two images should no longer be the same either.
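
To check how far apart two hashes actually are, rather than only whether they are equal, one option is to count the differing bits (the Hamming distance). The following is a minimal sketch, not part of imghash: the helper name `hammingDistance` is made up for illustration, and `hashB` is a hypothetical value rather than a real hash of either image.

```javascript
// Count the positions at which two equal-length binary hash strings differ.
function hammingDistance(a, b) {
  if (a.length !== b.length) {
    throw new Error("Hashes must have the same length");
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}

const hashA = "1100110011001100"; // 4-bit result reported in this thread
const hashB = "1100110011001110"; // hypothetical value, for illustration only

console.log(hammingDistance(hashA, hashA)); // identical hashes
console.log(hammingDistance(hashA, hashB)); // hashes differing in one position
```

A distance of zero means the two hashes collide; small distances suggest the images look alike to the algorithm at that resolution.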

snoopy83101 commented 4 years ago

@pwlmaciejewski Thanks for the reply. With a length of 8, the two pictures do return different hashes. But if I have ten thousand pictures, is the probability of a collision still the same?

pwlmaciejewski commented 4 years ago

@snoopy83101 It depends heavily on the similarity of the pictures. If avoiding collisions is your priority, use a very high bit length, e.g. 256 or more. Collisions can never be ruled out completely, but with long hashes you should be relatively safe.
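
For a rough feel of how hash length and image count interact, here is a back-of-the-envelope sketch using the birthday bound. It assumes hashes are uniformly random, which perceptual hashes are not (similar images are meant to produce similar hashes), so treat the numbers as an optimistic lower bound. The function name `collisionProbability` is made up for illustration, and the `bits * bits` output length is inferred from the outputs shown in this thread.

```javascript
// Idealized birthday-bound estimate of at least one collision.
// ASSUMPTION: hashes are uniformly random. Real perceptual hashes are not,
// so actual collision rates will be higher than these figures.
function collisionProbability(imageCount, hashBits) {
  const pairs = (imageCount * (imageCount - 1)) / 2;
  const space = Math.pow(2, hashBits);
  // 1 - exp(-pairs / space), using expm1 to keep precision for tiny values
  return -Math.expm1(-pairs / space);
}

// Output length appears to be bits * bits (4 -> 16, 8 -> 64, 12 -> 144).
for (const bits of [4, 8, 12, 16]) {
  const hashBits = bits * bits;
  const p = collisionProbability(10000, hashBits);
  console.log(`bits=${bits} (${hashBits}-bit hash): p ~ ${p.toExponential(2)}`);
}
```

Under this idealization a 16-bit hash is all but guaranteed to collide across ten thousand images, while longer hashes push the estimate toward zero. How similar the images are still dominates in practice.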

snoopy83101 commented 4 years ago

@pwlmaciejewski Hi, I also want to ask about the probability of collisions.

imghash.hash(o.path, 12, "binary"): 111111100011111110000000111100000000101110010000011100000111111010000111100000000110100111111110100111000110110010000111011000000111111000001111

imghash.hash(o.path): f884c4d8d1193c07

Which one has the greater probability of repetition?

Which one will consume more system resources?

pwlmaciejewski commented 4 years ago

@snoopy83101

Both take a similar amount of time:

# imghash.hash(o.path, 12, "binary")
time sh -c 'for i in {1..200}; do imghash -b 12 -f binary Lenna.png > /dev/null; done;'
sh -c   83,74s user 5,10s system 286% cpu 31,044 total
# imghash.hash(o.path)
time sh -c 'for i in {1..200}; do imghash Lenna.png > /dev/null; done;'
sh -c 'for i in {1..200}; do imghash Lenna.png > /dev/null; done;'  84,06s user 5,31s system 297% cpu 30,002 total

Resource consumption depends mainly on the images you process, so both calls should use a similar amount of resources as well.

As for the probability of collision, please refer to the blockhash algorithm page.

For images in general, the algorithm generates the same blockhash value for two different images in 1% of the cases (data based on a random sampling of 100,000 images).

For photographs, the algorithm generates practically unique blockhashes, but for icons, clipart, maps and other images, the algorithm generates less unique blockhashes. Larger areas of the same color in an image, either as a background or borders, result in hashes that collide more frequently.
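
On the earlier question of which of the two hashes is more likely to repeat: the output format only changes how the hash is written, so the comparison is easier once both are expressed in bits. The sketch below converts the hex hash from this thread to binary; `hexToBinary` is an illustrative helper, not an imghash function.

```javascript
// Convert a hex hash to a binary string, 4 bits per hex digit.
function hexToBinary(hex) {
  return hex
    .split("")
    .map((digit) => parseInt(digit, 16).toString(2).padStart(4, "0"))
    .join("");
}

const defaultHash = "f884c4d8d1193c07"; // from imghash.hash(o.path) in this thread
const asBinary = hexToBinary(defaultHash);

console.log(asBinary.length); // number of bits carried by the default hash
console.log(asBinary);
```

The 16 hex characters carry fewer bits than the binary string returned for `bits = 12`, so, all else being equal, the default hash leaves less room to tell images apart.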