tattle-made / DAU

MCA Tipline for Deepfakes
GNU General Public License v3.0

Count occurrences of the same media item #89

Open dennyabrain opened 6 months ago

dennyabrain commented 6 months ago

For every media item that we receive on the tipline, we need to show users how many occurrences of this exact file exist on the server. Given our infra, the scope of this task is to

aatmanvaidya commented 6 months ago

1. Identify the best technique to do this

All the popular methods use hashing

filecmp

Audio similarity

Video Similarity

aatmanvaidya commented 6 months ago

In the short term, hashing is the way to check whether two files are the same. It is also the fastest way to do so.
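As a rough illustration of the hashing approach, here is a minimal sketch that hashes each file with BLAKE2b from Python's standard `hashlib` and counts how many files share a digest. The function names `file_digest` and `count_occurrences` are hypothetical, and reading files from local paths is an assumption; the thread does not say how the tipline stores media. This only detects byte-for-byte identical files, so a re-encoded copy of the same clip would get a different digest.

```python
import hashlib
from collections import Counter


def file_digest(path, chunk_size=1 << 20):
    """Return the BLAKE2b hex digest of a file, read in fixed-size chunks."""
    hasher = hashlib.blake2b()
    with open(path, "rb") as f:
        # Read in chunks so large media files are never fully loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def count_occurrences(paths):
    """Map each digest to the number of files that share it."""
    return Counter(file_digest(p) for p in paths)
```

With this, the occurrence count for a newly received file would be `count_occurrences(all_paths)[file_digest(new_path)]`. In practice the digest would more likely be stored alongside each item and counted with a database query, but that depends on infra not described here.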

Are SHA-256 and SHA-512 collision resistant?

We call an event is-not-gonna-happen if it has probability <1/2^100

You can use any 512-bit cryptographic hash function, such as SHA-512, SHA3-512, or BLAKE2b, without fear of collision. You may also look at BLAKE2b, which is quite fast compared to the alternatives, and its parallel version, BLAKE3.
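To put the "is-not-gonna-happen" threshold in context, the sketch below evaluates the standard birthday upper bound: for n files and a b-bit hash behaving like a random function, the chance of any accidental collision is below n^2 / 2^(b+1). The file count of 2^40 is a hypothetical figure chosen for illustration, not something from the tipline, and the helper name is made up.

```python
import math


def log2_collision_bound(n_files, digest_bits):
    """log2 of the birthday upper bound n^2 / 2^(digest_bits + 1)."""
    return 2 * math.log2(n_files) - (digest_bits + 1)


# Hypothetical corpus of 2**40 (about a trillion) files.
print(log2_collision_bound(2**40, 256))  # -177.0
print(log2_collision_bound(2**40, 512))  # -433.0
```

Both values are far below the -100 threshold quoted above, so under this idealised model accidental collisions are not a practical concern for either digest size.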

dennyabrain commented 6 months ago

In the short term, hashing is the way to check whether two files are the same. It is also the fastest way to do so.

Great. Then let's move on to checking whether they apply to our use case. Do share the various ways you test media items received on WhatsApp, so that we keep a log of what worked and what did not.

aatmanvaidya commented 6 months ago

In the short term, hashing is the way to check whether two files are the same. It is also the fastest way to do so.

Great. Then let's move on to checking whether they apply to our use case. Do share the various ways you test media items received on WhatsApp, so that we keep a log of what worked and what did not.

Yes, I have started working on this. Can you also look at the updated comment with the Stack Overflow link that discusses how SHA-256 and SHA-512 are collision resistant?

We should consider using BLAKE3 over SHA-512. It is much faster.

aatmanvaidya commented 6 months ago

Time taken by BLAKE2b to hash audio and video files of different lengths and sizes

Audio

| Media type | Length | Time taken |
| --- | --- | --- |
| audio | 30s | 0.018s |
| audio | 60s | 0.027s |
| audio | 120s | 0.056s |
| audio | 300s | 0.122s |
| audio | 600s | 0.234s |
| audio | 1200s | 0.425s |
| audio | 1800s | 0.631s |

Video

| Media type | Length | Time taken |
| --- | --- | --- |
| video | 30s | 0.0081s |
| video | 60s | 0.013s |
| video | 300s | 0.022s |
| video | 600s | 0.074s |
| video | 1200s | 0.087s |
| video | 1800s | 0.148s |
| video | 3600s | 0.33s |
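The thread does not say how these timings were taken. One possible way to collect comparable numbers is sketched below, using `time.perf_counter` around a `hashlib.blake2b` call. The helper name `time_blake2b`, the repeat count, and the synthetic 8 MiB payload are all assumptions for illustration; real measurements would read the actual media files, and results vary by machine and file size.

```python
import hashlib
import time


def time_blake2b(data, repeats=5):
    """Return the best wall-clock time in seconds to hash `data` with BLAKE2b."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.blake2b(data).hexdigest()
        best = min(best, time.perf_counter() - start)
    return best


# Synthetic stand-in for a media file: 8 MiB of zero bytes.
payload = bytes(8 * 1024 * 1024)
print(f"{time_blake2b(payload):.4f}s")
```

Taking the minimum over a few repeats reduces noise from other processes. Note that hashing time depends on file size in bytes rather than playback length, so recording file sizes alongside durations would make the tables easier to interpret.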