protyposis / Aurio

Audio Fingerprinting & Retrieval for .NET
GNU Affero General Public License v3.0

Comparing short audio files #16

Closed galarlo closed 11 months ago

galarlo commented 1 year ago

Hi, I'm interested in finding near-duplicate audio files. My dataset is about 3000 short audio files, each between 0.5 and 5 seconds long. Unlike Shazam, both the "target" audio (i.e. the songs, in Shazam's case) and the user input are short, and both might contain noise.

Can this library help? If so, are there any recommendations for tuning parameters?

N.B - if a file is matched to multiple other files, it's fine - I have a less efficient algorithm that can verify which match is correct. In other words, I can handle some amount of false positives, but I don't want false negatives.

protyposis commented 1 year ago

Yes, if both your dataset and the user input contain audio signals from the same "root source" (in the Shazam example, e.g., the same version of the same song by the same artist), then it can work even when both contain noise. This library can help, but given the short dataset items and queries, you'll need a custom configuration tweaked to generate short fingerprints (below 0.5 seconds), and I can't give you a specific recommendation. It also makes sense to evaluate your dataset with all 4 fingerprint algorithms to find the one that works best for your use case.

galarlo commented 1 year ago

@protyposis Thanks very much for the quick response :) If it'll work, it'll solve a big problem for me.

Couple of questions:

  1. I've started playing around with the library, and noticed that it doesn't output matches between files. How can I transform Aurio's outputs into file matches?

  2. It also makes sense to evaluate your dataset with all 4 fingerprint algorithms to find the one which works best for your use case.

Correct me if I'm wrong, but AcoustID doesn't seem appropriate for my use case. My use case is similar to Shazam's, in which the audio can be recorded on a phone's microphone with background noise (e.g., in a bar).
I'm suspicious of AcoustID for the following reasons:

protyposis commented 1 year ago

  1. If you follow this example, you get a list of matches as a result, where each match object refers to a tuple of AudioTracks that reference the audio files (audioTrack.FileInfo).

  2. For noise robustness, I suggest starting with the Shazam and Philips fingerprints. Echoprint and Chromaprint are basically weaker implementations of these two but might be safer to use if patents are a concern. All four can handle noise, but the first two are much more robust; their applicability ultimately depends on how strong the noise is, and that is for you to decide since you know your data.

     It seems to me that AcoustID is a specific use case of Chromaprint, deliberately tuned to handle full music tracks. The underlying Chromaprint algorithm itself generates more fine-grained fingerprints and is technically capable of matching short files and subsections of files. However, as mentioned, it is certainly not the best performer, and it works better the longer the compared audio signals are.

     https://github.com/protyposis/AudioAlign makes it quite simple to compare the algorithms. Just drag a set of samples into it, go to Match & Align, select the desired fingerprint, and click Find Matches. This uses the default settings though, and as mentioned in my previous comment, you will probably have to tweak the settings for your use case, e.g., by changing parameters in the DefaultProfile of each fingerprint.
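To illustrate the first point in a library-agnostic way: once each match has been reduced to its two file paths and a similarity score (the tuple layout below is a hypothetical simplification, not Aurio's API), collapsing subsection matches into file-level matches is a simple grouping step:

```python
# Sketch (not Aurio API): collapse subsection matches into file-pair matches.
# Each match is assumed to be (file_a, file_b, similarity).
from collections import defaultdict

def file_pair_matches(matches):
    """Group subsection matches by unordered file pair, keeping the best score."""
    pairs = defaultdict(list)
    for file_a, file_b, similarity in matches:
        key = tuple(sorted((file_a, file_b)))  # treat (a, b) and (b, a) as one pair
        pairs[key].append(similarity)
    # Any pair with at least one subsection match is a candidate duplicate;
    # report its best subsection similarity.
    return {pair: max(scores) for pair, scores in pairs.items()}

matches = [
    ("a.wav", "b.wav", 0.9),
    ("b.wav", "a.wav", 0.7),  # same pair, reversed order
    ("a.wav", "c.wav", 0.4),
]
print(file_pair_matches(matches))
# → {('a.wav', 'b.wav'): 0.9, ('a.wav', 'c.wav'): 0.4}
```

Since false positives are tolerable in your setup, every pair that appears in this dictionary could be handed to your slower verification algorithm.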
galarlo commented 1 year ago

@protyposis thanks Mario.

  1. I've played with that example. From what I understand, it returns matches between subsections of the audio files. However, I'm interested in finding similarity between whole files (where one file may be only a subset of the other, like in Shazam). How do I transform the subsection matches into whole-file matches? I have some naive ideas about how to do it (e.g. averaging the subsection match similarity scores, weighted by lengths), but I'd like to know if there's a better recommendation.

  2. Thanks, very informative.
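The length-weighted averaging idea from point 1 can be sketched generically (illustrative Python, not Aurio's API; the `(length, similarity)` tuple layout is an assumption):

```python
# Sketch: combine subsection match scores into one whole-file score,
# weighting each subsection's similarity by its length.

def weighted_similarity(section_matches):
    """Average subsection similarities, weighted by subsection length (seconds)."""
    total_length = sum(length for length, _ in section_matches)
    if total_length == 0:
        return 0.0  # no matched subsections at all
    return sum(length * sim for length, sim in section_matches) / total_length

print(weighted_similarity([(2.0, 0.75), (2.0, 0.25)]))  # → 0.5
```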

protyposis commented 1 year ago

If you really want to make sure that a track matches across its entire runtime, then you need to assert that there are matching fingerprints across the whole duration, e.g., for music tracks, make sure there is a match at least every 15 seconds from start to end, but permit a few gaps too (for robustness against silent or overly noisy sections). The similarity scores are mainly meant for ranking results, so better not to average the raw results, as bad matches will spoil your average. Rather, add a filtering step that first picks a sequence of the best ones, e.g., by cutting a music track into sections of 15 seconds and picking the best match in each, and then calculate the average if needed, like you suggested.

Also, all returned matches are basically considered true positives, as false positives are not returned (this can be tweaked, though). Keep in mind that this is a special use case, really only needed if your data contains remixed/concatenated signals. Normally a few seconds are enough to reliably identify a piece.
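The filtering step described above can be sketched in a library-agnostic way (Python; the window size, gap tolerance, and `(position, similarity)` match layout are assumptions for illustration, not Aurio's API):

```python
# Sketch: verify a track matches across its entire runtime by cutting it into
# fixed windows, keeping the best match per window, tolerating a few empty
# windows (gaps), and averaging the survivors.

def coverage_score(matches, duration, window=15.0, max_gaps=2):
    """Return the average of per-window best similarities, or None if more
    than max_gaps windows contain no match (i.e. the track is not matched
    across its whole duration)."""
    n_windows = max(1, int(duration // window) + (1 if duration % window else 0))
    best = [None] * n_windows
    for pos, sim in matches:
        i = min(int(pos // window), n_windows - 1)  # clamp matches at the very end
        if best[i] is None or sim > best[i]:
            best[i] = sim
    gaps = sum(1 for b in best if b is None)
    if gaps > max_gaps:
        return None  # too many unmatched windows: reject
    survivors = [b for b in best if b is not None]
    return sum(survivors) / len(survivors)
```

For example, a 45-second track with window 15 has three windows; only the best match inside each window contributes to the final average, so a single bad low-scoring match can no longer drag the result down.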

In your case, you'll have to do that in 0.5 or 1 second intervals and, as mentioned, tweak the profiles for shorter fingerprints. The default fingerprint length of the Shazam fingerprint is 0.5 to 22 seconds (might work in your case out of the box); the Philips fingerprint defaults to 8 seconds (won't work in your case ootb).

protyposis commented 11 months ago

Closing due to inactivity.