madhavlab / audsearch


Performance on FMA-large not so great #1

Open shenberg opened 8 months ago

shenberg commented 8 months ago

Hi,

Thank you for this repository! I'm exploring the space of audio fingerprints, and yours is the only modern repository that just works. With very minor modifications, I got it running on a Mac, with a newer version of PyTorch, on the GPU, for big performance gains!

I attempted to generate a database for FMA-large by downloading your model weights and modifying the relevant configurations. I encountered three problems along the way:

  1. When normalizing the vectors (emb_db/np.linalg.norm(emb_db, axis=1).reshape(-1,1)), the math was performed in float16 (utils/dataclass.py) and the norm calculation overflowed for some fingerprints. Casting emb_db to np.float32 solved this (it's enough to do this only for the norm calculation).
  2. The metadata similarly overflowed (the largest finite float16 value is 65504, and there are more than 65k songs). Again, moving to float32 solved the issue.
  3. faiss crashed in several ways. According to the faiss authors (in a GitHub issue comment), only the conda install is supported; a pip install may work but is unsupported. Mixing pip-installed pytorch and faiss also causes issues with OpenMP. My solution was to conda install as much as possible: first conda install -c pytorch faiss-cpu pytorch::pytorch torchvision torchaudio, then conda install scipy matplotlib, and only then pip install natsort pytorch-lightning==1.9.5 soundfile.
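
For anyone hitting the first two problems, here is a minimal sketch of the float16 overflow and the float32 workaround. It uses synthetic data, not the repository's code: the function name normalize_rows, the array shapes, and the assumption that the metadata holds integer song indices are all illustrative.

```python
import numpy as np


def normalize_rows(emb_db: np.ndarray) -> np.ndarray:
    """L2-normalize each row, computing only the norm in float32.

    Hypothetical helper, not part of the repository.
    """
    norms = np.linalg.norm(emb_db.astype(np.float32), axis=1, keepdims=True)
    # Dividing float16 by float32 yields float32; cast back to keep storage small.
    return (emb_db / norms).astype(emb_db.dtype)


# Synthetic float16 "fingerprints" whose squared entries exceed the float16 range.
emb_db = np.full((4, 128), 300.0, dtype=np.float16)

with np.errstate(over="ignore", invalid="ignore"):
    # Norm computed entirely in float16: 300 * 300 already overflows to inf.
    naive_norm = np.linalg.norm(emb_db, axis=1)

safe = normalize_rows(emb_db)

# Assumed metadata layout: one integer song index per fingerprint.
ids = np.arange(70000)
with np.errstate(over="ignore"):
    ids16 = ids.astype(np.float16)  # indices near 65.5k and above become inf
ids32 = ids.astype(np.float32)      # exact for integers up to 2**24

print(np.isinf(naive_norm).all())   # True
print(np.finfo(np.float16).max)     # 65504.0
print(np.isinf(ids16).any())        # True
```

Note that float16 also stops representing every integer exactly above 2048, so if the metadata really does store indices, an integer dtype such as np.int32 may be a safer choice than any float type.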

Anyhow, after I got all the issues sorted out, I generated some 10,000 clean 8-second queries from fma-large and queried them against the DB. My accuracy was ~91%, and on listening to a few of the mistakes, they were "reasonable". When I tested it against queries with degradations, accuracy dropped to 21% (some details: noise is from TUT, SNR between 0 and 5, RIR convolutions from the MIT RIR survey, highpass filter randomly between 0-30 Hz; I didn't invent this, it is taken from the [https://github.com/deezer/musicFPaugment] configuration 'full_light'). Note that these same settings, but at 8 kHz, got ~65% accuracy with audfprint.
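
To make the noise setting concrete, here is a rough sketch of mixing background noise into a clean query at a target SNR. It is not the musicFPaugment implementation: the function name mix_at_snr, the assumption that the SNR is given in dB, and the synthetic signals are all illustrative.

```python
import numpy as np


def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to clean so that the clean-to-noise power ratio equals snr_db.

    Hypothetical helper; assumes both signals are non-silent mono arrays.
    """
    clean = clean.astype(np.float32)
    # Tile or truncate the noise so it matches the query length.
    noise = np.resize(noise.astype(np.float32), clean.shape)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose gain so that 10 * log10(p_clean / (gain**2 * p_noise)) == snr_db.
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise


# Synthetic stand-ins for an 8-second query at 16 kHz and a noise recording.
sr = 16000
t = np.arange(8 * sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 440.0 * t)
noise = np.random.default_rng(0).normal(size=3 * sr)

mixed = mix_at_snr(clean, noise, snr_db=5.0)

# Measure the SNR that was actually achieved.
residual = mixed - clean.astype(np.float32)
achieved_db = 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))
```

Measuring achieved_db on a few generated queries is a cheap way to confirm that the degradation pipeline applies the SNR you think it does before comparing systems.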

Can you help me figure out the difference? (This is the model from "Attention-based Audio Embeddings for Query-by-Example," right?)

anupsingh15 commented 7 months ago

Hi @shenberg

The current source code is intended for indexing smaller databases. Hence, I used float16 wherever possible to avoid excessive memory usage. A possible source of error is how you preprocess the audio you input to the model, or how you add distortions to the audio query. Note that we resample the audio to 16 kHz and do not perform any filtering as a preprocessing step. Do you use our modules to load and add distortions to clean audio segments?

shenberg commented 7 months ago

Hiya, thanks for the response!

I use different code to perform the distortions, as I'm comparing performance to audfprint and other systems. I used the code at [https://github.com/deezer/musicFPaugment], modified to generate 16 kHz distorted samples and to save them as .wav files.

Note that audfprint configured to downsample the same files to 8 kHz achieved ~70% accuracy.