Open · shenberg opened this issue 8 months ago
Hi @shenberg
The current source code is intended for indexing smaller databases, hence I used float16 wherever possible to avoid excessive memory usage. A possible source of error is how you preprocess the audio you input to the model, or how you add distortions to the audio query. Note that we resample the audio to 16 kHz and do not perform any filtering as a preprocessing step. Do you use our modules to load and add distortions to clean audio segments?
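Concretely, the preprocessing amounts to something like the following (a minimal sketch using torchaudio for illustration; the loader in this repo may differ in details such as how multi-channel audio is handled):

```python
import torchaudio

# Sketch of the preprocessing described above: resample to 16 kHz, no
# filtering. torchaudio is used for illustration; the repo's own loader
# may differ (e.g., in how it mixes down multi-channel audio).
def load_audio(path, target_sr=16000):
    waveform, sr = torchaudio.load(path)           # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)  # mono mixdown (assumption)
    if sr != target_sr:
        waveform = torchaudio.transforms.Resample(sr, target_sr)(waveform)
    return waveform
```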
Hiya, thanks for the response!
I use different code to perform the distortions, as I'm comparing performance to audfprint and other systems. I used the code at https://github.com/deezer/musicFPaugment, modified to generate 16 kHz distorted samples and to save them as .wav files.
Note that audfprint, configured to downsample the same files to 8 kHz, achieved ~70% accuracy.
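Roughly, my modification just resamples the augmented output and writes it out (a sketch; the actual musicFPaugment augmentation call is elided):

```python
import librosa
import soundfile as sf

# Sketch of how I save the distorted queries at 16 kHz (the augmentation
# itself comes from musicFPaugment and is elided here).
def save_query(audio, orig_sr, out_path, target_sr=16000):
    if orig_sr != target_sr:
        audio = librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
    sf.write(out_path, audio, target_sr)
```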
Hi,
Thank you for this repository! I'm exploring the space of audio fingerprinting, and yours is the only modern repository that just works. With very minor modifications, I got it running on a Mac, with a newer version of pytorch, on the GPU, for big performance gains!
I attempted to generate a database for FMA-large by downloading your model weights and modifying the relevant configurations. I encountered three problems along the way.

First, when normalizing the database (`emb_db / np.linalg.norm(emb_db, axis=1).reshape(-1, 1)`), the math was performed in float16 (`utils/dataclass.py`) and the norm calculation overflowed for some fingerprints. Casting `emb_db` to `np.float32` solved this (it's enough to do this only for the norm calculation).
Second, `faiss` crashed in many ways. The reason is that only `conda install` is supported, while `pip install` may work but is unsupported according to the authors (GitHub issue comment here), and mixing pip-installed pytorch and faiss causes issues with OpenMP. My solution was to conda install as much as possible (`conda install -c pytorch faiss-cpu pytorch::pytorch torchvision torchaudio`, then `conda install scipy matplotlib`) and only then `pip install natsort pytorch-lightning==1.9.5 soundfile`.
Third, accuracy on distorted queries. After I got all the installation issues sorted out, I generated some 10,000 clean 8-second queries from fma-large and queried them against the DB. My accuracy was ~91%, and on listening to a few of the mistakes, they were "reasonable". When I tested against queries with degradations, accuracy dropped to 21% (some details: the noise is from TUT, SNR between 0 and 5 dB, RIR convolutions from the MIT RIR survey, highpass filter with cutoff chosen randomly between 0 and 30 Hz; I didn't invent this, it's taken from the https://github.com/deezer/musicFPaugment 'full_light' configuration). Note that these same settings, but at 8 kHz, got ~65% accuracy with `audfprint`.
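For concreteness, the degradation chain is essentially this (an illustrative sketch of the 'full_light' recipe, not musicFPaugment's actual code; I clamp the highpass cutoff away from 0, since a 0 Hz cutoff is a no-op):

```python
import numpy as np
from scipy.signal import butter, fftconvolve, sosfilt

# Illustrative sketch of the 'full_light'-style degradation described above;
# not musicFPaugment's actual implementation. Assumes the noise clip is at
# least as long as the signal.
def degrade(clean, noise, rir, sr=16000):
    out = fftconvolve(clean, rir)[: len(clean)]  # room reverb (MIT RIR survey)
    snr_db = np.random.uniform(0.0, 5.0)         # target SNR in [0, 5] dB
    noise = noise[: len(out)]
    gain = np.sqrt(np.mean(out**2) / (np.mean(noise**2) * 10 ** (snr_db / 10)))
    out = out + gain * noise                     # additive noise (TUT) at target SNR
    cutoff = np.random.uniform(1.0, 30.0)        # highpass cutoff in (0, 30] Hz
    sos = butter(2, cutoff, btype="highpass", fs=sr, output="sos")
    return sosfilt(sos, out)
```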
Can you help me figure out the difference? (This is the model from "Attention-based Audio Embeddings for Query-by-Example," right?)