mimbres / neural-audio-fp

https://mimbres.github.io/neural-audio-fp
MIT License

finetuning on short audios #37

Closed · haiderasad closed 1 year ago

haiderasad commented 1 year ago

hi, I have about 900 short audio clips of users speaking. After training the main model for 20 epochs (it converges very well), I try to fine-tune it on my own data, but after 130 epochs the loss does not converge and stays in the range of 0.95-0.99. Are there any configurations I am missing that need to be altered for non-music data?

mimbres commented 1 year ago

@haiderasad Hi, thanks for your interest in our work. Unfortunately, I haven't tried fine-tuning on speech data. Let me first clarify your downstream task. Do you want to search for the exact location of a 1 s input segment within the speech audio samples? If so:

  1. Have you tried freezing some parts of the network?
  2. How long are the 900 short speech clips in total? If the data is small, I think you can use an external speech dataset such as LibriSpeech (clean) or CommonVoice (noisy).
  3. From my experience, the choice of batch-sampling method can be critical in fine-tuning: for example, sampling a portion of each training batch from the external data and the other portion from your own data (see the sketch after this list).
  4. A larger batch size helps, and a slightly smaller learning rate would also help.
  5. A proper augmentation method is critical. Make sure that speech augmentation is turned off in your config file.
  6. Turn off the default cosine scheduler, and modify the schedule after observing the learning curve with a small learning rate.
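
A minimal sketch of points 1, 3, 4 and 6 in TensorFlow/Keras. The dataset pipelines, the layer split point, and all hyperparameter values below are hypothetical placeholders, not this repo's actual training code:

```python
import tensorflow as tf

def freeze_front_end(model: tf.keras.Model, n_trainable_tail: int = 4):
    """Point 1: freeze everything except the last few layers.
    The split point (4) is a guess; tune it for your model."""
    for layer in model.layers[:-n_trainable_tail]:
        layer.trainable = False
    return model

def mixed_batches(own_ds: tf.data.Dataset, ext_ds: tf.data.Dataset,
                  batch_size: int = 240, own_fraction: float = 0.5):
    """Point 3: draw a fixed portion of each batch from your own data and
    the rest from an external corpus such as LibriSpeech. Both datasets
    are assumed to yield individual audio segments of equal shape."""
    n_own = int(batch_size * own_fraction)
    own = own_ds.shuffle(1000).repeat().batch(n_own)
    ext = ext_ds.shuffle(1000).repeat().batch(batch_size - n_own)
    # Stitch the two halves together along the batch axis.
    return tf.data.Dataset.zip((own, ext)).map(
        lambda a, b: tf.concat([a, b], axis=0))

# Points 4 and 6: larger batch, smaller constant learning rate,
# cosine scheduler turned off.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
```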

As for the loss not converging on a small dataset:

  1. Speech samples may contain a lot of silence. In that case, using a longer segment length such as 2 s would help.
  2. Using a larger temperature `tau` (still < 1.0) would help. Although there is a trade-off, this can resolve the underfitting issue. You can modify the value in the config file; a sketch of where `tau` enters the loss follows below.
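
For reference, here is a generic NT-Xent-style contrastive loss sketch showing the role of `tau`; it illustrates the idea rather than reproducing this repo's exact loss implementation, and the default value below is a placeholder:

```python
import tensorflow as tf

def ntxent_loss(emb_a: tf.Tensor, emb_b: tf.Tensor, tau: float = 0.05):
    """emb_a, emb_b: (N, d) L2-normalized embeddings of two views
    (e.g., original and augmented replicas) of the same N segments."""
    n = tf.shape(emb_a)[0]
    z = tf.concat([emb_a, emb_b], axis=0)            # (2N, d)
    sim = tf.matmul(z, z, transpose_b=True) / tau    # tau-scaled similarities
    sim -= tf.eye(2 * n) * 1e9                       # mask self-similarity
    # The positive for row i is its counterpart in the other view.
    labels = tf.concat([tf.range(n, 2 * n), tf.range(0, n)], axis=0)
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=sim))
```

A larger `tau` shrinks the scaled similarities, so the softmax over candidates becomes softer and the loss penalizes near-misses less harshly; that is why raising `tau` can relieve underfitting at some cost in discriminability.
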
haiderasad commented 1 year ago

@mimbres thanks for replying. To answer your questions:

  1. My downstream task is "audio password verification": I want to do 1-to-1 matching between two audio clips and verify whether the spoken password is the same or not.
  2. No, I have not frozen any layers yet, but I will try this approach.
  3. The short clips range between 1-10 seconds of spoken sentences (silence has been trimmed away using a voice activity detection module). I also observed that the model does not train on 1-second clips (it gives a low-value error).
  4. I was also suspecting that the augmentations were destroying the data at train time; I will turn off the speech augmentation.
mimbres commented 1 year ago

@haiderasad I think audio password verification would require both 1. speaker identification and 2. audio content matching. 1 is a relatively well-established area. As for 2, most existing ASR (automatic speech recognition) models would not recognize non-speech content (e.g., whistling) well; 2 would involve modeling a rough contour of the fundamental pitch and timbre changes.

In general, audio fingerprinting (FP) defines the concept of "content" differently. FP is designed to identify exactly the same source. The human voice changes a little bit each time it sounds, which is very different from the case where you record your voice once and play it back on another device. So, in my opinion, FP is probably not suitable for your innovative project...

haiderasad commented 1 year ago

@mimbres thanks for the valuable insight. Regarding the technique for 1-to-1 matching of audio passwords, I have two checks:

  1. Number of embeddings generated: e.g., if the registered password fingerprint has shape (4, 512), the incoming audio must yield no more than 5-6 and no fewer than 3 (an offset) of the 512-dim embeddings for the first check to pass.
  2. Apply a maximum inner-product score: among the "n x m" dot-product scores, at least one must be above a specific threshold (sketched below).

Can you verify whether the MIPS method mentioned above is correct for my use case? I do not want to use the FAISS library, because my use case is small audio clips with 1-to-1 matching.
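
For concreteness, here is a minimal NumPy sketch of the two checks above; the function name, the count offset, and the similarity threshold are hypothetical placeholders:

```python
import numpy as np

def verify_password(reg_fp: np.ndarray, query_fp: np.ndarray,
                    offset: int = 1, sim_threshold: float = 0.85) -> bool:
    """reg_fp: (n, 512) registered fingerprint; query_fp: (m, 512) incoming
    fingerprint. Both are assumed to be L2-normalized embeddings."""
    n, m = len(reg_fp), len(query_fp)
    # Check 1: embedding counts must roughly match.
    # With n = 4 and offset = 1 this accepts m in [3, 6], as described above.
    if not (n - offset <= m <= n + offset + 1):
        return False
    # Check 2: at least one of the n x m inner products must exceed
    # the threshold (the threshold value here is a guess).
    scores = reg_fp @ query_fp.T        # (n, m) similarity matrix
    return bool(scores.max() > sim_threshold)
```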

mimbres commented 1 year ago

@haiderasad Yes, your exhaustive MIPS search is the same as FAISS without quantization (i.e., an exact flat inner-product index).
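
To illustrate the equivalence, a tiny sketch comparing the exhaustive NumPy search with FAISS's exact, non-quantized inner-product index, using random placeholder vectors:

```python
import numpy as np
import faiss

d = 512
reg = np.random.randn(4, d).astype('float32')
reg /= np.linalg.norm(reg, axis=1, keepdims=True)     # L2-normalize
query = np.random.randn(5, d).astype('float32')
query /= np.linalg.norm(query, axis=1, keepdims=True)

# Exhaustive MIPS with plain NumPy.
np_best = (query @ reg.T).max(axis=1)                 # best score per query row

# The same exact search with FAISS's flat inner-product index.
index = faiss.IndexFlatIP(d)
index.add(reg)
D, I = index.search(query, 1)
assert np.allclose(D[:, 0], np_best)                  # identical up to float error
```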