Hi, I have about 900 short audio clips of users speaking. After training the main model for 20 epochs (it converges very well), I try to fine-tune it on my own data, but after 130 epochs the loss does not converge, staying in the range of 0.95-0.99. Are there any configurations I am missing that need to be altered for non-music data?
@haiderasad Hi, thanks for your interest in our work. Unfortunately, I haven't tried fine-tuning on speech data. Let me first clarify your downstream task. Do you want to search for the exact location of a 1 s input segment within speech audio samples? If so:
As for the loss not converging on a small dataset:
@mimbres Thanks for replying. To answer your questions:
@haiderasad I think audio password verification would require both (1) speaker identification and (2) audio content matching. (1) is a relatively well-established area. As for (2), most existing ASR (automatic speech recognition) models would not recognize non-speech content (e.g., whistling) well. (2) would involve roughly modeling the contour of the fundamental pitch and the changes in timbre.
In general, audio fingerprinting (FP) defines the concept of "content" differently. FP is designed to identify exactly the same source. The human voice changes a little every time it is produced, which is very different from recording your voice once and playing it back on another device. So, in my opinion, FP is probably not suitable for your innovative project...
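To make (2) concrete, here is a minimal sketch (not part of this repo) of comparing two utterances by a rough pitch contour plus a timbre sequence, aligned with DTW. It assumes librosa is installed; the file names, sample rate, and feature choices are illustrative assumptions, not a prescribed method.

```python
import numpy as np
import librosa

def contour_features(path):
    """Extract a rough pitch contour (f0 track) and a timbre
    sequence (MFCCs) from one utterance."""
    y, sr = librosa.load(path, sr=16000)
    f0, _, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"), sr=sr)
    f0 = np.nan_to_num(f0)  # pyin marks unvoiced frames as NaN
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return f0, mfcc

def dtw_cost(a, b):
    """Path-normalized DTW alignment cost between two feature
    sequences of shape (n_features, n_frames)."""
    D, wp = librosa.sequence.dtw(X=a, Y=b)
    return float(D[-1, -1]) / len(wp)

# "enroll.wav" / "attempt.wav" are hypothetical file names.
f0_a, mfcc_a = contour_features("enroll.wav")
f0_b, mfcc_b = contour_features("attempt.wav")
pitch_cost = dtw_cost(f0_a[None, :], f0_b[None, :])
timbre_cost = dtw_cost(mfcc_a, mfcc_b)
print(pitch_cost, timbre_cost)  # lower cost = more similar content
```

A real password check would still need a decision threshold tuned on genuine/impostor pairs, which this sketch leaves out.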
@mimbres Thanks for the valuable insight. Regarding the technique for 1-to-1 matching of audio passwords, I have two checks:
Can you verify whether the MIPS method mentioned above is correct for my use case? I do not want to use the FAISS library, because my use case is 1-to-1 matching of small audio clips.
@haiderasad Yes, your exhaustive search by MIPS is the same as what FAISS does without quantization.
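For reference, a minimal sketch of such an exhaustive MIPS check in NumPy; `db` is a stand-in for your fingerprinter's L2-normalized embeddings, and the 900 x 128 shape is an illustrative assumption:

```python
import numpy as np

def exhaustive_mips(query, db):
    """Return the index and score of the database row with the
    largest inner product against the query embedding."""
    scores = db @ query            # (N,) inner products
    best = int(np.argmax(scores))
    return best, float(scores[best])

# Stand-in for 900 L2-normalized fingerprint embeddings of dim 128.
rng = np.random.default_rng(0)
db = rng.normal(size=(900, 128)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

# A slightly perturbed copy of entry 42 acts as the query.
query = db[42] + 0.01 * rng.normal(size=128).astype(np.float32)
query /= np.linalg.norm(query)
print(exhaustive_mips(query, db))  # -> (42, score close to 1.0)
```

With L2-normalized embeddings the inner product equals cosine similarity, so this brute-force scan gives the same result as a flat (unquantized) inner-product index in FAISS.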