Text-by-Audio - Githubissues

snaaz21 commented 2 years ago

Hi,

I want to retrieve text by searching with an audio query, using the AudioCLIP model.

First, I created an index of text labels (car-horn, coughing, alarm-clock, thunderstorm, etc.), using AudioClipTextEncoder for the embeddings.

After that, I search with an audio file, using AudioClipEncoder for the embedding.

For both the text and the audio indexing, simple-indexer is used.

When searching with an audio file, I create chunks using a segmenter and rank the scores with MyRanker, whose script I modified slightly.
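As a rough illustration of the retrieval step described above, here is a minimal sketch that ranks text labels by cosine similarity against one audio embedding. The 3-dimensional vectors and the `rank_labels` helper are hypothetical stand-ins; they are not the actual encoder outputs, nor the MyRanker or simple-indexer code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_labels(audio_embedding, text_index):
    """Return (label, score) pairs sorted from best to worst match."""
    scored = [
        (label, cosine(audio_embedding, embedding))
        for label, embedding in text_index.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical text embeddings; real ones would come from the text encoder.
text_index = {
    "car-horn": [0.9, 0.1, 0.0],
    "coughing": [0.1, 0.9, 0.1],
    "thunderstorm": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a thunder recording from the audio encoder.
audio_embedding = [0.1, 0.3, 0.8]

ranking = rank_labels(audio_embedding, text_index)
best_label, best_score = ranking[0]
print(best_label, round(best_score, 3))
```

If both encoders map into the same embedding space, the top-ranked label should be the one that describes the audio.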

The output is:

input_audio: AudioCLIP/demo/audio/thunder_3-144891-B-19.wav, len-of-chunks:1
+- id: e5eceb26737311ec83fd2fdc0fc83b8e score: 0.931110, cat
input_audio: AudioCLIP/demo/audio/coughing_1-58792-A-24.wav, len-of-chunks:1
+- id: e5ecf968737311ec83fd2fdc0fc83b8e score: 0.923869, cat
input_audio: AudioCLIP/demo/audio/cat_3-95694-A-5.wav, len-of-chunks:1
+- id: e5ed00de737311ec83fd2fdc0fc83b8e score: 0.845475, cat
input_audio: AudioCLIP/demo/audio/car_horn_1-24074-A-43.wav, len-of-chunks:1
+- id: e5ed0778737311ec83fd2fdc0fc83b8e score: 0.928283, cat
input_audio: AudioCLIP/demo/audio/alarm_clock_3-120526-B-37.wav, len-of-chunks:1
+- id: e5ed0dcc737311ec83fd2fdc0fc83b8e score: 0.919222, alarm_clock

As shown above, only 2 of the 5 outputs are correct. Please let me know why this happens, since AudioClipEncoder gives correct outputs for all of the audio files.

JohannesMessner commented 2 years ago

Hi @snaaz21! @samsja and I are still investigating your issue and have found a few things:

It looks like you are not using the correct sampling rate. For .wav files this should be set to 44100, not 16000.
There also seems to be something wrong on our end at the Executor level. We are still looking into this and will report back once we know more. Thanks for your patience!
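To make the sampling-rate point concrete, here is a minimal sketch that reads the rate from a .wav header with Python's standard `wave` module and flags files that do not match 44100 Hz. The helper names and the generated silent test files are assumptions for illustration only; the sketch checks the header and does not resample anything.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 44100  # sampling rate named in the comment above


def wav_sample_rate(path):
    """Read the sampling rate stored in a .wav file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """True if the file's rate differs from the expected one."""
    return wav_sample_rate(path) != expected_rate


def write_silent_wav(path, rate, seconds=1):
    """Write a mono, 16-bit silent clip (illustration data only)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * rate * seconds)


with tempfile.TemporaryDirectory() as tmp_dir:
    low_rate_path = os.path.join(tmp_dir, "clip_16k.wav")
    ok_rate_path = os.path.join(tmp_dir, "clip_44k.wav")
    write_silent_wav(low_rate_path, 16000)
    write_silent_wav(ok_rate_path, 44100)
    low_rate_flag = needs_resampling(low_rate_path)
    ok_rate_flag = needs_resampling(ok_rate_path)

print(low_rate_flag, ok_rate_flag)
```

A check like this can run before the audio is handed to the encoder, so that a mismatch between the file and the configured rate is noticed early.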

JohannesMessner commented 2 years ago

Hey @snaaz21, we have found that there seems to be a problem with the integration between our executor and the AudioCLIP model. We are not entirely sure whether the bug is on our side or in the model itself, but we have created a quick fix (https://github.com/jina-ai/executors/pull/315). If you want to get correct results, taking the code from there should work. Please note that this fix is not release-ready yet, since we have not found the root cause. As such, the changes will not be applied to the published executor for now, and you would have to apply them yourself. I hope this works for your purposes!

snaaz21 commented 2 years ago

Ok @JohannesMessner thank you for the help.

snaaz21 commented 2 years ago

it worked, thanks

nan-wang commented 2 years ago

@snaaz21 Many thanks for raising this. As we dug deeper into the issue, we found that it is an upstream issue: the AudioCLIP model is not put into .eval() mode during inference.

As a result, batch normalization uses parameters that vary from batch to batch during inference, when they should be fixed, either to the values of the last batch or to global ones. You can check out the code snippets for this bug at https://github.com/jina-ai/executors/pull/315#issuecomment-1020783742.
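To show why this matters, here is a toy, pure-Python batch normalization over a single feature. It is a simplified sketch, not the AudioCLIP or PyTorch implementation; the `ToyBatchNorm` class is hypothetical. In training mode the output for one sample depends on the other samples in the batch, while in eval mode it depends only on the stored running statistics. In PyTorch, the corresponding switch is calling `model.eval()` before inference.

```python
import math


class ToyBatchNorm:
    """Single-feature batch normalization with train and eval modes."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.running_mean = 0.0
        self.running_var = 1.0
        self.momentum = momentum
        self.eps = eps
        self.training = True

    def eval(self):
        """Switch to inference mode: use the stored running statistics."""
        self.training = False
        return self

    def __call__(self, batch):
        if self.training:
            # Statistics come from the current batch and change per call.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Statistics are fixed, so the batch composition is irrelevant.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]


sample = 2.0  # the same input, placed in two different batches

train_bn = ToyBatchNorm()
train_a = train_bn([sample, 0.0])[0]
train_b = train_bn([sample, 10.0])[0]

eval_bn = ToyBatchNorm().eval()
eval_a = eval_bn([sample, 0.0])[0]
eval_b = eval_bn([sample, 10.0])[0]

print(train_a != train_b, eval_a == eval_b)
```

This is consistent with the symptom reported in this thread: the same audio clip can receive a different embedding depending on what else is processed alongside it.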

jina-ai / jina

Text-by-Audio #4163