jina-ai / jina

☁️ Build multimodal AI applications with cloud-native stack
https://docs.jina.ai
Apache License 2.0
20.93k stars 2.22k forks

Text-by-Audio #4163

Closed snaaz21 closed 2 years ago

snaaz21 commented 2 years ago

Hi,

I want to retrieve text by searching with an audio clip, using the AudioCLIP model.

First, I indexed the text labels (car-horn, coughing, alarm-clock, thunderstorm, etc.), using AudioClipTextEncoder for the embeddings.

After that, I search with an audio clip, using AudioClipEncoder for the embedding.

simple-indexer is used for both text and audio indexing.

When searching with an audio clip, I create chunks using a segmenter and rank the scores with MyRanker, in which I modified some of the script.

The output is:

input_audio: AudioCLIP/demo/audio/thunder_3-144891-B-19.wav, len-of-chunks:1
+- id: e5eceb26737311ec83fd2fdc0fc83b8e score: 0.931110, cat
input_audio: AudioCLIP/demo/audio/coughing_1-58792-A-24.wav, len-of-chunks:1
+- id: e5ecf968737311ec83fd2fdc0fc83b8e score: 0.923869, cat
input_audio: AudioCLIP/demo/audio/cat_3-95694-A-5.wav, len-of-chunks:1
+- id: e5ed00de737311ec83fd2fdc0fc83b8e score: 0.845475, cat
input_audio: AudioCLIP/demo/audio/car_horn_1-24074-A-43.wav, len-of-chunks:1
+- id: e5ed0778737311ec83fd2fdc0fc83b8e score: 0.928283, cat
input_audio: AudioCLIP/demo/audio/alarm_clock_3-120526-B-37.wav, len-of-chunks:1
+- id: e5ed0dcc737311ec83fd2fdc0fc83b8e score: 0.919222, alarm_clock

As shown above, only 2 of the 5 outputs are correct. Could you let me know why that is, given that AudioClipEncoder gives correct outputs for all the audio files?

JohannesMessner commented 2 years ago

Hi @snaaz21! @samsja and I are still investigating your issue, and have found some things:

JohannesMessner commented 2 years ago

Hey @snaaz21, we have found that there seems to be a problem with the integration between our executor and the AudioCLIP model. We are not entirely sure whether the bug is on our side or in the model itself, but we have created a quick fix here: https://github.com/jina-ai/executors/pull/315 If you want to get correct results, taking the code from there should work. Please note that this fix is not release-ready yet, since we have not found the root cause. As such, the changes will not be applied to the published executor for now, and you would have to apply them yourself. I hope this works for your purposes!

snaaz21 commented 2 years ago

Ok @JohannesMessner thank you for the help.

snaaz21 commented 2 years ago

It worked, thanks.

nan-wang commented 2 years ago

@snaaz21 Many thanks for raising this. As we dug deeper into the issue, we found that it is an upstream issue: the AudioCLIP model is not put into .eval() mode during inference.

This leads to batch normalization using varying parameters during inference, whereas they should be fixed, either to the values from the last batch or to global ones. You can check out the code snippets for this bug at https://github.com/jina-ai/executors/pull/315#issuecomment-1020783742.
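To illustrate why this matters, here is a minimal pure-Python sketch of a batch normalization layer with a training mode and an eval mode. It is not the AudioCLIP or PyTorch code: `ToyBatchNorm` and `normalized_first_item` are hypothetical names made up for this example, and the layer is simplified (scalar inputs, no learnable scale or shift, biased variance throughout). In PyTorch, calling `model.eval()` is what switches batch normalization layers from per-batch statistics to their stored running statistics.

```python
import math


class ToyBatchNorm:
    """Toy batch norm over a list of floats (illustrative, hypothetical class)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0  # "global" statistics, normally learned in training
        self.running_var = 1.0
        self.training = True     # modules start in training mode

    def eval(self):
        """Switch to inference mode: use the stored running statistics."""
        self.training = False
        return self

    def __call__(self, batch):
        if self.training:
            # Training mode: normalize with the statistics of this very batch,
            # so the output for one item depends on its neighbours.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Eval mode: fixed statistics, independent of the batch contents.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]


def normalized_first_item(layer, batch):
    """Return the normalized value of the first item in the batch."""
    return layer(batch)[0]


train_layer = ToyBatchNorm()
# Same input value 2.0, different batch companions: results differ (about 1.0 vs. -1.0).
print(normalized_first_item(train_layer, [2.0, 0.0]))
print(normalized_first_item(train_layer, [2.0, 10.0]))

eval_layer = ToyBatchNorm().eval()
# In eval mode the same input always maps to the same output.
print(normalized_first_item(eval_layer, [2.0, 0.0]))
print(normalized_first_item(eval_layer, [2.0, 10.0]))
```

In training mode the embedding of a query would shift with whatever else is in the batch, which is consistent with the unstable rankings reported above; in eval mode it would not.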