jina-ai / jina

☁️ Build multimodal AI applications with cloud-native stack
https://docs.jina.ai
Apache License 2.0
20.93k stars, 2.22k forks

Text-by-Audio #4163

Closed snaaz21 closed 2 years ago

snaaz21 commented 2 years ago

Hi,

I want to retrieve text by searching with an audio query, using the AudioCLIP model.

First, I indexed the text labels (car-horn, coughing, alarm-clock, thunderstorm, etc.), using AudioCLIPTextEncoder for the embeddings.

After that, I search with an audio file, using AudioCLIPEncoder for its embedding.

SimpleIndexer is used for both the text indexing and the audio search.

When searching with audio, I create chunks with a segmenter and rank the scores with MyRanker, in which I modified part of the script.

The output is:

input_audio: AudioCLIP/demo/audio/thunder_3-144891-B-19.wav, len-of-chunks:1
 +- id: e5eceb26737311ec83fd2fdc0fc83b8e score: 0.931110, cat
input_audio: AudioCLIP/demo/audio/coughing_1-58792-A-24.wav, len-of-chunks:1
 +- id: e5ecf968737311ec83fd2fdc0fc83b8e score: 0.923869, cat
input_audio: AudioCLIP/demo/audio/cat_3-95694-A-5.wav, len-of-chunks:1
 +- id: e5ed00de737311ec83fd2fdc0fc83b8e score: 0.845475, cat
input_audio: AudioCLIP/demo/audio/car_horn_1-24074-A-43.wav, len-of-chunks:1
 +- id: e5ed0778737311ec83fd2fdc0fc83b8e score: 0.928283, cat
input_audio: AudioCLIP/demo/audio/alarm_clock_3-120526-B-37.wav, len-of-chunks:1
 +- id: e5ed0dcc737311ec83fd2fdc0fc83b8e score: 0.919222, alarm_clock

As shown above, only 2 of the 5 inputs get the correct output. Please let me know why that is, since AudioCLIPEncoder gives correct outputs for all of the audio files.

JoanFM commented 2 years ago

Why do you consider these outputs incorrect? What is the expected result, and what score should it have?

snaaz21 commented 2 years ago

By "correct" I mean that it is not retrieving the expected text first: for "coughing.wav" the first match should be the text "coughing", and so on. The scores above are the minimum cosine similarity scores.

I am new to this. Please correct me if I did something wrong.

JoanFM commented 2 years ago

Can you paste the exact code you ran, including your modifications?

snaaz21 commented 2 years ago

app.py

from jina.types.document.generators import from_files
from jina import DocumentArray, Flow, Document

def check_query(resp):
    for d in resp.docs:
        print(f'input_audio: {d.uri}, len-of-chunks:{len(d.chunks)}')

        for m in d.matches:
            print(f' +- id: {m.id} score: {m.scores["cosine"].value:.6f}, {m.text}')

def main():
    doc_text = DocumentArray([Document(text='car_horn'),Document(text= "cat"), Document(text='alarm_clock'), Document(text='coughing'), Document(text='thunderstorm')])
    docs_test = DocumentArray(from_files('AudioCLIP/demo/audio/*.wav'))

    f = Flow.load_config('flow-index.yml')
    with f:
        f.post(on='/index', inputs=doc_text)
        print('Indexing completed...')
        # f.block()
    f = Flow.load_config(('flow-search.yml'))
    with f:
        print('Starting searching...')
        f.post(on='/search', inputs=docs_test, on_done=check_query)
        f.protocol = 'http'
        f.cors = True
        f.block()

if __name__ == '__main__':
    main()

snaaz21 commented 2 years ago

flow-search.yml

jtype: Flow
version: '1'
#with:
#  port_expose: 45678
executors:
  - name: 'audio_segmenter'
    uses: AudioSegmenter
    uses_with:
      window_size: 4
      stride: 2
    py_modules:
      - executors.py
  - name: 'audio_encoder'
    uses: AudioCLIPEncoder
    py_module : 'AudioCLIPEncoder/executor/audio_clip_encoder.py'
    uses_with:
      traversal_paths: 'r'
    needs: 'audio_segmenter'
  - name: 'audio_indexer'
    uses: SimpleIndexer
    py_modules: 'simple_indexer.py'
#    uses_with:
#      match_args:
#        limit: 60
#        traversal_rdarray: 'r'
#        traversal_ldarray: 'r'
    needs: 'audio_encoder'
  - name: 'ranker'
    uses: MyRanker
    py_modules:
      - executors.py

JoanFM commented 2 years ago

And what about flow-index.yml?

snaaz21 commented 2 years ago

flow-index.yml

version: '1'
executors:
 - name: 'text_encoder'
   uses: AudioCLIPTextEncoder
   py_module : 'AudioClipTextEncoder/executor/audioclip_text.py'
   uses_with:
    traversal_paths: 'r'
 - name: 'text_indexer'
   uses: SimpleIndexer
   py_module : 'simple_indexer.py'
   needs: 'text_encoder'

snaaz21 commented 2 years ago

executors.py

import librosa as lr
import numpy as np
from collections import defaultdict

from jina import Document, DocumentArray, Executor, requests

class AudioSegmenter(Executor):
    def __init__(self, window_size: float = 1, stride: float = 1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.window_size = window_size  # seconds
        self.stride = stride

    @requests(on=['/index', '/search'])
    def segment(self, docs: DocumentArray, **kwargs):
        for idx, doc in enumerate(docs):
            try:
                doc.blob, sample_rate = lr.load(doc.uri, sr=16000)
            except RuntimeError as e:
                print(f'failed to load {doc.uri}, {e}')
                continue
            doc.tags['sample_rate'] = sample_rate
            chunk_size = int(self.window_size * sample_rate)
            stride_size = int(self.stride * sample_rate)
            num_chunks = max(1, int((doc.blob.shape[0] - chunk_size) / stride_size))
            for chunk_id in range(num_chunks):
                beg = chunk_id * stride_size
                end = beg + chunk_size
                if beg > doc.blob.shape[0]:
                    break
                c = Document(
                    blob=doc.blob[beg:end],
                    offset=idx,
                    location=[beg, end],
                    tags=doc.tags,
                    uri=doc.uri
                )
                c.tags['beg_in_ms'] = beg / sample_rate * 1000
                c.tags['end_in_ms'] = end / sample_rate * 1000
                doc.chunks.append(c)

class MyRanker(Executor):
    @requests(on='/search')
    def rank(self, docs: DocumentArray = None, **kwargs):
        for doc in docs.traverse_flat('r',):
            parents_scores = defaultdict(list)
            parents_match = defaultdict(list)
            # print('*'*8)
            # print('doc: ', doc.tags['beg_in_ms'])
            # print('*' * 8)
            for m in DocumentArray([doc]).traverse_flat('m'):
                parents_scores[m.parent_id].append(m.scores['cosine'].value)
                parents_match[m.parent_id].append(m)
            # Aggregate match scores for parent document and
            # create doc's match based on parent document of matched chunks
            new_matches = []
            for match_parent_id, scores in parents_scores.items():
                #print('scores: ', scores)
                score_id = np.argmin(scores)
                score = scores[score_id]
                match = parents_match[match_parent_id][score_id]
                #print(f'match.uri: , {match.uri}, match.text: {match.text}')
                new_match = Document(
                    uri=match.uri,
                    id=match_parent_id,
                    text=match.text,
                    scores={'cosine': score})
                # new_match.tags['beg_in_ms'] = match.tags['beg_in_ms']
                # new_match.tags['end_in_ms'] = match.tags['end_in_ms']
                new_matches.append(new_match)
                #print(new_match)
            # Sort the matches
            doc.matches = new_matches
            doc.matches.sort(key=lambda d: d.scores['cosine'].value)

JoanFM commented 2 years ago

The most obvious thing that comes to mind is that you are segmenting your query, but you are not then using the segmented audio to encode or search. You would need to adapt the traversal_paths of AudioCLIPEncoder, SimpleIndexer, and the ranker for this.
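
To make the chunk-level idea concrete, here is a minimal sketch in plain NumPy, not Jina code. It assumes each audio chunk and each text label already has an embedding; the function name `rank_labels_by_chunks` and the 3-dimensional vectors are hypothetical and only for illustration. Every chunk is compared with every label, the smallest cosine distance per label is kept (the same idea as the `np.argmin` in MyRanker), and the labels are sorted from closest to farthest.

```python
import numpy as np


def rank_labels_by_chunks(chunk_embs, label_embs, labels):
    """Rank labels by their best (smallest) cosine distance to any chunk."""
    # L2-normalise rows so that a dot product equals the cosine similarity.
    chunk_embs = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    # distances[i, j] = cosine distance between chunk i and label j.
    distances = 1.0 - chunk_embs @ label_embs.T
    # Aggregate over chunks: keep the closest chunk for every label.
    best = distances.min(axis=0)
    order = np.argsort(best)
    return [(labels[i], float(best[i])) for i in order]


# Hypothetical embeddings, purely illustrative (not AudioCLIP outputs).
labels = ['cat', 'thunderstorm']
label_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
chunk_embs = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.2]])

ranking = rank_labels_by_chunks(chunk_embs, label_embs, labels)
for label, dist in ranking:
    print(f'{label}: {dist:.4f}')
```

If the encoder and indexer only ever see the root document, there is a single embedding per query and this aggregation step has nothing to aggregate.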

snaaz21 commented 2 years ago

I got it. Sorry, I commented out "traversal_rdarray" and "traversal_ldarray" in the flow-search.yml above by mistake, but I did use them. I had to set both to "r", since it was not working with "r", "c" or "c", "c".

snaaz21 commented 2 years ago

I also tried searching without segmenting the audio, by removing the segmenting logic from class AudioSegmenter(Executor) in executors.py, like this:

class AudioSegmenter(Executor):
    def __init__(self, window_size: float = 1, stride: float = 1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.window_size = window_size  # seconds
        self.stride = stride

    @requests(on=['/index', '/search'])
    def segment(self, docs: DocumentArray, **kwargs):
        for idx, doc in enumerate(docs):
            try:
                doc.blob, sample_rate = lr.load(doc.uri, sr=16000)
            except RuntimeError as e:
                print(f'failed to load {doc.uri}, {e}')
                continue
            doc.tags['sample_rate'] = sample_rate

still getting same output.

input_audio: /home/saima/Projects/R_And_D/AudioCLIP/demo/audio/thunder_3-144891-B-19.wav, len-of-chunks:0
 +- id: adc9d55c738611ec9a1b293f02f6f756 score: 0.931110, cat

input_audio: /home/saima/Projects/R_And_D/AudioCLIP/demo/audio/coughing_1-58792-A-24.wav, len-of-chunks:0
 +- id: adc9e380738611ec9a1b293f02f6f756 score: 0.923869, cat

input_audio: /home/saima/Projects/R_And_D/AudioCLIP/demo/audio/cat_3-95694-A-5.wav, len-of-chunks:0
 +- id: adc9ebaa738611ec9a1b293f02f6f756 score: 0.845475, cat

input_audio: /home/saima/Projects/R_And_D/AudioCLIP/demo/audio/car_horn_1-24074-A-43.wav, len-of-chunks:0
 +- id: adc9f302738611ec9a1b293f02f6f756 score: 0.928283, cat

input_audio: /home/saima/Projects/R_And_D/AudioCLIP/demo/audio/alarm_clock_3-120526-B-37.wav, len-of-chunks:0
 +- id: adc9fabe738611ec9a1b293f02f6f756 score: 0.919222, alarm_clock

JoanFM commented 2 years ago

What did you change in the executors? Right now there is too much information. Can we get a clear view of what was run (code, flows, and parameters), the output, and why it is wrong?

JoanFM commented 2 years ago

Hey @snaaz21 ,

is the issue solved?

snaaz21 commented 2 years ago

No, it was closed by mistake.

snaaz21 commented 2 years ago

I also tried searching without segmenting the audio, by removing the segmenting logic from class AudioSegmenter(Executor) in executors.py, like this:

class AudioSegmenter(Executor):
    def __init__(self, window_size: float = 1, stride: float = 1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.window_size = window_size  # seconds
        self.stride = stride

    @requests(on=['/index', '/search'])
    def segment(self, docs: DocumentArray, **kwargs):
        for idx, doc in enumerate(docs):
            try:
                doc.blob, sample_rate = lr.load(doc.uri, sr=16000)
            except RuntimeError as e:
                print(f'failed to load {doc.uri}, {e}')
                continue
            doc.tags['sample_rate'] = sample_rate

still getting same output.

input_audio: /home/saima/Projects/R_And_D/AudioCLIP/demo/audio/thunder_3-144891-B-19.wav, len-of-chunks:0
 +- id: adc9d55c738611ec9a1b293f02f6f756 score: 0.931110, cat

input_audio: /home/saima/Projects/R_And_D/AudioCLIP/demo/audio/coughing_1-58792-A-24.wav, len-of-chunks:0
 +- id: adc9e380738611ec9a1b293f02f6f756 score: 0.923869, cat

input_audio: /home/saima/Projects/R_And_D/AudioCLIP/demo/audio/cat_3-95694-A-5.wav, len-of-chunks:0
 +- id: adc9ebaa738611ec9a1b293f02f6f756 score: 0.845475, cat

input_audio: /home/saima/Projects/R_And_D/AudioCLIP/demo/audio/car_horn_1-24074-A-43.wav, len-of-chunks:0
 +- id: adc9f302738611ec9a1b293f02f6f756 score: 0.928283, cat

input_audio: /home/saima/Projects/R_And_D/AudioCLIP/demo/audio/alarm_clock_3-120526-B-37.wav, len-of-chunks:0
 +- id: adc9fabe738611ec9a1b293f02f6f756 score: 0.919222, alarm_clock

I edited this comment; please check it once more.

JoanFM commented 2 years ago

With the setup you have right now, the ranker is no longer needed.

It seems you can simply remove it from the search flow. This ranker makes no sense if r and m are used as traversal paths.

snaaz21 commented 2 years ago

Thank you for the reply.

What should I do for matching or ranking the scores?

JoanFM commented 2 years ago

If the match happens at root level, there is no need to rank again; the indexer will return the matches sorted by similarity.

snaaz21 commented 2 years ago

OK, I am not using MyRanker now.

I simply printed the result, as below:

input_audio: AudioCLIP/demo/audio/thunder_3-144891-B-19.wav
score: 0.931110, cat
score: 0.937309, car_horn
score: 0.938694, thunderstorm
score: 0.947851, alarm_clock
score: 0.950527, coughing

input_audio: AudioCLIP/demo/audio/coughing_1-58792-A-24.wav
score: 0.923869, cat
score: 0.934698, car_horn
score: 0.937990, coughing
score: 0.942766, alarm_clock
score: 0.943704, thunderstorm

input_audio: AudioCLIP/demo/audio/cat_3-95694-A-5.wav
score: 0.845475, cat
score: 0.967198, car_horn
score: 0.981880, thunderstorm
score: 0.994282, alarm_clock
score: 1.016039, coughing

input_audio: AudioCLIP/demo/audio/car_horn_1-24074-A-43.wav
score: 0.928283, cat
score: 0.928477, car_horn
score: 0.938940, alarm_clock
score: 0.945670, thunderstorm
score: 0.950001, coughing

input_audio: AudioCLIP/demo/audio/alarm_clock_3-120526-B-37.wav
score: 0.919222, alarm_clock
score: 0.929852, cat
score: 0.932463, car_horn
score: 0.945655, thunderstorm
score: 0.949792, coughing

As you can see above, the results are not very accurate for the input audio files, right? Did I do something wrong while indexing, or somewhere else?
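
One way to put a number on "not very accurate" is to look at where the expected label lands in each ranked list. The sketch below copies the scores from the output above and treats them as distances, so lower is better. The mapping from each file name to its expected label (for example, the thunder clip to `thunderstorm`) is an assumption made for illustration, as are the variable names.

```python
# Scores copied from the output above, keyed by the label each query
# is assumed to belong to (lower score = closer match).
results = {
    'thunderstorm': [(0.931110, 'cat'), (0.937309, 'car_horn'),
                     (0.938694, 'thunderstorm'), (0.947851, 'alarm_clock'),
                     (0.950527, 'coughing')],
    'coughing': [(0.923869, 'cat'), (0.934698, 'car_horn'),
                 (0.937990, 'coughing'), (0.942766, 'alarm_clock'),
                 (0.943704, 'thunderstorm')],
    'cat': [(0.845475, 'cat'), (0.967198, 'car_horn'),
            (0.981880, 'thunderstorm'), (0.994282, 'alarm_clock'),
            (1.016039, 'coughing')],
    'car_horn': [(0.928283, 'cat'), (0.928477, 'car_horn'),
                 (0.938940, 'alarm_clock'), (0.945670, 'thunderstorm'),
                 (0.950001, 'coughing')],
    'alarm_clock': [(0.919222, 'alarm_clock'), (0.929852, 'cat'),
                    (0.932463, 'car_horn'), (0.945655, 'thunderstorm'),
                    (0.949792, 'coughing')],
}

ranks = {}
for expected, matches in results.items():
    ordered = [label for _, label in sorted(matches)]
    ranks[expected] = ordered.index(expected) + 1  # 1 = top match

top1 = sum(1 for rank in ranks.values() if rank == 1)
mean_reciprocal_rank = sum(1.0 / rank for rank in ranks.values()) / len(ranks)

print(ranks)
print(f'top-1 correct: {top1} of {len(ranks)}')
print(f'mean reciprocal rank: {mean_reciprocal_rank:.3f}')
```

With these numbers the expected label is the top match for 2 of the 5 queries, and `cat` is the top match for three of the other queries' clips, which suggests the ordering is driven by very small score differences.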

JoanFM commented 2 years ago

So just to make sure, here u are not applying any segmentation right?

JoanFM commented 2 years ago

I think the results are just a matter of the model not performing as well as you expect.

snaaz21 commented 2 years ago

So just to make sure, here u are not applying any segmentation right?

yes, not applying segmentation

JoanFM commented 2 years ago

Then this is a problem with the encoder.

You can check whether applying segmentation works better.

snaaz21 commented 2 years ago

I get the same result with segmentation applied as well.

JoanFM commented 2 years ago

Also maybe is interesting to split car_horn into car horn and alarm_clock into alarm clock? I am not sure if AudioCLIPTextEncoder would otherwise tokenize the sentences properly

snaaz21 commented 2 years ago

Also maybe is interesting to split car_horn into car horn and alarm_clock into alarm clock? I am not sure if AudioCLIPTextEncoder would otherwise tokenize the sentences properly

ok, I'll try this way too.

snaaz21 commented 2 years ago

Also maybe is interesting to split car_horn into car horn and alarm_clock into alarm clock? I am not sure if AudioCLIPTextEncoder would otherwise tokenize the sentences properly

ok, I'll try this way too.

Here is the result:

input_audio: AudioCLIP/demo/audio/thunder_3-144891-B-19.wav
score: 0.930757, car horn
score: 0.931110, cat
score: 0.938694, thunderstorm
score: 0.950527, coughing
score: 0.955934, alarm clock

input_audio: AudioCLIP/demo/audio/coughing_1-58792-A-24.wav
score: 0.923869, cat
score: 0.931542, car horn
score: 0.937990, coughing
score: 0.943704, thunderstorm
score: 0.943861, alarm clock

input_audio: AudioCLIP/demo/audio/cat_3-95694-A-5.wav
score: 0.845475, cat
score: 0.968182, car horn
score: 0.981880, thunderstorm
score: 1.000134, alarm clock
score: 1.016039, coughing

input_audio: AudioCLIP/demo/audio/car_horn_1-24074-A-43.wav
score: 0.923857, car horn
score: 0.928283, cat
score: 0.945670, thunderstorm
score: 0.949575, alarm clock
score: 0.950001, coughing

input_audio: AudioCLIP/demo/audio/alarm_clock_3-120526-B-37.wav
score: 0.929285, car horn
score: 0.929852, cat
score: 0.931609, alarm clock
score: 0.945655, thunderstorm
score: 0.949792, coughing

JoanFM commented 2 years ago

What is weird is that all the cosine distances are so large.
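
For reference, a small sketch of why values around 0.93 count as large. It assumes the `cosine` score is a cosine distance, defined as one minus the cosine similarity (the definition `scipy.spatial.distance.cosine` uses); the helper names are illustrative only.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity, so it ranges from 0 to 2."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def similarity_from_distance(distance):
    """Invert the definition above to recover the cosine similarity."""
    return 1.0 - distance


same_direction = cosine_distance([1, 0], [2, 0])    # vectors point the same way
orthogonal = cosine_distance([1, 0], [0, 3])        # vectors are unrelated
opposite = cosine_distance([1, 0], [-1, 0])         # vectors point opposite ways

print(same_direction, orthogonal, opposite)
print(similarity_from_distance(0.931110))  # a score reported in this thread
print(similarity_from_distance(1.016039))  # the largest score reported
```

Under that assumption, a distance of 0.93 corresponds to a similarity of roughly 0.07, which is close to orthogonal, and the 1.016039 value corresponds to a slightly negative similarity.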

JoanFM commented 2 years ago

What is simple_indexer.py?

JoanFM commented 2 years ago

Can you maybe share your code as a zip so that I can easily reproduce this?

JoanFM commented 2 years ago

or have a github repo from urself where i can easily clone and reproduce?

snaaz21 commented 2 years ago

What is weird is that all the cosine distances are so large.

yeah

snaaz21 commented 2 years ago

What is simple_indexer.py?

https://github.com/jina-ai/executor-simpleindexer/blob/v0.10/executor.py

snaaz21 commented 2 years ago

or have a github repo from urself where i can easily clone and reproduce?

I don't have a repo, so I am uploading a zip, attached below: Text-By-Audio.zip

JoanFM commented 2 years ago

By adding these lines to check_query, you will see that the scoring seems to be working correctly:

def check_query(resp):
    from scipy.spatial import distance
    for d in resp.docs:
        print(f'input_audio: {d.uri}, len-of-chunks:{len(d.chunks)}')
        query_embedding = d.embedding
        for m in d.matches:
            print(f'score: {m.scores["cosine"].value:.6f}, {m.text}')
            match_embedding = m.embedding
            print(f' distance {distance.cosine(query_embedding, match_embedding)}')

JoanFM commented 2 years ago

So the problem is that the encoder does not seem to work well for this. Maybe you need to start using the segmentation properly.

snaaz21 commented 2 years ago

So the problem is that the encoder does not seem to work well for this. Maybe you need to start using the segmentation properly.

Oh OK. I didn't get it: what does using the segmentation "properly" mean?

JoanFM commented 2 years ago

exactly, u are just encoding the complete audio in one vector

snaaz21 commented 2 years ago

exactly, u are just encoding the complete audio in one vector

Yes, but I get the same result for segmented audio as well.

JoanFM commented 2 years ago

How did you change the Flow to account for this?

snaaz21 commented 2 years ago

When I removed the segmenter from the flow, it asked for a sample rate value.

So I did not remove the segmenter from flow-search.yml; instead, I removed the segmenting logic from the AudioSegmenter class, as below, where I only pass the sample rate along with the audio blob.

class AudioSegmenter(Executor):
    def __init__(self, window_size: float = 1, stride: float = 1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.window_size = window_size  # seconds
        self.stride = stride

    @requests(on=['/index', '/search'])
    def segment(self, docs: DocumentArray, **kwargs):
        for idx, doc in enumerate(docs):
            try:
                doc.blob, sample_rate = lr.load(doc.uri, sr=16000)
            except RuntimeError as e:
                print(f'failed to load {doc.uri}, {e}')
                continue
            doc.tags['sample_rate'] = sample_rate

snaaz21 commented 2 years ago

When I removed the segmenter from the flow, it asked for a sample rate value.

So I did not remove the segmenter from flow-search.yml; instead, I removed the segmenting logic from the AudioSegmenter class, as below, where I only pass the sample rate along with the audio blob.

class AudioSegmenter(Executor):
    def __init__(self, window_size: float = 1, stride: float = 1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.window_size = window_size  # seconds
        self.stride = stride

    @requests(on=['/index', '/search'])
    def segment(self, docs: DocumentArray, **kwargs):
        for idx, doc in enumerate(docs):
            try:
                doc.blob, sample_rate = lr.load(doc.uri, sr=16000)
            except RuntimeError as e:
                print(f'failed to load {doc.uri}, {e}')
                continue
            doc.tags['sample_rate'] = sample_rate

This is the version without audio segmentation.

snaaz21 commented 2 years ago

For segmenting, I was also using the Ranker.

JoanFM commented 2 years ago

When I removed the segmenter from the flow, it asked for a sample rate value.

So I did not remove the segmenter from flow-search.yml; instead, I removed the segmenting logic from the AudioSegmenter class, as below, where I only pass the sample rate along with the audio blob.

class AudioSegmenter(Executor):
    def __init__(self, window_size: float = 1, stride: float = 1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.window_size = window_size  # seconds
        self.stride = stride

    @requests(on=['/index', '/search'])
    def segment(self, docs: DocumentArray, **kwargs):
        for idx, doc in enumerate(docs):
            try:
                doc.blob, sample_rate = lr.load(doc.uri, sr=16000)
            except RuntimeError as e:
                print(f'failed to load {doc.uri}, {e}')
                continue
            doc.tags['sample_rate'] = sample_rate

This comes from the fact that the segmenter does two things: loading the audio and segmenting it. But in your Flow you are only using the loaded audio; the chunks are not used for anything.
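
To see the segmenting half in isolation, here is a sketch that reuses the window arithmetic from the AudioSegmenter in executors.py on a plain NumPy array, so it runs without librosa or Jina. The function name `segment_waveform` and the silent test signals are made up for illustration; the 16 kHz sample rate and the `window_size: 4`, `stride: 2` settings come from the code and flow-search.yml posted above.

```python
import numpy as np


def segment_waveform(waveform, sample_rate, window_size, stride):
    """Cut a 1-D waveform into windows, using the same arithmetic as the
    AudioSegmenter posted earlier in this thread."""
    chunk_size = int(window_size * sample_rate)
    stride_size = int(stride * sample_rate)
    num_chunks = max(1, int((waveform.shape[0] - chunk_size) / stride_size))
    chunks = []
    for chunk_id in range(num_chunks):
        beg = chunk_id * stride_size
        end = beg + chunk_size
        if beg > waveform.shape[0]:
            break
        chunks.append(waveform[beg:end])
    return chunks


sample_rate = 16000
five_seconds = np.zeros(5 * sample_rate, dtype=np.float32)
twelve_seconds = np.zeros(12 * sample_rate, dtype=np.float32)

short_chunks = segment_waveform(five_seconds, sample_rate, window_size=4, stride=2)
long_chunks = segment_waveform(twelve_seconds, sample_rate, window_size=4, stride=2)

print(len(short_chunks), len(long_chunks))
```

With these settings a 5 second clip produces a single chunk that covers only its first 4 seconds, which is consistent with the `len-of-chunks:1` lines in the first output. For clips this short, segmentation with a 4 second window cannot add much beyond the root-level embedding.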

JoanFM commented 2 years ago

Hey @snaaz21 ,

Have you been able to fix this?

snaaz21 commented 2 years ago

Hey @snaaz21 ,

Have you been able to fix this?

No, I haven't.

I am using very short audio clips, about 5-6 seconds long, so I am not chunking them. I am confused about why the retrieval is not more accurate. It should retrieve correctly, right?

JoanFM commented 2 years ago

It depends on the quality of the encoder. I am not sure what it was designed for. You may want to check AudioCLIP paper to find out more?

snaaz21 commented 2 years ago

Yes, it could be the encoder, because I also tried image retrieval from audio and ran into the same thing. The output is below:

input_audio: AudioCLIP/demo/audio/thunder_3-144891-B-19.wav, len-of-chunks:0
score: 0.849588, AudioCLIP/demo/images/clock_1.jpg
score: 0.849691, AudioCLIP/demo/images/cars_2.jpg
score: 0.859069, AudioCLIP/demo/images/cat_1.jpg
score: 0.866946, AudioCLIP/demo/images/coughing_1.jpg
score: 0.867610, AudioCLIP/demo/images/lightning_2.jpg
score: 0.882231, AudioCLIP/demo/images/cat_2.jpg
score: 0.893308, AudioCLIP/demo/images/clock_2.jpg
score: 0.896258, AudioCLIP/demo/images/lightning_1.jpg
score: 0.900182, AudioCLIP/demo/images/coughing_2.jpg
score: 0.907948, AudioCLIP/demo/images/cars_1.jpg

input_audio: AudioCLIP/demo/audio/coughing_1-58792-A-24.wav, len-of-chunks:0
score: 0.829560, AudioCLIP/demo/images/cars_2.jpg
score: 0.842407, AudioCLIP/demo/images/cat_1.jpg
score: 0.846685, AudioCLIP/demo/images/clock_1.jpg
score: 0.846998, AudioCLIP/demo/images/lightning_2.jpg
score: 0.859228, AudioCLIP/demo/images/cat_2.jpg
score: 0.866479, AudioCLIP/demo/images/coughing_1.jpg
score: 0.869724, AudioCLIP/demo/images/clock_2.jpg
score: 0.870975, AudioCLIP/demo/images/lightning_1.jpg
score: 0.882659, AudioCLIP/demo/images/coughing_2.jpg
score: 0.888113, AudioCLIP/demo/images/cars_1.jpg

input_audio: AudioCLIP/demo/audio/cat_3-95694-A-5.wav, len-of-chunks:0
score: 0.714194, AudioCLIP/demo/images/cat_1.jpg
score: 0.736080, AudioCLIP/demo/images/cat_2.jpg
score: 0.830958, AudioCLIP/demo/images/clock_1.jpg
score: 0.842192, AudioCLIP/demo/images/coughing_1.jpg
score: 0.864808, AudioCLIP/demo/images/lightning_2.jpg
score: 0.879201, AudioCLIP/demo/images/cars_2.jpg
score: 0.891341, AudioCLIP/demo/images/lightning_1.jpg
score: 0.894358, AudioCLIP/demo/images/clock_2.jpg
score: 0.913241, AudioCLIP/demo/images/coughing_2.jpg
score: 0.916362, AudioCLIP/demo/images/cars_1.jpg

input_audio: AudioCLIP/demo/audio/car_horn_1-24074-A-43.wav, len-of-chunks:0
score: 0.830208, AudioCLIP/demo/images/cat_1.jpg
score: 0.833510, AudioCLIP/demo/images/cars_2.jpg
score: 0.840375, AudioCLIP/demo/images/clock_1.jpg
score: 0.845739, AudioCLIP/demo/images/coughing_1.jpg
score: 0.850994, AudioCLIP/demo/images/cat_2.jpg
score: 0.854412, AudioCLIP/demo/images/lightning_2.jpg
score: 0.874402, AudioCLIP/demo/images/clock_2.jpg
score: 0.876907, AudioCLIP/demo/images/coughing_2.jpg
score: 0.884458, AudioCLIP/demo/images/lightning_1.jpg
score: 0.887085, AudioCLIP/demo/images/cars_1.jpg

input_audio: AudioCLIP/demo/audio/alarm_clock_3-120526-B-37.wav, len-of-chunks:0
score: 0.812014, AudioCLIP/demo/images/clock_1.jpg
score: 0.832325, AudioCLIP/demo/images/cars_2.jpg
score: 0.835040, AudioCLIP/demo/images/coughing_1.jpg
score: 0.835898, AudioCLIP/demo/images/cat_1.jpg
score: 0.849695, AudioCLIP/demo/images/cat_2.jpg
score: 0.859725, AudioCLIP/demo/images/lightning_2.jpg
score: 0.864214, AudioCLIP/demo/images/clock_2.jpg
score: 0.873641, AudioCLIP/demo/images/coughing_2.jpg
score: 0.879688, AudioCLIP/demo/images/lightning_1.jpg
score: 0.889060, AudioCLIP/demo/images/cars_1.jpg

snaaz21 commented 2 years ago

It depends on the quality of the encoder. I am not sure what it was designed for. You may want to check AudioCLIP paper to find out more?

Ok, I'll try with the paper.

And please guide me more on it. I'll update you once I get something from paper.

Thank you

JoanFM commented 2 years ago

We will try to keep an eye out for anything in Jina that might be affecting this, but I think the problem is in the encoder itself.