Closed snaaz21 closed 2 years ago
Why do you consider these outputs not correct? What are the expected results and scores?
By "not correct" I mean that it is not retrieving the expected ordering: for "coughing.wav" it should return the text "coughing" first, and so on. The scores above are the minimum cosine similarity scores.
I am new to this. Please correct me where I went wrong.
Can you paste the exact code you ran and your modified content?
app.py
from jina.types.document.generators import from_files
from jina import DocumentArray, Flow, Document


def check_query(resp):
    for d in resp.docs:
        print(f'input_audio: {d.uri}, len-of-chunks:{len(d.chunks)}')
        for m in d.matches:
            print(f' +- id: {m.id} score: {m.scores["cosine"].value:.6f}, {m.text}')


def main():
    doc_text = DocumentArray([
        Document(text='car_horn'),
        Document(text='cat'),
        Document(text='alarm_clock'),
        Document(text='coughing'),
        Document(text='thunderstorm'),
    ])
    docs_test = DocumentArray(from_files('AudioCLIP/demo/audio/*.wav'))
    f = Flow.load_config('flow-index.yml')
    with f:
        f.post(on='/index', inputs=doc_text)
        print('Indexing completed...')
        # f.block()
    f = Flow.load_config('flow-search.yml')
    with f:
        print('Starting searching...')
        f.post(on='/search', inputs=docs_test, on_done=check_query)
        f.protocol = 'http'
        f.cors = True
        f.block()


if __name__ == '__main__':
    main()
flow-search.yml
jtype: Flow
version: '1'
#with:
#  port_expose: 45678
executors:
  - name: 'audio_segmenter'
    uses: AudioSegmenter
    uses_with:
      window_size: 4
      stride: 2
    py_modules:
      - executors.py
  - name: 'audio_encoder'
    uses: AudioCLIPEncoder
    py_module : 'AudioCLIPEncoder/executor/audio_clip_encoder.py'
    uses_with:
      traversal_paths: 'r'
    needs: 'audio_segmenter'
  - name: 'audio_indexer'
    uses: SimpleIndexer
    py_modules: 'simple_indexer.py'
    # uses_with:
    #   match_args:
    #     limit: 60
    #     traversal_rdarray: 'r'
    #     traversal_ldarray: 'r'
    needs: 'audio_encoder'
  - name: 'ranker'
    uses: MyRanker
    py_modules:
      - executors.py
And what about the flow-index?
flow-index.yml
version: '1'
executors:
  - name: 'text_encoder'
    uses: AudioCLIPTextEncoder
    py_module : 'AudioClipTextEncoder/executor/audioclip_text.py'
    uses_with:
      traversal_paths: 'r'
  - name: 'text_indexer'
    uses: SimpleIndexer
    py_module : 'simple_indexer.py'
    needs: 'text_encoder'
executors.py
import librosa as lr
import numpy as np
from collections import defaultdict
from jina import Document, DocumentArray, Executor, requests


class AudioSegmenter(Executor):
    def __init__(self, window_size: float = 1, stride: float = 1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.window_size = window_size  # seconds
        self.stride = stride

    @requests(on=['/index', '/search'])
    def segment(self, docs: DocumentArray, **kwargs):
        for idx, doc in enumerate(docs):
            try:
                doc.blob, sample_rate = lr.load(doc.uri, sr=16000)
            except RuntimeError as e:
                print(f'failed to load {doc.uri}, {e}')
                continue
            doc.tags['sample_rate'] = sample_rate
            chunk_size = int(self.window_size * sample_rate)
            stride_size = int(self.stride * sample_rate)
            num_chunks = max(1, int((doc.blob.shape[0] - chunk_size) / stride_size))
            for chunk_id in range(num_chunks):
                beg = chunk_id * stride_size
                end = beg + chunk_size
                if beg > doc.blob.shape[0]:
                    break
                c = Document(
                    blob=doc.blob[beg:end],
                    offset=idx,
                    location=[beg, end],
                    tags=doc.tags,
                    uri=doc.uri
                )
                c.tags['beg_in_ms'] = beg / sample_rate * 1000
                c.tags['end_in_ms'] = end / sample_rate * 1000
                doc.chunks.append(c)


class MyRanker(Executor):
    @requests(on='/search')
    def rank(self, docs: DocumentArray = None, **kwargs):
        for doc in docs.traverse_flat('r',):
            parents_scores = defaultdict(list)
            parents_match = defaultdict(list)
            # print('*'*8)
            # print('doc: ', doc.tags['beg_in_ms'])
            # print('*' * 8)
            for m in DocumentArray([doc]).traverse_flat('m'):
                parents_scores[m.parent_id].append(m.scores['cosine'].value)
                parents_match[m.parent_id].append(m)
            # Aggregate match scores for parent document and
            # create doc's match based on parent document of matched chunks
            new_matches = []
            for match_parent_id, scores in parents_scores.items():
                # print('scores: ', scores)
                score_id = np.argmin(scores)
                score = scores[score_id]
                match = parents_match[match_parent_id][score_id]
                # print(f'match.uri: , {match.uri}, match.text: {match.text}')
                new_match = Document(
                    uri=match.uri,
                    id=match_parent_id,
                    text=match.text,
                    scores={'cosine': score})
                # new_match.tags['beg_in_ms'] = match.tags['beg_in_ms']
                # new_match.tags['end_in_ms'] = match.tags['end_in_ms']
                new_matches.append(new_match)
                # print(new_match)
            # Sort the matches
            doc.matches = new_matches
            doc.matches.sort(key=lambda d: d.scores['cosine'].value)
The most obvious thing that comes to mind is that you are segmenting your query, but then you are not using the segmented audio to encode or search. You would need to adapt the traversal_paths of AudioCLIPEncoder, SimpleIndexer, and the Ranker for this.
I got it. Sorry, I commented out "traversal_rdarray" and "traversal_ldarray" in the flow-search.yml above by mistake. I did use them, and I had to set both to "r" because it was not working with "r","c" or "c","c".
I also tried searching without segmenting the audio, by removing the segmenting code from "class AudioSegmenter(Executor)" in executors.py, like this:
class AudioSegmenter(Executor):
    def __init__(self, window_size: float = 1, stride: float = 1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.window_size = window_size  # seconds
        self.stride = stride

    @requests(on=['/index', '/search'])
    def segment(self, docs: DocumentArray, **kwargs):
        for idx, doc in enumerate(docs):
            try:
                doc.blob, sample_rate = lr.load(doc.uri, sr=16000)
            except RuntimeError as e:
                print(f'failed to load {doc.uri}, {e}')
                continue
            doc.tags['sample_rate'] = sample_rate
I am still getting the same output.
input_audio: /home/saima/Projects/R_And_D/AudioCLIP/demo/audio/thunder_3-144891-B-19.wav, len-of-chunks:0
 +- id: adc9d55c738611ec9a1b293f02f6f756 score: 0.931110, cat
input_audio: /home/saima/Projects/R_And_D/AudioCLIP/demo/audio/coughing_1-58792-A-24.wav, len-of-chunks:0
 +- id: adc9e380738611ec9a1b293f02f6f756 score: 0.923869, cat
input_audio: /home/saima/Projects/R_And_D/AudioCLIP/demo/audio/cat_3-95694-A-5.wav, len-of-chunks:0
 +- id: adc9ebaa738611ec9a1b293f02f6f756 score: 0.845475, cat
input_audio: /home/saima/Projects/R_And_D/AudioCLIP/demo/audio/car_horn_1-24074-A-43.wav, len-of-chunks:0
 +- id: adc9f302738611ec9a1b293f02f6f756 score: 0.928283, cat
input_audio: /home/saima/Projects/R_And_D/AudioCLIP/demo/audio/alarm_clock_3-120526-B-37.wav, len-of-chunks:0
 +- id: adc9fabe738611ec9a1b293f02f6f756 score: 0.919222, alarm_clock
What changed in the executors? Right now there is too much info. Can we get a clear view of what has been run (code, flows, and params), the output, and why it is wrong?
Hey @snaaz21,
is the issue solved?
No, it was closed by mistake.
I edited my comment above; please check it once.
With the Ranker you have right now, the ranker is not needed anymore. You can simply remove it from the search flow, it seems. This ranker makes no sense if r and m are used as traversal_path.
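To illustrate the point, here is a plain-Python sketch (no Jina required) of the aggregation step in MyRanker, which keeps only the lowest-scoring match per parent_id. It assumes, purely for illustration, that root-level text matches all carry the same empty parent_id. The dictionaries and the helper name best_match_per_parent are hypothetical stand-ins for Jina Documents, and the scores are three of the values reported for the thunder clip in this thread.

```python
def best_match_per_parent(matches):
    """Keep the lowest-distance match for each parent_id, sorted ascending."""
    best = {}
    for match in matches:
        parent = match['parent_id']
        if parent not in best or match['score'] < best[parent]['score']:
            best[parent] = match
    return sorted(best.values(), key=lambda match: match['score'])


# Assumption: root-level matches have no parent, so parent_id is '' for all.
root_level_matches = [
    {'text': 'cat', 'parent_id': '', 'score': 0.931110},
    {'text': 'car_horn', 'parent_id': '', 'score': 0.937309},
    {'text': 'thunderstorm', 'parent_id': '', 'score': 0.938694},
]

collapsed = best_match_per_parent(root_level_matches)
print([m['text'] for m in collapsed])
```

Under that assumption every match falls into the same group and only the best one survives, which would be consistent with the single match per query seen in the earlier output.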
Thank you for the reply.
What should I do for matching or ranking scores?
If the match happens at root level, there is no need to rank again; the indexer returns the matches sorted by similarity.
Ok, I am not using MyRanker now.
I simply printed the result as below:
input_audio: AudioCLIP/demo/audio/thunder_3-144891-B-19.wav
 score: 0.931110, cat
 score: 0.937309, car_horn
 score: 0.938694, thunderstorm
 score: 0.947851, alarm_clock
 score: 0.950527, coughing
input_audio: AudioCLIP/demo/audio/coughing_1-58792-A-24.wav
 score: 0.923869, cat
 score: 0.934698, car_horn
 score: 0.937990, coughing
 score: 0.942766, alarm_clock
 score: 0.943704, thunderstorm
input_audio: AudioCLIP/demo/audio/cat_3-95694-A-5.wav
 score: 0.845475, cat
 score: 0.967198, car_horn
 score: 0.981880, thunderstorm
 score: 0.994282, alarm_clock
 score: 1.016039, coughing
input_audio: AudioCLIP/demo/audio/car_horn_1-24074-A-43.wav
 score: 0.928283, cat
 score: 0.928477, car_horn
 score: 0.938940, alarm_clock
 score: 0.945670, thunderstorm
 score: 0.950001, coughing
input_audio: AudioCLIP/demo/audio/alarm_clock_3-120526-B-37.wav
 score: 0.919222, alarm_clock
 score: 0.929852, cat
 score: 0.932463, car_horn
 score: 0.945655, thunderstorm
 score: 0.949792, coughing
As you can see above, the results are not very accurate for the input audios, right? Did I do anything wrong while indexing, or somewhere else?
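To put a number on "not very accurate", the rankings printed above can be scored with a few lines of Python. This is only a sketch: the rankings are copied from the output above, while the expected label for each file is an assumption inferred from its file name (for example, thunder maps to thunderstorm).

```python
# Rankings copied from the printed results above (best match first).
rankings = {
    'thunder_3-144891-B-19.wav': ['cat', 'car_horn', 'thunderstorm', 'alarm_clock', 'coughing'],
    'coughing_1-58792-A-24.wav': ['cat', 'car_horn', 'coughing', 'alarm_clock', 'thunderstorm'],
    'cat_3-95694-A-5.wav': ['cat', 'car_horn', 'thunderstorm', 'alarm_clock', 'coughing'],
    'car_horn_1-24074-A-43.wav': ['cat', 'car_horn', 'alarm_clock', 'thunderstorm', 'coughing'],
    'alarm_clock_3-120526-B-37.wav': ['alarm_clock', 'cat', 'car_horn', 'thunderstorm', 'coughing'],
}

# Assumption: the expected label is inferred from each file name.
expected = {
    'thunder_3-144891-B-19.wav': 'thunderstorm',
    'coughing_1-58792-A-24.wav': 'coughing',
    'cat_3-95694-A-5.wav': 'cat',
    'car_horn_1-24074-A-43.wav': 'car_horn',
    'alarm_clock_3-120526-B-37.wav': 'alarm_clock',
}

# 1-based position of the expected label in each ranking.
ranks = {name: labels.index(expected[name]) + 1 for name, labels in rankings.items()}
top1_accuracy = sum(rank == 1 for rank in ranks.values()) / len(ranks)

print(ranks)
print(f'top-1 accuracy: {top1_accuracy:.2f}')
```

Only the cat and alarm_clock clips place the expected label first; the others place it second or third.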
So just to make sure, here you are not applying any segmentation, right?
I think the results are just a matter of the model not performing as well as you expect.
Yes, I am not applying any segmentation here.
Then this is a problem with the encoder.
You can check whether applying segmentation works better.
I get the same result when applying segmentation too.
Also, maybe it is interesting to split car_horn into car horn and alarm_clock into alarm clock? I am not sure if AudioCLIPTextEncoder would otherwise tokenize the sentences properly.
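A minimal sketch of that change, applied to the label list from app.py before the labels are wrapped in Documents. The variable names are illustrative, and whether this actually helps depends on how the text encoder tokenizes its input, which is not verified here.

```python
labels = ['car_horn', 'cat', 'alarm_clock', 'coughing', 'thunderstorm']

# Replace underscores with spaces so each label reads as ordinary words.
text_queries = [label.replace('_', ' ') for label in labels]
print(text_queries)
```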
ok, I'll try this way too.
Here is the result:
input_audio: AudioCLIP/demo/audio/thunder_3-144891-B-19.wav
 score: 0.930757, car horn
 score: 0.931110, cat
 score: 0.938694, thunderstorm
 score: 0.950527, coughing
 score: 0.955934, alarm clock
input_audio: AudioCLIP/demo/audio/coughing_1-58792-A-24.wav
 score: 0.923869, cat
 score: 0.931542, car horn
 score: 0.937990, coughing
 score: 0.943704, thunderstorm
 score: 0.943861, alarm clock
input_audio: AudioCLIP/demo/audio/cat_3-95694-A-5.wav
 score: 0.845475, cat
 score: 0.968182, car horn
 score: 0.981880, thunderstorm
 score: 1.000134, alarm clock
 score: 1.016039, coughing
input_audio: AudioCLIP/demo/audio/car_horn_1-24074-A-43.wav
 score: 0.923857, car horn
 score: 0.928283, cat
 score: 0.945670, thunderstorm
 score: 0.949575, alarm clock
 score: 0.950001, coughing
input_audio: AudioCLIP/demo/audio/alarm_clock_3-120526-B-37.wav
 score: 0.929285, car horn
 score: 0.929852, cat
 score: 0.931609, alarm clock
 score: 0.945655, thunderstorm
 score: 0.949792, coughing
What is weird is that all the cosine distances are so large.
What is simple_indexer.py?
Can you maybe share your code as a zip so that I can easily reproduce this? Or do you have a GitHub repo of your own that I can clone to reproduce it?
Yeah, the cosine distances are all large.
simple_indexer.py is this file: https://github.com/jina-ai/executor-simpleindexer/blob/v0.10/executor.py
I don't have a repo, so I am uploading the code as a zip, attached below: Text-By-Audio.zip
By adding these lines to check_query, you will see that the search itself seems to be working correctly:
def check_query(resp):
    from scipy.spatial import distance
    for d in resp.docs:
        print(f'input_audio: {d.uri}, len-of-chunks:{len(d.chunks)}')
        query_embedding = d.embedding
        for m in d.matches:
            print(f'score: {m.scores["cosine"].value:.6f}, {m.text}')
            match_embedding = m.embedding
            print(f' distance {distance.cosine(query_embedding, match_embedding)}')
So the problem is that the encoder does not seem to work well for this. Maybe you have to start using the segmentation properly.
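For readers unfamiliar with the metric, here is a NumPy-only sketch of cosine distance, the quantity compared in the check above. The vectors are made up for illustration and are not AudioCLIP embeddings. The distance is 0 for vectors pointing the same way, 1 for orthogonal vectors, and 2 for opposite vectors, so values between roughly 0.85 and 1.0 mean the audio and text embeddings are close to orthogonal.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Made-up 2-D vectors, chosen only to show the range of the metric.
query = np.array([1.0, 0.0])
same = cosine_distance(query, [2.0, 0.0])        # same direction
orthogonal = cosine_distance(query, [0.0, 3.0])  # unrelated direction
opposite = cosine_distance(query, [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

Because lower means closer here, sorting the matches in ascending order of score puts the best match first.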
Oh ok. What do you mean by using the segmentation properly? That I am not segmenting properly right now?
Exactly, you are just encoding the complete audio in one vector.
Yeah, but for segmented audio I am also getting the same result.
How do you change the Flow to take this into account?
When I removed the segmenter from the flow, it asked for a sample rate value.
So I did not remove the segmenter from flow-search.yml. Instead, I removed the segmenting code from the AudioSegmenter class as below, where I am just passing the sample rate along with the audio blob.
class AudioSegmenter(Executor):
    def __init__(self, window_size: float = 1, stride: float = 1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.window_size = window_size  # seconds
        self.stride = stride

    @requests(on=['/index', '/search'])
    def segment(self, docs: DocumentArray, **kwargs):
        for idx, doc in enumerate(docs):
            try:
                doc.blob, sample_rate = lr.load(doc.uri, sr=16000)
            except RuntimeError as e:
                print(f'failed to load {doc.uri}, {e}')
                continue
            doc.tags['sample_rate'] = sample_rate
This is the version without segmenting the audio.
When segmenting, I was also using the Ranker.
This comes from the fact that the segmenter does two things: loading the audio and segmenting it. But in your Flow you are only using the loaded audio; the chunks are not used for anything.
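For reference, the chunking arithmetic from AudioSegmenter can be reproduced without Jina or librosa. The sketch below uses the window_size of 4 and stride of 2 from flow-search.yml and assumes a 5-second clip resampled to 16 kHz (the clip length is an assumption; the 10-second clip is hypothetical and only there for comparison). The helper name chunk_bounds is illustrative.

```python
def chunk_bounds(num_samples, sample_rate=16000, window_size=4, stride=2):
    """Return (beg, end) sample offsets using the same arithmetic as AudioSegmenter."""
    chunk_size = int(window_size * sample_rate)
    stride_size = int(stride * sample_rate)
    num_chunks = max(1, int((num_samples - chunk_size) / stride_size))
    bounds = []
    for chunk_id in range(num_chunks):
        beg = chunk_id * stride_size
        if beg > num_samples:
            break
        bounds.append((beg, beg + chunk_size))
    return bounds


# Assumption: a 5-second clip resampled to 16 kHz, i.e. 80000 samples.
five_seconds = chunk_bounds(5 * 16000)
# For comparison, a hypothetical 10-second clip.
ten_seconds = chunk_bounds(10 * 16000)
print(len(five_seconds), len(ten_seconds))
```

With these settings a 5-second clip yields a single chunk that covers only its first 4 seconds, so for clips this short chunk-level search has little to add unless the window and stride are made smaller.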
Hey @snaaz21 ,
Have you been able to fix this?
No, I didn't.
I am using very short audio clips, about 5-6 seconds long, so I am not chunking them. I am confused about why it is not retrieving more accurately. It should retrieve correctly, right?
It depends on the quality of the encoder. I am not sure what it was designed for. You may want to check the AudioCLIP paper to find out more?
Yeah, it could be the encoder, because I also tried image retrieval from audio and faced the same thing. The output is attached below:
input_audio: AudioCLIP/demo/audio/thunder_3-144891-B-19.wav, len-of-chunks:0
 score: 0.849588, AudioCLIP/demo/images/clock_1.jpg
 score: 0.849691, AudioCLIP/demo/images/cars_2.jpg
 score: 0.859069, AudioCLIP/demo/images/cat_1.jpg
 score: 0.866946, AudioCLIP/demo/images/coughing_1.jpg
 score: 0.867610, AudioCLIP/demo/images/lightning_2.jpg
 score: 0.882231, AudioCLIP/demo/images/cat_2.jpg
 score: 0.893308, AudioCLIP/demo/images/clock_2.jpg
 score: 0.896258, AudioCLIP/demo/images/lightning_1.jpg
 score: 0.900182, AudioCLIP/demo/images/coughing_2.jpg
 score: 0.907948, AudioCLIP/demo/images/cars_1.jpg
input_audio: AudioCLIP/demo/audio/coughing_1-58792-A-24.wav, len-of-chunks:0
 score: 0.829560, AudioCLIP/demo/images/cars_2.jpg
 score: 0.842407, AudioCLIP/demo/images/cat_1.jpg
 score: 0.846685, AudioCLIP/demo/images/clock_1.jpg
 score: 0.846998, AudioCLIP/demo/images/lightning_2.jpg
 score: 0.859228, AudioCLIP/demo/images/cat_2.jpg
 score: 0.866479, AudioCLIP/demo/images/coughing_1.jpg
 score: 0.869724, AudioCLIP/demo/images/clock_2.jpg
 score: 0.870975, AudioCLIP/demo/images/lightning_1.jpg
 score: 0.882659, AudioCLIP/demo/images/coughing_2.jpg
 score: 0.888113, AudioCLIP/demo/images/cars_1.jpg
input_audio: AudioCLIP/demo/audio/cat_3-95694-A-5.wav, len-of-chunks:0
 score: 0.714194, AudioCLIP/demo/images/cat_1.jpg
 score: 0.736080, AudioCLIP/demo/images/cat_2.jpg
 score: 0.830958, AudioCLIP/demo/images/clock_1.jpg
 score: 0.842192, AudioCLIP/demo/images/coughing_1.jpg
 score: 0.864808, AudioCLIP/demo/images/lightning_2.jpg
 score: 0.879201, AudioCLIP/demo/images/cars_2.jpg
 score: 0.891341, AudioCLIP/demo/images/lightning_1.jpg
 score: 0.894358, AudioCLIP/demo/images/clock_2.jpg
 score: 0.913241, AudioCLIP/demo/images/coughing_2.jpg
 score: 0.916362, AudioCLIP/demo/images/cars_1.jpg
input_audio: AudioCLIP/demo/audio/car_horn_1-24074-A-43.wav, len-of-chunks:0
 score: 0.830208, AudioCLIP/demo/images/cat_1.jpg
 score: 0.833510, AudioCLIP/demo/images/cars_2.jpg
 score: 0.840375, AudioCLIP/demo/images/clock_1.jpg
 score: 0.845739, AudioCLIP/demo/images/coughing_1.jpg
 score: 0.850994, AudioCLIP/demo/images/cat_2.jpg
 score: 0.854412, AudioCLIP/demo/images/lightning_2.jpg
 score: 0.874402, AudioCLIP/demo/images/clock_2.jpg
 score: 0.876907, AudioCLIP/demo/images/coughing_2.jpg
 score: 0.884458, AudioCLIP/demo/images/lightning_1.jpg
 score: 0.887085, AudioCLIP/demo/images/cars_1.jpg
input_audio: AudioCLIP/demo/audio/alarm_clock_3-120526-B-37.wav, len-of-chunks:0
 score: 0.812014, AudioCLIP/demo/images/clock_1.jpg
 score: 0.832325, AudioCLIP/demo/images/cars_2.jpg
 score: 0.835040, AudioCLIP/demo/images/coughing_1.jpg
 score: 0.835898, AudioCLIP/demo/images/cat_1.jpg
 score: 0.849695, AudioCLIP/demo/images/cat_2.jpg
 score: 0.859725, AudioCLIP/demo/images/lightning_2.jpg
 score: 0.864214, AudioCLIP/demo/images/clock_2.jpg
 score: 0.873641, AudioCLIP/demo/images/coughing_2.jpg
 score: 0.879688, AudioCLIP/demo/images/lightning_1.jpg
 score: 0.889060, AudioCLIP/demo/images/cars_1.jpg
Ok, I'll check the paper.
Please guide me more on this. I'll update you once I find something in the paper.
Thank you
We will try to keep an eye out for anything on the Jina side that might be affecting this, but I think the problem is in the encoder itself.
Hi,
I want to retrieve text by searching with an audio query using the AudioCLIP model.
First, I indexed the text labels (car-horn, coughing, alarm-clock, thunderstorm, etc.), using AudioClipTextEncoder for the embeddings.
After that, I search with an audio file, using AudioClipEncoder for the embedding.
For both text and audio indexing, simple-indexer is used.
While searching with an audio file, I created chunks using the segmenter and used MyRanker for ranking scores, where I modified some of the script.
The output is:
input_audio: AudioCLIP/demo/audio/thunder_3-144891-B-19.wav, len-of-chunks:1
 +- id: e5eceb26737311ec83fd2fdc0fc83b8e score: 0.931110, cat
input_audio: AudioCLIP/demo/audio/coughing_1-58792-A-24.wav, len-of-chunks:1
 +- id: e5ecf968737311ec83fd2fdc0fc83b8e score: 0.923869, cat
input_audio: AudioCLIP/demo/audio/cat_3-95694-A-5.wav, len-of-chunks:1
 +- id: e5ed00de737311ec83fd2fdc0fc83b8e score: 0.845475, cat
input_audio: AudioCLIP/demo/audio/car_horn_1-24074-A-43.wav, len-of-chunks:1
 +- id: e5ed0778737311ec83fd2fdc0fc83b8e score: 0.928283, cat
input_audio: AudioCLIP/demo/audio/alarm_clock_3-120526-B-37.wav, len-of-chunks:1
 +- id: e5ed0dcc737311ec83fd2fdc0fc83b8e score: 0.919222, alarm_clock
As shown above, only 2 of the 5 outputs are correct. Please let me know why that is, since AudioClipEncoder is giving correct outputs for all audios.