marianne-m / brouhaha-vad

Predicts the level of noise and reverberation on your audiofiles
MIT License
137 stars 24 forks

about the output of the apply #23

Open dalaolili opened 2 months ago

dalaolili commented 2 months ago

I got `SPEAKER 2b533492-bfa4-49f3-b01a-32546f6044bf_2 1 3.473 4.134 A` by running the command `python brouhaha/main.py apply`, but I don't understand the meaning of the columns. Can you give me some information about them? Also, how can I read the .npy file? Sincerely waiting for your reply.

LoannPeurey commented 1 month ago

I will try to give some leads, although I did not participate in the development so take my comments with a grain of salt.

For your first question, I suggest you look into the description of the RTTM file format (Annex A, page 12) if you need details about the exact meaning of each column, but as an overview it should be:

- `SPEAKER`: the model identified speech from somebody (all lines here will be `SPEAKER`)
- `2b533492-bfa4-49f3-b01a-32546f6044bf_2`: name of the audio file where speech was found
- `1`: channel (here it should always be 1, as the model works with mono audio)
- `3.473`: timecode in seconds of where the speech was identified
- `4.134`: duration of the speech in seconds
- `A`: label of the speaker; it should always be A here, as I think the model is not trained to differentiate between speakers

So your line tells you speech was detected in file 2b533492-bfa4-49f3-b01a-32546f6044bf_2.wav from time 3.473 s to time 7.607 s (3.473 + 4.134); the rest is not really relevant.
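If it helps, here is a small sketch of parsing such a line yourself; the field names in the dictionary are my own labels, based on the column meanings described above:

```python
# Parse one RTTM output line into a dictionary.
# Field names ("type", "onset", etc.) are illustrative labels,
# not official RTTM terminology.
def parse_rttm_line(line):
    fields = line.split()
    return {
        "type": fields[0],           # e.g. "SPEAKER"
        "file": fields[1],           # audio file identifier
        "channel": int(fields[2]),   # channel index, 1 for mono
        "onset": float(fields[3]),   # start time in seconds
        "duration": float(fields[4]),# duration in seconds
        "speaker": fields[5],        # speaker label, always "A" here
    }

seg = parse_rttm_line("SPEAKER 2b533492-bfa4-49f3-b01a-32546f6044bf_2 1 3.473 4.134 A")
print(f"{seg['onset']:.3f} -> {seg['onset'] + seg['duration']:.3f}")  # 3.473 -> 7.607
```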

You can read .npy files using numpy in Python:

```python
import numpy as np
snr = np.load('detailed_snr_labels/2b533492-bfa4-49f3-b01a-32546f6044bf_2.npy')
```

The content should be SNR values for each frame. Frames have a duration of 16.875 ms: https://github.com/marianne-m/brouhaha-vad/issues/14#issuecomment-1802160502
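To relate the per-frame SNR values back to the RTTM timestamps, a minimal sketch (the 16.875 ms frame step is from the linked comment; the array below is a stand-in for the loaded .npy file):

```python
import numpy as np

# Assumed frame step, taken from issue #14 linked above.
FRAME_STEP_S = 0.016875

# Stand-in for snr = np.load('detailed_snr_labels/...npy')
snr = np.zeros(1000)

# Timestamp (in seconds) of each frame.
times = np.arange(len(snr)) * FRAME_STEP_S

# Mean SNR over the detected speech segment (3.473 s to 7.607 s):
mask = (times >= 3.473) & (times < 7.607)
segment_mean = snr[mask].mean()
```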

dalaolili commented 1 month ago

Got it! Thanks for your reply!
