I am unable to get it running on my machine (CPU)

ninjalu commented 1 year ago

Description

I installed whispering and followed the instructions, but I am not able to get any output. All I get is "No speech", which is clearly not right.

Logs (Optional)

[2022-11-04 15:23:27,443] vad.__call__:56 DEBUG -> VAD: 0.010574953630566597 (threshold=0.5)
[2022-11-04 15:23:27,443] transcriber.transcribe:248 DEBUG -> No speech
[2022-11-04 15:23:27,443] transcriber.transcribe:258 DEBUG -> nosoeech_skip_count: None (<= 16)
[2022-11-04 15:23:27,443] cli.transcribe_from_mic:67 DEBUG -> Audio #: 7, The rest of queue: 0
[2022-11-04 15:23:31,274] cli.transcribe_from_mic:82 DEBUG -> Got. The rest of queue: 0
Analyzing[2022-11-04 15:23:31,275] transcriber.transcribe:235 DEBUG -> 60000
[2022-11-04 15:23:31,310] vad.__call__:56 DEBUG -> VAD: 0.010565487667918205 (threshold=0.5)
[2022-11-04 15:23:31,310] transcriber.transcribe:248 DEBUG -> No speech
[2022-11-04 15:23:31,310] transcriber.transcribe:258 DEBUG -> nosoeech_skip_count: None (<= 16)
[2022-11-04 15:23:31,310] cli.transcribe_from_mic:67 DEBUG -> Audio #: 8, The rest of queue: 0
[2022-11-04 15:23:34,948] cli.transcribe_from_mic:82 DEBUG -> Got. The rest of queue: 0
Analyzing[2022-11-04 15:23:34,948] transcriber.transcribe:235 DEBUG -> 60000
[2022-11-04 15:23:34,979] vad.__call__:56 DEBUG -> VAD: 0.010574160143733025 (threshold=0.5)
[2022-11-04 15:23:34,979] transcriber.transcribe:248 DEBUG -> No speech
[2022-11-04 15:23:34,979] transcriber.transcribe:258 DEBUG -> nosoeech_skip_count: None (<= 16)
[2022-11-04 15:23:34,979] cli.transcribe_from_mic:67 DEBUG -> Audio #: 9, The rest of queue: 0

Environment

Mac M1

OS:
Python Version: 3.9
Whispering version: 0.6.3

shirayu commented 1 year ago

"No speech" is the output of the VAD. How about disabling the VAD by setting --vad 0?
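For context, the log lines above show a VAD score of about 0.01 against a threshold of 0.5, which is why every chunk is reported as "No speech". A minimal sketch of that kind of gating, assuming a chunk is dropped when its score falls below the threshold (the function name is hypothetical and this is not whispering's actual code):

```python
def is_speech(vad_score: float, threshold: float = 0.5) -> bool:
    """Return True when a chunk would be passed on to the transcriber.

    Hypothetical helper: assumes a chunk is kept when its score
    reaches the threshold, so a threshold of 0 keeps everything.
    """
    return vad_score >= threshold


# Score taken from the debug log in this issue.
score = 0.010574953630566597

print(is_speech(score))                 # checked against the default 0.5
print(is_speech(score, threshold=0.0))  # what --vad 0 is assumed to amount to
```

Under that assumption, lowering the threshold to 0 lets every chunk through regardless of its score.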

ninjalu commented 1 year ago

This is what I get with whispering --language en --model small --debug --vad 0

I don't know exactly what to expect, but I assumed I would see a transcription of what I say into the mic. Instead, I only get repeated logs like the ones below:

Analyzing[2022-11-07 10:55:09,996] transcriber.transcribe:235 DEBUG -> 60000
[2022-11-07 10:55:09,998] transcriber.transcribe:266 DEBUG -> Incoming new_mel.shape: torch.Size([80, 375])
[2022-11-07 10:55:09,998] transcriber.transcribe:270 DEBUG -> buffer_mel.shape: torch.Size([80, 2250])
[2022-11-07 10:55:09,998] transcriber.transcribe:273 DEBUG -> mel.shape: torch.Size([80, 2625])
[2022-11-07 10:55:09,998] transcriber.transcribe:277 DEBUG -> seek: 0
[2022-11-07 10:55:09,998] transcriber.transcribe:282 DEBUG -> mel.shape (2625) - seek (0) < N_FRAMES (3000)
[2022-11-07 10:55:09,999] transcriber.transcribe:288 DEBUG -> No padding
[2022-11-07 10:55:09,999] transcriber.transcribe:345 DEBUG -> ctx.buffer_mel.shape: torch.Size([80, 2625])
[2022-11-07 10:55:09,999] cli.transcribe_from_mic:67 DEBUG -> Audio #: 7, The rest of queue: 0
[2022-11-07 10:55:13,824] cli.transcribe_from_mic:82 DEBUG -> Got. The rest of queue: 0

shirayu commented 1 year ago

How long have you waited? By default, it needs to wait at least 30 seconds.

https://github.com/shirayu/whispering#parse-interval

By default, Whisper does not perform analysis until the total length of the segments determined by VAD to have speech exceeds 30 seconds. However, if silence segments appear 16 times (the default value of --max_nospeech_skip) after speech is detected, the analysis is performed.
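The rule quoted above can be sketched roughly as follows. The debug logs in this thread show each incoming chunk adding 375 mel frames while analysis waits for 3000 frames (N_FRAMES), so about 8 speech chunks are needed before anything is printed. This is an illustrative model only: the class and attribute names are hypothetical and do not mirror whispering's internals, and counting silent chunks without resetting is an assumption.

```python
from dataclasses import dataclass

N_FRAMES = 3000          # 30 seconds of mel frames (value shown in the logs)
MAX_NOSPEECH_SKIP = 16   # default of --max_nospeech_skip per the README


@dataclass
class Buffer:
    """Hypothetical model of the wait-before-analysis rule."""

    frames: int = 0          # mel frames of speech collected so far
    nospeech_skips: int = 0  # silent chunks seen after speech was detected

    def push(self, new_frames: int, has_speech: bool) -> bool:
        """Add one chunk; return True when analysis should run."""
        if has_speech:
            self.frames += new_frames
        elif self.frames > 0:
            # Silence only counts once some speech has been buffered.
            self.nospeech_skips += 1
        ready = (
            self.frames >= N_FRAMES
            or self.nospeech_skips >= MAX_NOSPEECH_SKIP
        )
        if ready:
            # Start over for the next window.
            self.frames = 0
            self.nospeech_skips = 0
        return ready


buf = Buffer()
# 375 frames per chunk, as in "Incoming new_mel.shape: torch.Size([80, 375])"
results = [buf.push(375, has_speech=True) for _ in range(8)]
print(results)  # only the eighth chunk reaches 3000 frames
```

With roughly 3.75 seconds per chunk, this is consistent with the logs above still reporting 2625 buffered frames at chunk 7.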

ninjalu commented 1 year ago

Thanks! I got it running now. However, I noticed the transcription gets repeated (corrected?) for 4-5 timestamp intervals before moving on to the next chunk. Is that expected? Is there a way to output only one version instead of all the different ones?


136.98->139.06   long you will discover in fact that it's
139.06->141.60   not possible because before long you
141.60->143.28   will discover in fact that it there's not
143.28->145.52   possible. Because before long you will
145.52->146.98   discover in fact that that there's not
146.98->149.00   possible. Because before long you
149.00->150.64   will discover in fact that that there's
150.64->152.82   not possible. Because before long you
152.82->154.38   will discover in fact that it there's
154.38->156.36   not possible. Because before long you
156.36->158.14   will discover in fact that it there's
158.14->160.20   not possible. Because before long you
160.20->166.20   you will discover it is very well possible that
166.20->171.20   then it is very much stuff and we're not just going to know we are going to release stuff
171.20->175.20   And we are not just going to know we are going to release stuff
175.20->179.20   and we are not just going to know we are going to release stuff
179.20->182.20   And we are not going to know we are going to release stuff
182.20->186.20   and we are not just going to know we are going to release stuff
186.20->190.20   and we are not just going to know we are going to release stuff
190.20->193.70   We are not just going to know we are going to release stuff
193.70->197.30   Now you can say Good Fear, openly I didn't know at the moment.

Thanks!

shirayu commented 1 year ago

Does the original whisper work with the sound? If not, it might be related to the repetition problem that is reported here.

ninjalu commented 1 year ago

Thanks! That does explain some of the repetition I have observed! I have another question regarding your earlier comment about the 30 seconds.

By default, Whisper does not perform analysis until the total length of the segments determined by VAD to have speech exceeds 30 seconds.

Is there any way I could reduce the 30-second rule (maybe to just 10 seconds) so that it behaves more like streaming?

Many thanks!

shirayu commented 1 year ago

Yes, you can. I added the --frame option in whispering v0.6.4.

The default value is 3000 (i.e., 30 seconds), and you can make it smaller. However, this will sacrifice accuracy because shorter input is not what Whisper expects.
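As a rough guide to picking a value: Whisper's log-mel spectrogram uses a 16 kHz sample rate with a hop of 160 samples, so one frame covers 10 ms and 3000 frames cover 30 seconds. A small sketch of the conversion, assuming --frame is counted in those mel frames as the default of 3000 suggests (the helper name is illustrative):

```python
SAMPLE_RATE = 16_000  # Hz, Whisper's input sample rate
HOP_LENGTH = 160      # samples between mel frames (10 ms per frame)


def seconds_to_frames(seconds: float) -> int:
    """Convert a duration in seconds to a number of mel frames."""
    return int(seconds * SAMPLE_RATE / HOP_LENGTH)


print(seconds_to_frames(30))  # 3000, the default
print(seconds_to_frames(10))  # 1000
```

Under that assumption, a 10-second window would correspond to --frame 1000. The same arithmetic matches the logs earlier in this thread, where a 60000-sample chunk produces 375 frames.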

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 21 days with no activity.

lvnilesh commented 1 year ago

What was the command that you got running on the M1 Mac?
