linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Issue with large-v3 model testing and channel mismatch error during audio processing #132

Closed Nondzu closed 10 months ago

Nondzu commented 10 months ago

🐛 Bug Report

Description

📝 During the testing phase of the large-v3 model, a channel mismatch error occurred immediately after the progress bar reached completion.

Progress and Error Details

✅ The progress bar reported successful completion:

Progress: 662363/662363 (100.00%)

Error and Stack Trace

🚫 The error surfaced after completion:

An error occurred while processing audio : Given groups=1, weight of size [1280, 128, 3], expected input[1, 80, 3000] to have 128 channels, but got 80 channels instead

Requested Action

🔍 I request an examination of the error that appears after the progress bar completes, particularly the mismatch in the expected channel count in the audio processing component.

raivisdejus commented 10 months ago

I also get this error. I think the internal structure of v3 whisper models differs, so we get the error.

Jeronymous commented 10 months ago

Thanks for notifying. The number of mel features changed with large-v3.

So decoding was failing with some non-default options.
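
To make the mismatch concrete: in the error above, the weight of size [1280, 128, 3] belongs to a convolution that expects 128 input channels, while the input [1, 80, 3000] carries 80 mel channels. The snippet below is only a minimal sketch of a check that would catch this before decoding. `ModelDims` and `check_mel_channels` are hypothetical names used for illustration, not part of whisper-timestamped. As far as I know, openai-whisper exposes the expected count as `model.dims.n_mels`, but treat that as an assumption and verify it against your installed version.

```python
from dataclasses import dataclass


@dataclass
class ModelDims:
    """Hypothetical stand-in for the model's dimension record."""
    n_mels: int  # number of mel channels the encoder expects


def check_mel_channels(mel_shape, dims):
    """Raise ValueError if the spectrogram's channel count differs from the model's.

    mel_shape is (batch, channels, frames), as in the error message.
    """
    _batch, channels, _frames = mel_shape
    if channels != dims.n_mels:
        raise ValueError(
            f"expected {dims.n_mels} mel channels, but got {channels}"
        )
    return True


# Values taken from the error message in this issue.
large_v3 = ModelDims(n_mels=128)  # large-v3 expects 128 mel channels
older = ModelDims(n_mels=80)      # earlier models expect 80

check_mel_channels((1, 80, 3000), older)  # passes silently

try:
    check_mel_channels((1, 80, 3000), large_v3)
except ValueError as err:
    message = str(err)
    print(message)
```

The point of the sketch is that the mel channel count should be read from the loaded model rather than hard-coded to 80, so the spectrogram and the encoder always agree.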

Nondzu commented 10 months ago

@Jeronymous Thank you!

I'll try it today.