sandrohanea / whisper.net

Whisper.net. Speech to text made simple using Whisper Models
MIT License

Invalid wave file header when using Win11 Recorder #35

Closed vyrotek closed 1 year ago

vyrotek commented 1 year ago

I was able to get the demo code running using the Kennedy.wav file. But when I recorded a file using the Windows 11 Recorder it said the wave file header was invalid.

Whisper.net.Wave.CorruptedWaveException: 'Invalid wave file header.'

Windows 11 Sound Recorder can generate WAV files at several quality levels.

image

I took your suggestion from Issue #33 and wrote out the headers for each quality level.

Kennedy.wav: RIFF?¶WAVEfmt
Auto.wav: RIFF??☻WAVEJUNK
Medium.wav: RIFFJ?☺WAVEJUNK
Best.wav: RIFFB?WAVEJUNK
High.wav: RIFFv?♠WAVEJUNK

I would have expected these files to be valid. Is there something I'm missing?

martinmueller4voice commented 1 year ago

It's important to understand that "WAV" is not an audio format but merely a container made up of several "chunks". One of those chunks (usually the first one after the header) is the "fmt" chunk, which describes the format of the audio data inside the container (usually PCM, but not necessarily). As you have found out, Kennedy.wav has this fmt chunk right after the header, while the other files have a "JUNK" chunk there instead. It could be that the Win11 audio recorder puts the format chunk somewhere else (I'm not sure whether the WAV specification demands that the "fmt" chunk be the first one) and the WAV reading library does not expect this. Most implementations I know have the format chunk offset more or less hardcoded.
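To see which chunks a given recording actually contains, and in what order, a small script can walk the container. This is a minimal Python sketch, not code from Whisper.net; the function name `list_chunks` is made up for illustration, and it assumes the whole file fits in memory.

```python
import struct

def list_chunks(data: bytes):
    """Return (chunk_id, payload_offset, payload_size) for each chunk
    found in a RIFF/WAVE byte string, in file order."""
    # The 12-byte header is "RIFF", a 4-byte size, then "WAVE".
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    chunks = []
    pos = 12
    while pos + 8 <= len(data):
        # Each chunk starts with a 4-byte ID and a little-endian 32-bit size.
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        chunks.append((chunk_id.decode("ascii", "replace"), pos + 8, size))
        # Chunk payloads are padded to an even number of bytes.
        pos += 8 + size + (size & 1)
    return chunks
```

Calling `list_chunks(open("Recording.wav", "rb").read())` on one of the Win11 recordings should show a "JUNK" entry ahead of "fmt " if the explanation above is right.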

You could use Audacity for your recordings; it abides by the standard.

sandrohanea commented 1 year ago

Hello @vyrotek, indeed, there is a problem with that JUNK chunk, as @martinmueller4voice pointed out.

The problem is in https://github.com/sandrohanea/whisper.net/blob/b397baa30ae11ede6110dd764c5e2b44a5793bcc/Whisper.net/Wave/WaveParser.cs#L156

As you can see, the WaveParser expects the fmt chunk to come immediately after the WAVE header, and then it skips any number of chunks until it finds "data": https://github.com/sandrohanea/whisper.net/blob/b397baa30ae11ede6110dd764c5e2b44a5793bcc/Whisper.net/Wave/WaveParser.cs#L207

The fix would be to move the fmt chunk parsing into the while loop, ensure that fmt was found before data, and ignore any other chunks. I will try to fix it next weekend, but if anyone wants to do it in the meantime, it would be a good first issue.
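The loop described above could look roughly like this. It is a Python sketch of the idea only, not the actual WaveParser.cs change; `parse_wave` and the dictionary keys are hypothetical names, and it assumes a plain 16-byte PCM-style fmt layout.

```python
import struct

def parse_wave(data: bytes):
    """Scan all chunks in one loop: parse "fmt " wherever it appears,
    skip unknown chunks such as "JUNK", and stop at "data"."""
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("Invalid wave file header.")
    fmt = None
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        payload = pos + 8
        if chunk_id == b"fmt ":
            if size < 16 or payload + 16 > len(data):
                raise ValueError("fmt chunk too short")
            tag, channels, rate, _byte_rate, _align, bits = struct.unpack_from(
                "<HHIIHH", data, payload)
            fmt = {"format_tag": tag, "channels": channels,
                   "sample_rate": rate, "bits_per_sample": bits}
        elif chunk_id == b"data":
            # The format must be known before the samples can be interpreted.
            if fmt is None:
                raise ValueError("data chunk found before fmt chunk")
            return fmt, data[payload:payload + size]
        # Any other chunk (JUNK, LIST, ...) is simply skipped.
        pos = payload + size + (size & 1)  # payloads are padded to even length
    raise ValueError("no data chunk found")
```

With this shape, a leading JUNK chunk no longer matters, while a file whose data chunk precedes its fmt chunk is still rejected.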

Here is a recording with this JUNK chunk:

Recording.zip

sandrohanea commented 1 year ago

Hello @vyrotek, I fixed the JUNK chunk parsing in https://github.com/sandrohanea/whisper.net/pull/39. However, just using the Win11 Recorder with the settings you provided will still fail, as Whisper models are trained only on 16 kHz data and the "High" audio quality setting produces 41 kHz.

For now, Whisper.net cannot convert this 41 kHz audio to 16 kHz (though it might be able to in the future, as it doesn't sound too complicated).
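For a rough idea of what such a conversion involves, here is a naive linear-interpolation resampler in Python. It is only a sketch and not something Whisper.net provides; `resample_linear` is a made-up name, it works on a list of mono samples, and it applies no low-pass filter, so downsampling this way can introduce aliasing. A dedicated audio tool or library would do a better job.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono sequence of numeric samples from src_rate to dst_rate
    by linear interpolation between neighbouring input samples."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if not samples:
        return []
    n_out = max(1, int(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # input samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        # Weighted average of the two nearest input samples.
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out
```

For example, `resample_linear(samples, 48000, 16000)` returns one third as many samples as it was given.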

These are the settings that will work once PR 39 is merged and released: image