sandrohanea / whisper.net

Whisper.net. Speech to text made simple using Whisper Models
MIT License

Whisper.net gives blank output after upgrading from 1.4.7 to 1.5.0 #176

Closed by Anurag-RTS 4 months ago

Anurag-RTS commented 4 months ago

After upgrading to v1.5.0, I can no longer elicit any output using Whisper.net whereas it used to work flawlessly. Nothing has been changed in the POC env other than Whisper.net library version. Also, I noticed that I don't see Whisper debug logs anymore, but that's probably unrelated to this issue.

Converting from '.MP3' to WAV
Time Taken to init Whisper: 00:00:02.324
⟫ Starting Whisper processing...
⟫ Completed Whisper processing...

Program.cs (slightly modified from examples/Simple/Program.cs)

using System.Text.Json;

using FFMpegCore;
using FFMpegCore.Enums;
using FFMpegCore.Extend;
using NAudio.Wave;
using NAudio.Wave.SampleProviders;
using Whisper.net;
using Whisper.net.Ggml;

const string modelName = "ggml-medium.bin";
const GgmlType ggmlType = GgmlType.Medium;

var basePath = Directory.GetCurrentDirectory();
var modelsDir = Path.Combine(basePath, "Models");
var sampleDir = Path.Combine(basePath, "Samples");
var modelPath = Path.Combine(modelsDir, modelName);
var srcPath = Path.Combine(sampleDir, "ENGLISH SPEECH | EMMA WATSON: Gender Equality (English Subtitles) [nIwU-9ZTTJc].mp3"); // From https://youtu.be/nIwU-9ZTTJc
var destPath = Path.ChangeExtension(srcPath, ".wav"); // swaps only the trailing extension, not every ".mp3" occurrence in the path

if (!File.Exists(srcPath))
  throw new FileNotFoundException(srcPath);

var mediaInfo = await FFProbe.AnalyseAsync(srcPath);
if (mediaInfo.ErrorData.Count > 0)
{
  Console.WriteLine("Error: " + JsonSerializer.Serialize(mediaInfo.ErrorData));
  return;
}

if (!File.Exists(destPath) && mediaInfo.AudioStreams.Count > 0)
{
  Console.WriteLine("Converting from '{0}' to '.WAV'", Path.GetExtension(srcPath).ToUpper());
  // equivalent to ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 2 output.wav
  FFMpegArguments
    .FromFileInput(srcPath)
    .OutputToFile(
      destPath,
      true,
      opts =>
        opts
          .DisableChannel(Channel.Video)
          .WithAudioCodec("pcm_s16le")
          .WithAudioSamplingRate(16000)
          .WithCustomArgument("-ac 2")
          .WithFastStart())
    .ProcessSynchronously();
}

DateTime startTime;
TimeSpan timeTaken;

if (!File.Exists(modelPath))
{
  startTime = DateTime.UtcNow;
  await DownloadModelAsync(ggmlType, modelName, modelsDir);
  timeTaken = DateTime.UtcNow - startTime;
  Console.WriteLine("Time Taken to Download: {0}", timeTaken.ToLongString());
}

startTime = DateTime.UtcNow;

// This section creates the whisperFactory object which is used to create the processor object.
using var whisperFactory = WhisperFactory.FromPath(modelPath);

// This section creates the processor object which is used to process the audio file; it sets the language to `en` explicitly instead of using `auto` detection.
await using var processor = whisperFactory.CreateBuilder()
                                          .WithLanguage("en")
                                          .WithSpeedUp2x()
                                          .WithThreads(16)
                                          //.WithPrompt(prompt)
                                          .Build();

timeTaken = DateTime.UtcNow - startTime;
Console.WriteLine("Time Taken to init Whisper: {0}", timeTaken.ToLongString());

using var wavStream = new MemoryStream();

// Note: destPath always ends in ".wav" (see above), so this MP3 branch never runs as written.
if (destPath.EndsWith(".mp3"))
{
  startTime = DateTime.UtcNow;

  // This section opens the mp3 file and converts it to a wav file with a 16 kHz sample rate.
  await using var fileStream = File.OpenRead(destPath);

  await using var reader = new Mp3FileReader(fileStream);
  var resampler = new WdlResamplingSampleProvider(reader.ToSampleProvider(), 16000);
  WaveFileWriter.WriteWavFileToStream(wavStream, resampler.ToWaveProvider16());

  timeTaken = DateTime.UtcNow - startTime;
  Console.WriteLine("Time Taken to convert MP3: {0}", timeTaken.ToLongString());
}
else
{
  await using var fileStream = File.OpenRead(destPath);
  await fileStream.CopyToAsync(wavStream);
}

// This section sets the wavStream to the beginning of the stream. (This is required because the wavStream was written to in the previous section)
wavStream.Seek(0, SeekOrigin.Begin);

Console.WriteLine("⟫ Starting Whisper processing...");

startTime = DateTime.UtcNow;

// This section processes the audio file and prints the results (start time, end time and text) to the console.
await foreach (var result in processor.ProcessAsync(wavStream))
{
  timeTaken = DateTime.UtcNow - startTime;
  Console.WriteLine($"{result.Start.ToLongString()}-->{result.End.ToLongString()}: {result.Text,-150} [{timeTaken.ToLongString()}]");
  startTime = DateTime.UtcNow;
}

Console.WriteLine("⟫ Completed Whisper processing...");

////

async Task DownloadModelAsync(GgmlType modelType, string modelFileName, string targetModelsDir)
{
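  // Use the modelFileName parameter rather than the captured modelName constant.
  Console.WriteLine($"Model {modelFileName} not found. Downloading...");
  // Make sure the target directory exists before opening the output file.
  Directory.CreateDirectory(targetModelsDir);
  await using var modelStream = await WhisperGgmlDownloader.GetGgmlModelAsync(modelType);
  await using var fileWriter = File.OpenWrite(Path.Combine(targetModelsDir, modelFileName));
  await modelStream.CopyToAsync(fileWriter);
  Console.WriteLine($"Model {modelFileName} downloaded to {targetModelsDir}");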
}

Sing303 commented 4 months ago

Remove WithSpeedUp2x. There is no implementation of it in whisper.cpp anymore, so it will always produce an empty result. It was removed because it degraded the quality a lot; support will be added in the future.

Anurag-RTS commented 4 months ago

@Sing303 Yeah, it works... BUT, that really tanked the performance 😞 From 54s to fricking 47m20s!

Time Taken to init Whisper: 00:00:00.905
⟫ Starting Whisper processing...
00:00:00.000-->00:00:10.000:  [MUSIC]                                                                                                                                               [00:07:25.810]
00:00:10.000-->00:00:16.840:  I was appointed six months ago.                                                                                                                       [00:00:00.000]
00:00:16.840-->00:00:19.760:  And the more I've spoken about feminism,                                                                                                              [00:00:00.000]
00:00:19.760-->00:00:24.600:  the more I have realized that fighting for women's rights                                                                                             [00:00:00.000]
00:00:24.600-->00:00:29.960:  has too often become synonymous with man-hating.                                                                                                      [00:00:00.000]
00:00:29.960-->00:00:34.960:  If there is one thing I know for certain,                                                                                                             [00:08:04.272]
00:00:34.960-->00:00:38.960:  it is that this has to stop.                                                                                                                          [00:00:00.000]
00:00:38.960-->00:00:45.960:  For the record, feminism by definition is the belief                                                                                                  [00:00:00.000]
00:00:45.960-->00:00:51.960:  that men and women should have equal rights and opportunities.                                                                                        [00:00:00.000]
00:00:51.960-->00:00:56.960:  It is the theory of the political, economic,                                                                                                          [00:00:00.000]
00:00:56.960-->00:01:01.960:  and social equality of the sexes.                                                                                                                     [00:07:39.712]
00:01:01.960-->00:01:05.960:  I started questioning gender-based assumptions a long time ago.                                                                                       [00:00:00.000]
00:01:05.960-->00:01:11.960:  When I was eight, I was confused being called bossy,                                                                                                  [00:00:00.000]
00:01:11.960-->00:01:16.960:  because I wanted to direct the plays that we would put on for our parents.                                                                            [00:00:00.000]
00:01:16.960-->00:01:19.960:  But the boys were not.                                                                                                                                [00:00:00.000]
00:01:19.960-->00:01:25.960:  When at 14, I started to be sexualized by certain elements of the media.                                                                              [00:00:00.000]
00:01:25.960-->00:01:31.960:  When at 15, my girlfriends started dropping out of their beloved sports teams,                                                                        [00:04:40.371]
00:01:31.960-->00:01:34.960:  because they didn't want to appear muscly.                                                                                                            [00:00:00.000]
00:01:34.960-->00:01:42.960:  When at 18, my male friends were unable to express their feelings.                                                                                    [00:00:00.000]
00:01:42.960-->00:01:49.960:  I decided that I was a feminist, and this seemed uncomplicated to me.                                                                                 [00:00:00.000]
00:01:49.960-->00:01:58.960:  But my recent research has shown me that feminism has become an unpopular word.                                                                       [00:04:01.953]
00:01:58.960-->00:02:06.960:  Women are choosing not to identify as feminist.                                                                                                       [00:00:00.000]
00:02:06.960-->00:02:18.960:  Apparently, I am among the ranks of women whose expressions are seen as too strong, too aggressive.                                                   [00:00:00.000]
00:02:18.960-->00:02:23.960:  Isolating and anti-men.                                                                                                                               [00:04:03.947]
00:02:23.960-->00:02:27.960:  Unattractive, even.                                                                                                                                   [00:00:00.000]
00:02:27.960-->00:02:35.960:  Why has the word become such an uncomfortable one?                                                                                                    [00:00:00.000]
00:02:35.960-->00:02:44.960:  I am from Britain, and I think it is right that I am paid the same as my male counterparts.                                                           [00:00:00.000]
00:02:44.960-->00:02:50.960:  I think it is right that I should be able to make decisions about my own body.                                                                        [00:04:40.445]
00:02:50.960-->00:03:00.960:  I think it is right that women be involved on my behalf in the policies and the decisions that will affect my life.                                   [00:00:00.000]
00:03:00.960-->00:03:09.960:  I think it is right that socially I am afforded the same respect as men.                                                                              [00:00:00.000]
00:03:09.960-->00:03:22.960:  But sadly, I can say that there is no one country in the world where all women can expect to receive these rights.                                    [00:04:20.621]
00:03:22.960-->00:03:30.960:  No country in the world can yet say that they have achieved gender equality.                                                                          [00:00:00.000]
00:03:30.960-->00:03:34.960:  Thank you very, very much.                                                                                                                            [00:00:00.000]
00:03:34.960-->00:03:44.960:  [music]                                                                                                                                               [00:01:07.558]
00:03:44.960-->00:03:54.960:  [no audio]                                                                                                                                            [00:01:07.983]
⟫ Completed Whisper processing...
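
As a rough cross-check of the total, the non-zero bracketed times in the log above can be summed. The `00:00:00.000` entries belong to segments printed immediately after the previous one, so they add nothing. This is only a sketch: `SumDurations` is a hypothetical helper, not part of Whisper.net, and the values are copied by hand from the log.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Non-zero bracketed values copied from the log above, in order.
string[] logged =
{
    "00:07:25.810", "00:08:04.272", "00:07:39.712", "00:04:40.371", "00:04:01.953",
    "00:04:03.947", "00:04:40.445", "00:04:20.621", "00:01:07.558", "00:01:07.983",
};

TimeSpan total = SumDurations(logged);
Console.WriteLine($"Total across batches: {total}"); // 00:47:12.6720000

// Hypothetical helper: parses hh:mm:ss.fff strings and adds them up.
static TimeSpan SumDurations(IEnumerable<string> values) =>
    values.Select(v => TimeSpan.Parse(v, CultureInfo.InvariantCulture))
          .Aggregate(TimeSpan.Zero, (sum, t) => sum + t);
```

The sum comes to about 47m13s, which is consistent with the 47m20s reported for the whole run once setup time is included.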

Sing303 commented 4 months ago

Was the 54s on version 1.4.7, or on version 1.5.0 with the WithSpeedUp2x option? If it was with WithSpeedUp2x, then 54s is the model loading time, not the transcribing time, because with WithSpeedUp2x transcribing is not performed at all.
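
One way to tell loading time from transcribing time is to time each phase with its own `Stopwatch`. The sketch below assumes nothing about Whisper.net: `Measure` is a hypothetical helper, and the `Thread.Sleep` calls are stand-ins for the real phases (creating the factory and processor, then iterating over `ProcessAsync`).

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Stand-ins for the real work; replace the lambdas with the actual phases.
TimeSpan loadTime = Measure(() => Thread.Sleep(50));        // e.g. model load + processor build
TimeSpan transcribeTime = Measure(() => Thread.Sleep(100)); // e.g. the transcription loop

Console.WriteLine($"Load: {loadTime}");
Console.WriteLine($"Transcribe: {transcribeTime}");

// Hypothetical helper: runs one phase and returns how long it took.
static TimeSpan Measure(Action phase)
{
    var stopwatch = Stopwatch.StartNew();
    phase();
    stopwatch.Stop();
    return stopwatch.Elapsed;
}
```

If the transcription phase finishes almost instantly and yields no segments, the run time being reported is mostly loading.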

Anurag-RTS commented 4 months ago

54s on v1.4.7 with WithSpeedUp2x enabled, otherwise took ~2m13s iirc.

"because with WithSpeedUp2x transcribing is not performed at all"

This has been (and still is) in production code with WithSpeedUp2x enabled, with ~25 video/audio files transcribed and then processed daily. But now, at least on my machine, I can't reproduce that old behavior even if I go back to v1.4.7. 🤔

Anurag-RTS commented 4 months ago

OK, I was misremembering my old version. After downgrading to v1.4.6, I got back the sub-1-minute transcription times. I'll wait until ggerganov re-enables the WithSpeedUp2x option in the base library.