sandrohanea / whisper.net

Whisper.net. Speech to text made simple using Whisper Models

Poor real-time speech recognition performance for Chinese #160

Closed: eugeneYz closed this issue 5 months ago

eugeneYz commented 5 months ago

        private void LoadModel()
        {
            whisperFactory = WhisperFactory.FromPath("\\ggml-small.bin");
            processor = whisperFactory.CreateBuilder()
                .WithLanguage("zh")
                .Build();
        }

Recognition of WAV files works well, but real-time speech recognition produces some irrelevant results. I record with NAudio, setting WaveSource.BufferMilliseconds = 2000, and run recognition once after every 2 seconds of recording.

eugeneYz commented 5 months ago

I use NAudio to record audio and save it locally as a WAV file. When I run Whisper on the file immediately after recording, it recognizes irrelevant content. However, when I run Whisper on the same local file later, without recording again, it recognizes the audio content correctly. Is there anything specific I might have overlooked? Thank you.

        private void AudioRecStartHandle(object obj)
        {
            //if (File.Exists(tempWavFileName)) { File.Delete(tempWavFileName); }
            nAudioHelper.StartRec();
        }

        private async void AudioRecStopHandle(object obj)
        {
            try
            {
                nAudioHelper.StopRec();
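                // Wait without blocking the calling thread. A fixed delay does not guarantee
                // that RecordingStopped has already run and closed the WAV writer.
                await Task.Delay(500);
                using FileStream fileStream = new FileStream(tempWavFileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
                await foreach (var result in processor.ProcessAsync(fileStream))
                {
                    string recognizedText = result.Text;
                    if (string.IsNullOrEmpty(recognizedText))
                    {
                        // The using declaration disposes fileStream; no explicit Dispose is needed.
                        break;
                    }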
.....     }

        public void StartRec()
        {
            WaveSource = new WaveIn();
            var filePath = "D:\\2_WPF画面\\WPFSamples-main\\WpfControlsX\\WpfControlsX\\Resource\\temp.wav";
            WaveSource.WaveFormat = new WaveFormat(16000, 16, 1); // Recording format: 16 kHz, 16-bit, mono
            writer = new WaveFileWriter(filePath, WaveSource.WaveFormat);

            WaveSource.BufferMilliseconds = 3000;
            WaveSource.DataAvailable += Recording;
            WaveSource.RecordingStopped += RecordingStopped;
            WaveSource.StartRecording();

        }

        public void StopRec()
        {
            try
            {
                WaveSource?.StopRecording();
                // Dispose the wave source (not needed in the synchronous case)
                WaveSource?.Dispose();
                WaveSource = null;
            }
            catch (Exception e)
            {
                DialogHelper.Error(e.ToString());
            }
        }

        private void Recording(object sender, WaveInEventArgs e)
        {
            writer?.Write(e.Buffer, 0, e.BytesRecorded);
        }

        private void RecordingStopped(object sender, StoppedEventArgs e)
        {
            writer?.Close();
            writer?.Dispose();
            writer = null;

        }
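
One possible factor in the "works later, fails immediately" behaviour, not confirmed anywhere in this thread: the WAV writer is only closed inside the RecordingStopped handler, and a WaveFileWriter finalizes the file header when it is disposed. A fixed 500 ms sleep does not guarantee that this has happened before the file is opened for recognition. The sketch below shows one way to wait for the handler instead of sleeping. It swaps in NAudio's WaveInEvent for the WaveIn used above, and WavRecorder and StopAsync are hypothetical names introduced only for illustration.

```csharp
using System;
using System.Threading.Tasks;
using NAudio.Wave;

// Sketch only: WavRecorder and StopAsync are hypothetical names,
// not part of NAudio or Whisper.net.
public sealed class WavRecorder
{
    private WaveInEvent waveIn;
    private WaveFileWriter writer;
    private TaskCompletionSource<bool> stopped;

    public void Start(string filePath)
    {
        waveIn = new WaveInEvent
        {
            WaveFormat = new WaveFormat(16000, 16, 1), // 16 kHz, 16-bit, mono
            BufferMilliseconds = 100
        };
        writer = new WaveFileWriter(filePath, waveIn.WaveFormat);
        stopped = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        waveIn.DataAvailable += (s, e) => writer?.Write(e.Buffer, 0, e.BytesRecorded);
        waveIn.RecordingStopped += (s, e) =>
        {
            // Disposing the writer finalizes the WAV header and closes the file.
            writer?.Dispose();
            writer = null;
            if (e.Exception != null) stopped.TrySetException(e.Exception);
            else stopped.TrySetResult(true);
        };
        waveIn.StartRecording();
    }

    // Completes only after the RecordingStopped handler has closed the file.
    public async Task StopAsync()
    {
        if (waveIn == null) return; // never started, nothing to wait for
        waveIn.StopRecording();
        await stopped.Task;
        waveIn.Dispose();
        waveIn = null;
    }
}
```

With a helper like this, AudioRecStopHandle could `await recorder.StopAsync()` and then open the file, instead of calling Thread.Sleep(500). The task should be awaited rather than blocked on with Wait(), because the stop event may be delivered through the caller's synchronization context.
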

sandrohanea commented 5 months ago

Real-time processing is not fully supported, as described here: https://github.com/sandrohanea/whisper.net/issues/25

When you send only partial audio, the chunks may contain half-words, so that no token can be understood (especially for Chinese, where a token is usually a lot longer in duration).
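
To illustrate the half-word problem, here is a minimal sketch of one possible workaround: instead of cutting the audio every 2 seconds, accumulate samples and cut only after a pause. PauseChunker is a hypothetical helper, not part of Whisper.net or NAudio, and the amplitude threshold and pause length are assumed values that would need tuning. The per-sample threshold is a crude stand-in for real voice activity detection.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: buffers 16-bit mono PCM and emits a segment only after
// a pause, so that words are less likely to be split across segments.
public sealed class PauseChunker
{
    private readonly List<short> buffer = new List<short>();
    private readonly int silenceSamplesNeeded;
    private readonly int maxSamples;
    private readonly short threshold;
    private int trailingSilence;
    private bool hasSpeech;

    public PauseChunker(int sampleRate = 16000, int pauseMs = 500,
                        int maxSeconds = 15, short threshold = 500)
    {
        silenceSamplesNeeded = sampleRate * pauseMs / 1000;
        maxSamples = sampleRate * maxSeconds;
        this.threshold = threshold;
    }

    // Feed little-endian 16-bit PCM bytes (for example from a DataAvailable callback).
    // Returns the segments completed by this call, possibly none.
    public List<short[]> Append(byte[] data, int count)
    {
        var segments = new List<short[]>();
        for (int i = 0; i + 1 < count; i += 2)
        {
            short sample = (short)(data[i] | (data[i + 1] << 8));
            bool loud = Math.Abs((int)sample) > threshold;
            if (!hasSpeech && !loud) continue; // skip leading silence

            buffer.Add(sample);
            if (loud) { hasSpeech = true; trailingSilence = 0; }
            else trailingSilence++;

            // Cut after a long enough pause, or when the segment grows too long.
            if (trailingSilence >= silenceSamplesNeeded || buffer.Count >= maxSamples)
            {
                segments.Add(buffer.ToArray());
                buffer.Clear();
                hasSpeech = false;
                trailingSilence = 0;
            }
        }
        return segments;
    }
}
```

Each returned segment could then be written out as a 16 kHz mono WAV and passed to processor.ProcessAsync as in the code earlier in this thread. This trades latency for completeness: nothing is recognized until the speaker pauses or the maximum segment length is reached.
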