ufal / whisper_streaming

Whisper realtime streaming for long speech-to-text transcription and translation
MIT License

Dealing with constant hallucinations #121

Open J-Korn opened 2 weeks ago

J-Korn commented 2 weeks ago

Using the large-v3 model to transcribe Greek audio from a live stream, I often get continuous output of the phrase "Υπότιτλοι AUTHORWAVE".

It seems the model is bugged in a way that outputs that phrase when it does not understand the input.

Setting vac and vad to True does not seem to reduce how often this occurs.

Is there some way I can discard this specific phrase or similar ones so they do not get confirmed and sent to the client?

Gldkslfmsd commented 1 week ago

Hi, you can check whether the same thing happens with the offline Whisper model with VAD on. If it does, a better model may help, or make sure that the sound quality is good enough.

Alternatively, just remove that phrase from all transcripts before searching for the longest common prefix. But beware: the phrase will then be suppressed even when it is actually spoken, and this will not help when Whisper hallucinates something else.

J-Korn commented 1 week ago

This is not an actual Greek phrase. My best guess is that the model was partially trained on Greek using community-generated subtitles for TV shows and the like, which carried the creator's name as an advertisement during moments of silence where no captioning was needed. "Υπότιτλοι" translates to "Subtitles", and "AUTHORWAVE" is not a Greek word, or a word that means anything, for that matter.

This is using the large-v3 model, and I cannot find any model that handles Greek better. Do note that this also shows up when transcribing videos with the base Whisper.

For now I am attempting to remove it like this:


if self.contains_unwanted_word(tsw, "AUTHORWAVE"):
    logger.debug("Discarding transcription result due to unwanted word 'AUTHORWAVE'")
    return None, None, ""

Here I check the words returned by self.asr.ts_words(res) in process_iter and return early if the phrase is found. I am not sure this is the correct way to go about it, though.
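The contains_unwanted_word helper is not shown in this thread. A minimal sketch of what such a check might look like follows, assuming tsw is a list of (begin, end, word) tuples and that words may carry leading spaces or attached punctuation; the function body is a guess, not the code actually in use.

```python
def contains_unwanted_word(tsw, unwanted):
    """Return True if any word in tsw contains `unwanted`, ignoring case.

    Assumption: tsw is a list of (begin, end, word) tuples. A substring
    match is used so that leading spaces or punctuation attached to the
    word do not hide it.
    """
    needle = unwanted.casefold()
    return any(needle in word.casefold() for _beg, _end, word in tsw)


# Made-up example input shaped like the hallucinated output described above.
tsw = [(0.0, 0.6, " Υπότιτλοι"), (0.6, 1.4, " AUTHORWAVE.")]
print(contains_unwanted_word(tsw, "AUTHORWAVE"))
```

A substring match is the more forgiving choice here; an exact match on the stripped word would be stricter but could miss variants with trailing punctuation.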

Edit: Related to this, I am also trying to add a check so that if 5 seconds pass without a chunk being confirmed, everything currently in the buffer is confirmed, trading accuracy for lower latency. This is my current process_iter, but I feel this particular change belongs somewhere else:

def process_iter(self):
        """Runs on the current audio buffer.
        Returns: a tuple (beg_timestamp, end_timestamp, "text"), or (None, None, "").
        The non-empty text is confirmed (committed) partial transcript.
        """
        prompt, non_prompt = self.prompt()
        logger.debug(f"PROMPT: {prompt}")
        logger.debug(f"CONTEXT: {non_prompt}")
        logger.debug(f"transcribing {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f} seconds from {self.buffer_time_offset:2.2f}")
        res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt)

        tsw = self.asr.ts_words(res)

        # Check if 'AUTHORWAVE' is in the transcription result
        if self.contains_unwanted_word(tsw, "AUTHORWAVE"):
            logger.debug("Discarding transcription result due to unwanted word 'AUTHORWAVE'")
            return None, None, ""

        self.transcript_buffer.insert(tsw, self.buffer_time_offset)
        o = self.transcript_buffer.flush()
        if o:
            self.commited.extend(o)
            self.last_confirmed_time = time.time()
            completed = self.to_flush(o)
            logger.debug(f">>>>COMPLETE NOW: {completed}")
        else:
            completed = None

        current_time = time.time()
        if current_time - self.last_confirmed_time > self.confirmation_timeout:
            logger.debug("Timeout exceeded. Forcing confirmation of available text.")
            self.force_confirm_text()

        the_rest = self.to_flush(self.transcript_buffer.complete())
        logger.debug(f"INCOMPLETE: {the_rest}")

        # there is a newly confirmed text
        if o and self.buffer_trimming_way == "sentence":  # trim the completed sentences
            if len(self.audio_buffer)/self.SAMPLING_RATE > self.buffer_trimming_sec:  # longer than this
                self.chunk_completed_sentence()

        if self.buffer_trimming_way == "segment":
            s = self.buffer_trimming_sec  # trim the completed segments longer than s,
        else:
            s = 30 # if the audio buffer is longer than 30s, trim it

        if len(self.audio_buffer)/self.SAMPLING_RATE > s:
            self.chunk_completed_segment(res)

        logger.debug(f"len of buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f}")
        return self.to_flush(o)

J-Korn commented 1 week ago

Any help on this matter would be greatly appreciated.

Gldkslfmsd commented 1 week ago

Hi, I'd like to help but I'm busy right now. A small piece of advice: create a "development set", an audio recording on which the hallucination happens and on which you can measure ASR quality quickly, preferably by WER against a gold transcript, or at least by counting the hallucinated words. Measure the quality with and without your change, and with various parameters. Use the results to decide whether to apply the change.

Btw., latency should be measured as well, but it can be neglected to start with.
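As a rough illustration of that measurement, here is a minimal sketch of word-level WER plus a count of hallucinated words. The function names, the whitespace tokenization, and the sample transcripts are all assumptions made for the example; a real evaluation would normally also normalize case and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def count_hallucinated(hypothesis, marker="AUTHORWAVE"):
    """Count the words in the hypothesis that contain the marker."""
    return sum(marker.casefold() in w.casefold() for w in hypothesis.split())


# Made-up transcripts, for illustration only.
gold = "καλημέρα σας και καλώς ήρθατε"
without_change = "καλημέρα σας Υπότιτλοι AUTHORWAVE και καλώς ήρθατε"
with_change = "καλημέρα σας και καλώς ήρθατε"

for name, hyp in [("without change", without_change), ("with change", with_change)]:
    print(name, word_error_rate(gold, hyp), count_hallucinated(hyp))
```

Running both variants over the same development audio gives two numbers per configuration, which is enough to compare a change against the baseline.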

Gldkslfmsd commented 1 day ago

Hi @J-Korn, if I were you, I would remove the unwanted word from tsw right after creating it with tsw = self.asr.ts_words(res), and then leave the rest of the process_iter function as it was.
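A minimal sketch of that filtering step, assuming tsw is a list of (begin, end, word) tuples. The helper name remove_hallucinated_words and its parameters are hypothetical. It drops every word equal to the marker and, since "Υπότιτλοι" is a real Greek word, removes that word only when it sits directly before the marker.

```python
def _normalize(word):
    """Strip surrounding whitespace and punctuation, then casefold."""
    return word.strip().strip(".,!?;:\"'").casefold()


def remove_hallucinated_words(tsw, marker="AUTHORWAVE", prefix="Υπότιτλοι"):
    """Return tsw without the marker word and without a prefix word
    that immediately precedes it. All other words are kept unchanged."""
    marker_n = marker.casefold()
    prefix_n = prefix.casefold()
    kept = []
    for beg, end, word in tsw:
        if _normalize(word) == marker_n:
            # Also drop the prefix word if it came right before the marker.
            if kept and _normalize(kept[-1][2]) == prefix_n:
                kept.pop()
            continue
        kept.append((beg, end, word))
    return kept


# Made-up example: one real word followed by the hallucinated phrase.
tsw = [(0.0, 0.5, " Καλημέρα"), (0.5, 1.0, " Υπότιτλοι"), (1.0, 1.8, " AUTHORWAVE")]
print(remove_hallucinated_words(tsw))
# In process_iter this would be applied as (hypothetical placement):
#   tsw = remove_hallucinated_words(self.asr.ts_words(res))
```

Compared with returning early, this keeps any genuine words from the same result and lets the buffer trimming at the end of process_iter still run.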

J-Korn commented 1 day ago

@Gldkslfmsd I have found that when this specific hallucination occurs, the output is always either "AUTHORWAVE" or "Υπότιτλοι AUTHORWAVE" on its own, never alongside any real transcription that I would want to keep. Going by this, I am attempting to discard the whole output by returning the empty result (None, None, ""):

if self.contains_unwanted_word(tsw, "AUTHORWAVE"):
    logger.debug("Discarding transcription result due to unwanted word 'AUTHORWAVE'")
    return None, None, ""

Will this not work?

Gldkslfmsd commented 1 day ago

Yes, if your observations are correct, then it makes sense.