nyrahealth / CrisperWhisper

Verbatim Automatic Speech Recognition with improved word-level timestamps and filler detection

Timestamps are None? Why? How to handle them? #6

Open david-gimeno opened 1 month ago

david-gimeno commented 1 month ago

Hi :)

First of all, congrats on your work. I think CrisperWhisper is going to be very useful for the research community!

However, I am creating this issue because I am noticing that, when processing my data, sometimes the timestamps are None. Here is the traceback of the error:

Traceback (most recent call last):

  File "~/CrisperWhisper/get_crisper_whisper_transcripts.py", line 117, in <module>
    crisper_whisper_result = adjust_pauses_for_hf_pipeline_output(hf_pipeline_output)

  File "~/CrisperWhisper/utils.py", line 14, in adjust_pauses_for_hf_pipeline_output
    pause_duration = next_start - current_end

TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'

Why is this happening? Is there a way to handle this situation, e.g., a try/except that sets pause_duration = 0 when this happens? I have to process quite a lot of data and, although I would prefer another solution, I can accept a certain number of mistakes.

Thanks in advance. Best regards, David.

LaurinmyReha commented 1 month ago

Hi :)

Thank you! Very happy to hear that you are using it and find it adds value.

I have sometimes seen this when using a language tag different from the one spoken in the audio sample you are trying to transcribe. Please also try installing our custom transformers fork, which improves some aspects of the DTW alignment running in the background. You can install it with:

pip install git+https://github.com/nyrahealth/transformers.git@crisper_whisper

If this does not resolve your issue, it is unfortunately always hard to debug without access to the audio where it occurs... But let me know and we can look into it further :)

david-gimeno commented 1 month ago

Thanks for your quick response! However, I already installed that custom fork, because I found it mentioned in a closed issue. My data is in English, so it shouldn't be a problem with a language mismatch. Unfortunately, I cannot share the data because of ethical considerations. Let me dig deeper into the problem to see what is happening :S In the worst case I will handle the exception :(

LaurinmyReha commented 1 month ago

Uff, I see. Well, sorry I can't be of more help; it's tough to say what's going wrong here without being able to debug it. I would, however, assume it's always the very last timestamp that's None? If that is the case, then you could maybe slightly adjust this function:

adjust_pauses_for_hf_pipeline_output

so that you only adjust this last timestamp with something manual that makes sense, for example the last start timestamp plus the average word duration, or something similar depending on the application.
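
A minimal sketch of what I mean, run on the pipeline output before adjust_pauses_for_hf_pipeline_output (assuming the usual Hugging Face word-timestamp chunk format; the helper name and the averaging fallback are just one possible choice):

    def fill_missing_end_timestamps(hf_pipeline_output):
        # Hypothetical helper: run this on the pipeline output *before*
        # adjust_pauses_for_hf_pipeline_output so the pause computation
        # (next_start - current_end) never sees a None.
        chunks = hf_pipeline_output["chunks"]
        durations = [
            end - start
            for start, end in (chunk["timestamp"] for chunk in chunks)
            if start is not None and end is not None
        ]
        avg_duration = sum(durations) / len(durations) if durations else 0.0
        for chunk in chunks:
            start, end = chunk["timestamp"]
            if end is None and start is not None:
                # Fall back to start + average word duration.
                chunk["timestamp"] = (start, start + avg_duration)
        return hf_pipeline_output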

If you encounter this with an audio file that you can share, I would be glad to help you :)

david-gimeno commented 1 month ago

Your adjustment based on the average duration makes sense. Here is what I found when printing the timestamps:

48.74 48.72
48.82 48.8
49.04 49.02
49.16 49.14
49.24 49.22
49.44 49.42
49.52 49.5
49.68 49.66
49.92 49.9
44.2 None

The problem is the very last timestamp. What is weird is that even the start timestamp is lower than the previous one :S Additionally, I can tell you that the model hallucinated a bit in the transcription.

LaurinmyReha commented 1 month ago

I would love to look into that with the audio file; it is hard to tell otherwise. Playing around with the beam size often helps quite a bit with hallucinations. Generally, a useful heuristic for detecting hallucinations is that timestamps on hallucinated content become very short, so you could filter on that (at least partly). I am soon going to look into the actual decoder cross-attention heads and see whether one can clearly detect hallucinations from unusual cross-attention behaviour of those dedicated heads, and improve on the current version.
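
For example, a rough sketch of that duration-based filter (the 0.05 s threshold is just a guess and would need tuning for your data):

    def drop_suspiciously_short_words(chunks, min_duration=0.05):
        # Heuristic: hallucinated words tend to get near-zero durations,
        # so drop anything shorter than min_duration. Chunks with a None
        # timestamp cannot be judged and are kept as-is.
        kept = []
        for chunk in chunks:
            start, end = chunk["timestamp"]
            if start is None or end is None or end - start >= min_duration:
                kept.append(chunk)
        return kept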

david-gimeno commented 1 month ago

Regarding the beam size, can I modify this hyperparameter when using the model through Hugging Face? This is my code after following your tutorial:

    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        args.model_id,
        torch_dtype=torch_dtype,
        low_cpu_mem_usage=True,
        use_safetensors=True,
    ).to(device)

    model_processor = AutoProcessor.from_pretrained(
        args.model_id,
    )

    model_pipeline = pipeline(
        'automatic-speech-recognition',
        model=model,
        tokenizer=model_processor.tokenizer,
        feature_extractor=model_processor.feature_extractor,
        chunk_length_s=30,
        batch_size=1,
        return_timestamps='word',
        torch_dtype=torch_dtype,
        device=device,
    )

Does pipeline() have a kwargs argument or something similar? I opted not to use the FasterWhisper approach because you warned that you couldn't guarantee good timestamp calculation.
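
If I understand the Hugging Face docs correctly, something like passing a generate_kwargs dict at call time might work (num_beams=4 is just an example value):

    hf_pipeline_output = model_pipeline(
        waveform,
        generate_kwargs={"num_beams": 4},  # forwarded to model.generate()
    )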

By the way, I am experiencing another kind of issue. Look at this trace:

Traceback (most recent call last):
  File "~/CrisperWhisper/get_crisper_whisper_transcripts.py", line 129, in <module>
    hf_pipeline_output = crisper_whisper(waveform)
  File "~/anaconda3/envs/crisper_whisper/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 292, in __call__
    return super().__call__(inputs, **kwargs)
  File "~/anaconda3/envs/crisper_whisper/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1154, in __call__
    return next(
  File "~/anaconda3/envs/crisper_whisper/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "~anaconda3/envs/crisper_whisper/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
  File "~/anaconda3/envs/crisper_whisper/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1068, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "~/anaconda3/envs/crisper_whisper/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 507, in _forward
    tokens = self.model.generate(
  File "~/anaconda3/envs/crisper_whisper/lib/python3.10/site-packages/transformers/models/whisper/generation_whisper.py", line 624, in generate
    outputs["token_timestamps"] = self._extract_token_timestamps(
  File "~/anaconda3/envs/crisper_whisper/lib/python3.10/site-packages/transformers/models/whisper/generation_whisper.py", line 316, in _extract_token_timestamps
    timestamps[batch_idx, 1:] = torch.tensor(jump_times)
RuntimeError: The expanded size of the tensor (4) must match the existing size (5) at non-singleton dimension 0.  Target sizes: [4].  Tensor sizes: [5]

Fortunately, it was a problem with the original Whisper model, and it was solved by increasing the chunk_length_s parameter to a higher value, e.g., 100. No idea why, but I was inspired by this thread.
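
For reference, this is the only change relative to the pipeline snippet above (same model, tokenizer, and feature extractor as before):

    model_pipeline = pipeline(
        'automatic-speech-recognition',
        model=model,
        tokenizer=model_processor.tokenizer,
        feature_extractor=model_processor.feature_extractor,
        chunk_length_s=100,  # raised from 30; this avoided the size mismatch
        batch_size=1,
        return_timestamps='word',
        torch_dtype=torch_dtype,
        device=device,
    )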