Open david-gimeno opened 1 month ago
Hi :)
Thank you! Very happy to hear that you are using it and find it adds value.
I have sometimes seen this when using a different language tag than the one of the speaker in the audio sample you are trying to transcribe. Please also try installing our custom transformers fork, which improves some aspects of the DTW alignment running in the background. You can install it with:
pip install git+https://github.com/nyrahealth/transformers.git@crisper_whisper
If this does not resolve your issue, it's unfortunately always hard to debug without having access to the audio where it is occurring... But if this does not resolve it, let me know and we can look into it further :)
Thanks for your quick response! However, I already installed that custom fork, because I found it in a closed issue. My data is in English, so it shouldn't be a problem with a language mismatch. Unfortunately, I cannot share the data because of ethical considerations. Let me dig deeper into the problem to see what might be happening :S In the worst case I would handle the exception :(
Uff, I see. Well, sorry I can't help you; tough to say what's going wrong here without being able to debug into it. I would however assume it's always the very last timestamp that's None? If that is the case, then you could maybe adjust this function here slightly
adjust_pauses_for_hf_pipeline_output
so that you only adjust this last timestamp with something manual that makes sense... for example the last timestamp + average word length, or something like this depending on the application...
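Something roughly along these lines (untested sketch, assuming the usual Hugging Face word-timestamp output format with a "chunks" list of {"text": ..., "timestamp": (start, end)} entries):

# Rough sketch: fill in a missing end timestamp on the last word using the
# average word duration observed over the rest of the output.
def patch_missing_last_timestamp(hf_pipeline_output):
    chunks = hf_pipeline_output.get("chunks", [])
    if not chunks:
        return hf_pipeline_output
    # Average duration over words that have both a start and an end timestamp.
    durations = [
        end - start
        for start, end in (chunk["timestamp"] for chunk in chunks)
        if start is not None and end is not None
    ]
    avg_duration = sum(durations) / len(durations) if durations else 0.0
    start, end = chunks[-1]["timestamp"]
    if end is None and start is not None:
        chunks[-1]["timestamp"] = (start, start + avg_duration)
    return hf_pipeline_output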
If you encounter this with an audio file that you can share, I would be glad to help you :)
Your adjustment based on the average duration makes sense. Here you can see what I found when printing the timestamps:
48.74 48.72
48.82 48.8
49.04 49.02
49.16 49.14
49.24 49.22
49.44 49.42
49.52 49.5
49.68 49.66
49.92 49.9
44.2 None
The problem is the very last timestamp. What is weird is that even the start timestamp is lower than the previous one :S Additionally, I can tell you that the model hallucinated a bit in the transcription.
Would love to look into that with the audio file; hard to tell otherwise. Playing around with the beam size often helps quite a bit with hallucinations... Generally, a useful heuristic for detecting hallucinations is that timestamps on hallucinated content become very short, so it could be filtered on that (at least partly). I am soon going to look into the actual decoder cross-attention heads and see whether one can clearly detect hallucinations from unusual cross-attention behaviour of those dedicated heads and improve on the current version.
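Roughly, such a filter could look like this (untested sketch; the 0.04 s minimum duration is just a placeholder you would tune per application):

# Drop words whose timestamps span less than a minimum duration, since
# hallucinated content tends to be squeezed into implausibly short intervals.
def filter_short_words(chunks, min_duration=0.04):
    kept = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if start is None or end is None or (end - start) >= min_duration:
            kept.append(chunk)
    return kept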
Regarding the beam size, can I modify this hyper-parameter when using the model through Hugging Face? This is my code after following your tutorial:
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# args.model_id, torch_dtype and device are defined earlier in my script.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    args.model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
).to(device)
model_processor = AutoProcessor.from_pretrained(
    args.model_id,
)
model_pipeline = pipeline(
    'automatic-speech-recognition',
    model=model,
    tokenizer=model_processor.tokenizer,
    feature_extractor=model_processor.feature_extractor,
    chunk_length_s=30,
    batch_size=1,
    return_timestamps='word',
    torch_dtype=torch_dtype,
    device=device,
)
Does the pipeline() have a kwargs argument or something similar? I opted not to use the FasterWhisper approach because you warned that you couldn't guarantee good timestamp calculation.
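For example, would passing something like this at call time be the intended way? (Just a guess based on the generic generation API; num_beams=5 is only an example value.)

output = model_pipeline(
    waveform,
    generate_kwargs={'num_beams': 5},  # beam search instead of greedy decoding
)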
By the way, I am experiencing other kinds of issues. Look at this trace:
Traceback (most recent call last):
File "~/CrisperWhisper/get_crisper_whisper_transcripts.py", line 129, in <module>
hf_pipeline_output = crisper_whisper(waveform)
File "~/anaconda3/envs/crisper_whisper/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 292, in __call__
return super().__call__(inputs, **kwargs)
File "~/anaconda3/envs/crisper_whisper/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1154, in __call__
return next(
File "~/anaconda3/envs/crisper_whisper/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
item = next(self.iterator)
File "~anaconda3/envs/crisper_whisper/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 266, in __next__
processed = self.infer(next(self.iterator), **self.params)
File "~/anaconda3/envs/crisper_whisper/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1068, in forward
model_outputs = self._forward(model_inputs, **forward_params)
File "~/anaconda3/envs/crisper_whisper/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 507, in _forward
tokens = self.model.generate(
File "~/anaconda3/envs/crisper_whisper/lib/python3.10/site-packages/transformers/models/whisper/generation_whisper.py", line 624, in generate
outputs["token_timestamps"] = self._extract_token_timestamps(
File "~/anaconda3/envs/crisper_whisper/lib/python3.10/site-packages/transformers/models/whisper/generation_whisper.py", line 316, in _extract_token_timestamps
timestamps[batch_idx, 1:] = torch.tensor(jump_times)
RuntimeError: The expanded size of the tensor (4) must match the existing size (5) at non-singleton dimension 0. Target sizes: [4]. Tensor sizes: [5]
Fortunately, it was a problem with the original Whisper model, and it has been solved by increasing the chunk_length_s
parameter to a higher value, e.g., 100. No idea why, but I was inspired by this thread.
Hi :)
First of all, of course, congrats on your work. I think CrisperWhisper is going to be very useful for the research community!
However, I am creating this issue because I am noticing that, when processing my data, sometimes the timestamps are None. I found this error, whose traceback is here:
Why is this happening? Is there a way to handle this situation, e.g., a try-catch to set pause_duration=0 in case this happens? I have to process quite a lot of data and, although I would prefer another solution, I can accept a certain amount of mistakes.
Thanks in advance. Best regards, David.