shashikg / WhisperS2T

An Optimized Speech-to-Text Pipeline for the Whisper Model Supporting Multiple Inference Engine
MIT License

Randomly getting error while generating word timestamps #59

Open rahulmate opened 7 months ago

rahulmate commented 7 months ago

Code:

    model = whisper_s2t.load_model(model_identifier="large-v2",
                                   asr_options={'word_timestamps': True},
                                   backend='TensorRT-LLM')

    files = ['output.wav']
    lang_codes = ['en']
    tasks = ['transcribe']
    initial_prompts = [None]

    out = model.transcribe_with_vad(files,
                                    lang_codes=lang_codes,
                                    tasks=tasks,
                                    initial_prompts=initial_prompts,
                                    batch_size=16)

For the above code, it sometimes throws the error below for the same file. Is there any explanation for it?

    RuntimeError                              Traceback (most recent call last)
    Cell In[15], line 10
          8 initial_prompts = [None]
          9 start = time.time()
    ---> 10 out = model.transcribe_with_vad(files,
         11                                 lang_codes=lang_codes,
         12                                 tasks=tasks,
         13                                 initial_prompts=initial_prompts,
         14                                 batch_size=16)
         15 end = time.time()
         16 print(f"batch :: {16} time:: {end-start}")

    File ~/temp_triton/triton_env/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
        112 @functools.wraps(func)
        113 def decorate_context(*args, **kwargs):
        114     with ctx_factory():
    --> 115         return func(*args, **kwargs)

    File ~/temp_triton/triton_env/lib/python3.10/site-packages/whisper_s2t/backends/__init__.py:171, in WhisperModel.transcribe_with_vad(self, audio_files, lang_codes, tasks, initial_prompts, batch_size)
        169 for signals, prompts, seq_len, seg_metadata, pbar_update in self.data_loader(audio_files, lang_codes, tasks, initial_prompts, batch_size=batch_size):
        170     mels, seq_len = self.preprocessor(signals, seq_len)
    --> 171     res = self.generate_segment_batched(mels.to(self.device), prompts, seq_len, seg_metadata)
        173     for res_idx, _seg_metadata in enumerate(seg_metadata):
        174         responses[_seg_metadata['file_id']].append({**res[res_idx],
        175                                                     'start_time': round(_seg_metadata['start_time'], 3),
        176                                                     'end_time': round(_seg_metadata['end_time'], 3)})

    File ~/temp_triton/triton_env/lib/python3.10/site-packages/whisper_s2t/backends/tensorrt/model.py:248, in WhisperModelTRT.generate_segment_batched(self, features, prompts, seq_lens, seg_metadata)
        246 text_tokens = [[_t for _t in x[0] if _t < self.tokenizer.eot]+[self.tokenizer.eot] for x in result]
        247 sot_seqs = [tuple(_[-4:]) for _ in prompts]
    --> 248 word_timings = self.align_words(features, texts, text_tokens, sot_seqs, seq_lens, seg_metadata)
        250 for _response, _word_timings in zip(response, word_timings):
        251     _response['word_timestamps'] = _word_timings

    File ~/temp_triton/triton_env/lib/python3.10/site-packages/whisper_s2t/backends/tensorrt/model.py:200, in WhisperModelTRT.align_words(self, features, texts, text_tokens, sot_seqs, seq_lens, seg_metadata)
        198 token_alignments = [[] for _ in seg_metadata]
        199 for start_seq, req_idx in start_seq_wise_req.items():
    --> 200     res = self.aligner_model.align(ctranslate2.StorageView.from_array(features[req_idx]),
        201                                    start_sequence=list(start_seq),
        202                                    text_tokens=[text_tokens[_] for _ in req_idx],
        203                                    num_frames=list(seq_lens[req_idx].detach().cpu().numpy()),
        204                                    median_filter_width=7)
        206     for _res, _req_idx in zip(res, req_idx):
        207         token_alignments[_req_idx] = _res

    RuntimeError: No position encodings are defined for positions >= 448, but got position 454
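For context, the error points at a hard limit in the model: Whisper's text decoder has a learned positional-embedding table of only 448 entries (`n_text_ctx` in the model config), so the aligner cannot place tokens at positions 448 and beyond. A minimal sketch of the constraint (the helper name here is illustrative, not part of whisper_s2t):

```python
# Whisper's text decoder has a fixed positional-embedding table with
# n_text_ctx = 448 entries; any token at position >= 448 has no encoding,
# which is exactly what the RuntimeError above reports for position 454.
MAX_TEXT_TOKEN_LENGTH = 448

def fits_position_table(tokens):
    """Return True if the token sequence fits the 448-entry position table."""
    return len(tokens) < MAX_TEXT_TOKEN_LENGTH

print(fits_position_table(list(range(454))))  # False: triggers the error
print(fits_position_table(list(range(400))))  # True
```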

aleksandr-smechov commented 7 months ago

You can try adjusting the align_words method here to this:

for start_seq, req_idx in start_seq_wise_req.items():
    # adding adjusted_num_frames
    adjusted_num_frames = [min(frame, MAX_TEXT_TOKEN_LENGTH) for frame in seq_lens[req_idx].detach().cpu().numpy()]
    res = self.aligner_model.align(
        ctranslate2.StorageView.from_array(features[req_idx]),
        start_sequence=list(start_seq),
        text_tokens=[text_tokens[_] for _ in req_idx],
        num_frames=adjusted_num_frames,
        median_filter_width=7
    )

and adjusting data_collate_fn here to:

def data_collate_fn(self, batch):
    # adding max_seq_len_samples
    max_seq_len_samples = MAX_TEXT_TOKEN_LENGTH * (HOP_LENGTH * INPUT_STRIDE)
    if self.use_dynamic_time_axis:
        max_len = min(max([_[3] for _ in batch]) + self.dta_padding, N_SAMPLES, max_seq_len_samples)
    else:
        max_len = min(N_SAMPLES, max_seq_len_samples)
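Assuming the usual Whisper front-end constants (`HOP_LENGTH = 160` samples per mel frame at 16 kHz and `INPUT_STRIDE = 2` for the encoder's downsampling; verify both against your whisper_s2t version), the cap above works out to roughly nine seconds of audio per segment:

```python
# Rough arithmetic behind max_seq_len_samples, assuming the standard
# Whisper constants; check these values in your whisper_s2t install.
MAX_TEXT_TOKEN_LENGTH = 448   # decoder position-table size
HOP_LENGTH = 160              # audio samples per mel frame (16 kHz)
INPUT_STRIDE = 2              # encoder downsampling factor

max_seq_len_samples = MAX_TEXT_TOKEN_LENGTH * (HOP_LENGTH * INPUT_STRIDE)
print(max_seq_len_samples)            # 143360 samples
print(max_seq_len_samples / 16000)    # 8.96 seconds
```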

Let me know if that fixes anything @rahulmate

rahulmate commented 7 months ago

Thanks @aleksandr-smechov, the change in the align_words function solved the issue. I haven't benchmarked yet but will run it to check the timestamps. With the changes in data_collate_fn I was getting an error from the TensorRT model:

    Could not set shape torch.Size([16, 80, 896]) for tensor x. Please check the profile range for which your model was built.

So currently I'm only using the changes in align_words, because originally the issue was with the align model itself.

milosjovanov commented 4 months ago

For me, the above didn't solve anything. The issue I'm facing is that the model (large-v3) hallucinates and repeats certain phrases, which then inflates the length of the chunk/tokens. large-v2 didn't have this problem with this specific audio, but it did with some files that were fine under large-v3. Overall, I would say the TensorRT-LLM backend shows more hallucinations than CTranslate2 does.
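One common heuristic for catching this kind of looping, borrowed from OpenAI's reference Whisper implementation (which by default rejects segments whose zlib compression ratio exceeds 2.4), is to check how well the transcript compresses; repeated phrases compress far better than normal speech. A minimal sketch:

```python
import zlib

def compression_ratio(text: str) -> float:
    """Ratio of raw to compressed size; highly repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

normal = "The quick brown fox jumps over the lazy dog."
looped = "and then and then and then " * 20  # typical hallucination loop

print(compression_ratio(normal))  # well below 2.4
print(compression_ratio(looped))  # well above 2.4: flag for re-decoding
```

Segments that trip the threshold can be re-decoded at a higher temperature, which is what the reference implementation does.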

ValentinKovalev commented 1 week ago

Hello,

I've encountered the following error while trying to make changes to align_words and data_collate_fn:

    Could not set shape torch.Size([16, 80, 896]) for tensor x. Please check the profile range for which your model was built.

I initially tried modifying align_words alone, but it did not resolve the issue. Even after altering both align_words and data_collate_fn, the error persists.

Despite these changes, the error regarding torch.Size remains the same.

Could you please provide guidance on how to address this issue?

Thank you for your assistance!