pszemraj / vid2cleantxt

Python API & command-line tool to easily transcribe speech-based video files into clean text
Apache License 2.0
186 stars 26 forks

Error when trying to use it in a one hour video #22

Open vreabernardo opened 3 days ago

vreabernardo commented 3 days ago

Error transcribing chunk 25 in video.mp4 The length of decoder_input_ids, including special start tokens, prompt tokens, and previous tokens, is 2, and max_new_tokens is 512. Thus, the combined length of decoder_input_ids and max_new_tokens is: 514. This exceeds the max_target_positions of the Whisper model: 448. You should either reduce the length of your prompt, or reduce the value of max_new_tokens, so that their combined length is less than 448.
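The arithmetic in the error can be sketched directly from the numbers it reports; this is a paraphrase of the check, not the actual transformers source:

```python
# Numbers taken straight from the error message above.
max_target_positions = 448   # Whisper decoder's positional limit
decoder_input_ids_len = 2    # special start/prompt tokens in this run
chunk_max_new_tokens = 512   # vid2cleantxt's default per chunk

combined = decoder_input_ids_len + chunk_max_new_tokens  # 514
over_budget = combined > max_target_positions
print(combined, over_budget)  # 514 True
```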

echo-lalia commented 1 day ago

This error is also happening for me. I tried it with a venv using the quick start guide and the example video, and am getting the exact same error messages.

I also tried the linked Colab notebook, and got the same error. Here is the full information that gets printed in the Colab doc:

/usr/local/lib/python3.10/dist-packages/neuspell/seq_modeling/sclstmbert.py:23: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint_data = torch.load(os.path.join(checkpoint_path, "model.pth.tar"), map_location=map_location)

transcribing...: 100%
 1/1 [00:02<00:00,  2.17s/it]
Creating .wav audio clips: 100%
 8/8 [00:00<00:00, 169.84it/s]
Transcribing video: 100%
 8/8 [00:00<00:00, 28.34it/s]

/usr/local/lib/python3.10/dist-packages/vid2cleantxt/transcribe.py:306: UserWarning: Error transcribing chunk 0 - see log for details
  warnings.warn(f"Error transcribing chunk {i} - see log for details")
/usr/local/lib/python3.10/dist-packages/vid2cleantxt/transcribe.py:306: UserWarning: Error transcribing chunk 1 - see log for details
  warnings.warn(f"Error transcribing chunk {i} - see log for details")
/usr/local/lib/python3.10/dist-packages/vid2cleantxt/transcribe.py:306: UserWarning: Error transcribing chunk 2 - see log for details
  warnings.warn(f"Error transcribing chunk {i} - see log for details")
/usr/local/lib/python3.10/dist-packages/vid2cleantxt/transcribe.py:306: UserWarning: Error transcribing chunk 3 - see log for details
  warnings.warn(f"Error transcribing chunk {i} - see log for details")
/usr/local/lib/python3.10/dist-packages/vid2cleantxt/transcribe.py:306: UserWarning: Error transcribing chunk 4 - see log for details
  warnings.warn(f"Error transcribing chunk {i} - see log for details")
/usr/local/lib/python3.10/dist-packages/vid2cleantxt/transcribe.py:306: UserWarning: Error transcribing chunk 5 - see log for details
  warnings.warn(f"Error transcribing chunk {i} - see log for details")
/usr/local/lib/python3.10/dist-packages/vid2cleantxt/transcribe.py:306: UserWarning: Error transcribing chunk 6 - see log for details
  warnings.warn(f"Error transcribing chunk {i} - see log for details")
/usr/local/lib/python3.10/dist-packages/vid2cleantxt/transcribe.py:306: UserWarning: Error transcribing chunk 7 - see log for details
  warnings.warn(f"Error transcribing chunk {i} - see log for details")

SC_pipeline - transcribed audio: 100%
 1/1 [00:00<00:00, 30.69it/s]

And, the resulting text files are empty.

echo-lalia commented 1 day ago

Based on the log messages, I was able to find a quick fix.
Since I don't know what broke this in the first place, I'm worried the fix may be papering over the real issue. But this change works for me:

In vid2cleantxt/transcribe.py, line 236, change:

    chunk_max_new_tokens=512,

to:

    chunk_max_new_tokens=446,

This stops the above error, and allows the transcription to complete successfully.
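For reference, the 446 above can be derived rather than hard-coded: it's the decoder's positional limit minus the two special start tokens the error message counts. A minimal sketch (the helper name is mine, not part of the vid2cleantxt API):

```python
def safe_chunk_max_new_tokens(max_target_positions: int = 448,
                              reserved_decoder_tokens: int = 2) -> int:
    # Largest max_new_tokens such that the reserved decoder tokens plus
    # the newly generated tokens fit within Whisper's positional limit.
    return max_target_positions - reserved_decoder_tokens

print(safe_chunk_max_new_tokens())  # 446
```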

pszemraj commented 12 hours ago

hey, thanks for reporting this and the PR. I'll give it a look over the next few days. It's definitely possible some code got shifted around in transformers, as it's been a while since I updated this.

will report back here and on the PR once I have a chance to look at it!