pszemraj / vid2cleantxt

Python API & command-line tool to easily transcribe speech-based video files into clean text
Apache License 2.0

Whisper #14

Closed pszemraj closed 1 year ago

pszemraj commented 1 year ago

This PR adds integration for OpenAI's new Whisper model, which drastically improves the quality of the transcribed output documents.

pszemraj commented 1 year ago

Currently, the "CLI" works while the Python package does not install, mostly because Whisper is only implemented in the transformers dev release, which is not available via pip:


Obtaining file:///C:/Users/peter/code-dev-22/vid2cleantxt
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [3 lines of output]
      C:\Users\peter\miniconda3\envs\asr\lib\site-packages\setuptools\installer.py:27: SetuptoolsDeprecationWarning: setuptools.installer is deprecated. Requirements should be satisfied by a PEP 517 installer.
        warnings.warn(
      error in vid2cleantxt setup command: 'install_requires' must be a string or list of strings containing valid project/version requirement specifiers; Parse error at "'+https:/'": Expected stringEnd
      [end of output]
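
The parse error at '+https:/' suggests that a bare `git+https://...` URL ended up in `install_requires`, which the requirement parser rejects. A minimal sketch of one way around it, assuming the goal is to depend on the transformers development branch: a PEP 508 direct reference of the form `name @ url`. The helper name `direct_reference` and the requirement list below are hypothetical and are not taken from this repository's setup.py.

```python
def direct_reference(name: str, url: str) -> str:
    """Build a PEP 508 direct reference string: '<name> @ <url>'."""
    return f"{name} @ {url}"


# Hypothetical requirement list, for illustration only.
install_requires = [
    "torch",
    direct_reference(
        "transformers",
        "git+https://github.com/huggingface/transformers.git",
    ),
]

print(install_requires[1])
```

Note that PyPI does not accept uploads whose dependencies are direct references, so this form would only suit a draft or local install until the needed release is published.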

pszemraj commented 1 year ago

I will post a notebook to illustrate later, but this PR might stay a draft until the next production release of transformers. I guess that is fine, since most of the implementation is done now (unless it changes).

pszemraj commented 1 year ago

Here's a notebook illustrating this.

pszemraj commented 1 year ago

OK, so the CPU implementation seems fine; I need to double-check some CUDA things for GPU.

import vid2cleantxt

text_output, metadata_output = vid2cleantxt.transcribe.transcribe_dir(
    input_dir=".",
    model_id="openai/whisper-small.en",
    # chunk_length=30,
    # above are defaults to show important args
)

metadata_output

This results in errors on GPU:

Loading models @ Oct-11-2022_-00-17-17 - may take some time...
if RT seems excessive, try --verbose flag or checking logfile
Downloading: 100% 185k/185k [00:00<00:00, 878kB/s]
Downloading: 100% 810/810 [00:00<00:00, 8.46kB/s]
Downloading: 100% 999k/999k [00:00<00:00, 821kB/s]
Downloading: 100% 456k/456k [00:00<00:00, 4.70MB/s]
Downloading: 100% 52.7k/52.7k [00:00<00:00, 1.23MB/s]
Downloading: 100% 2.08k/2.08k [00:00<00:00, 55.8kB/s]
Downloading: 100% 1.72k/1.72k [00:00<00:00, 56.1kB/s]
Downloading: 100% 1.78k/1.78k [00:00<00:00, 63.6kB/s]
Downloading: 100% 967M/967M [00:16<00:00, 54.5MB/s]
Downloading: 100% 436M/436M [00:34<00:00, 50.2MB/s]
WARNING:root:Failed loading NeuSpell spellchecker, reverting to basic spellchecker
WARNING:root:invalid load key, '<'.
transcribing...: 100% 1/1 [00:16<00:00, 16.63s/it]
Creating .wav audio clips: 100% 8/8 [00:00<00:00, 97.93it/s]
Transcribing video: 100% 8/8 [00:09<00:00, 1.02s/it]
ERROR:root:Error transcribing chunk president_20_kennedy_27_s_20196220_speech_20_on_20_the_20_us_20_space_20_program_2020_c_span_clipaudio_0.wav in President20Kennedy27s20196220Speech20on20the20US20Space20Program2020CSPAN20Classroom.mp4 @ Oct-11-2022_-00
ERROR:root:Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
/content/vid2cleantxt/vid2cleantxt/transcribe.py:303: UserWarning: Error transcribing chunk - see log for details
  warnings.warn("Error transcribing chunk - see log for details")
ERROR:root:Error transcribing chunk president_20_kennedy_27_s_20196220_speech_20_on_20_the_20_us_20_space_20_program_2020_c_span_clipaudio_1.wav in President20Kennedy27s20196220Speech20on20the20US20Space20Program2020CSPAN20Classroom.mp4 @ Oct-11-2022_-00
ERROR:root:Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
ERROR:root:Error transcribing chunk president_20_kennedy_27_s_20196220_speech_20_on_20_the_20_us_20_space_20_program_2020_c_span_clipaudio_2.wav in President20Kennedy27s20196220Speech20on20the20US20Space20Program2020CSPAN20Classroom.mp4 @ Oct-11-2022_-00
ERROR:root:Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
ERROR:root:Error transcribing chunk president_20_kennedy_27_s_20196220_speech_20_on_20_the_20_us_20_space_20_program_2020_c_span_clipaudio_3.wav in President20Kennedy27s20196220Speech20on20the20US20Space20Program2020CSPAN20Classroom.mp4 @ Oct-11-2022_-00
ERROR:root:Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
ERROR:root:Error transcribing chunk president_20_kennedy_27_s_20196220_speech_20_on_20_the_20_us_20_space_20_program_2020_c_span_clipaudio_4.wav in President20Kennedy27s20196220Speech20on20the20US20Space20Program2020CSPAN20Classroom.mp4 @ Oct-11-2022_-00
ERROR:root:Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
/content/vid2cleantxt/vid2cleantxt/transcribe.py:303: UserWarning: Error transcribing chunk - see log for details
  warnings.warn("Error transcribing chunk - see log for details")
ERROR:root:Error transcribing chunk president_20_kennedy_27_s_20196220_speech_20_on_20_the_20_us_20_space_20_program_2020_c_span_clipaudio_5.wav in President20Kennedy27s20196220Speech20on20the20US20Space20Program2020CSPAN20Classroom.mp4 @ Oct-11-2022_-00
ERROR:root:Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
ERROR:root:Error transcribing chunk president_20_kennedy_27_s_20196220_speech_20_on_20_the_20_us_20_space_20_program_2020_c_span_clipaudio_6.wav in President20Kennedy27s20196220Speech20on20the20US20Space20Program2020CSPAN20Classroom.mp4 @ Oct-11-2022_-00
ERROR:root:Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
ERROR:root:Error transcribing chunk president_20_kennedy_27_s_20196220_speech_20_on_20_the_20_us_20_space_20_program_2020_c_span_clipaudio_7.wav in President20Kennedy27s20196220Speech20on20the20US20Space20Program2020CSPAN20Classroom.mp4 @ Oct-11-2022_-00
ERROR:root:Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
SC_pipeline - transcribed audio: 100% 1/1 [00:00<00:00, 10.98it/s]

/content/vid2cleantxt/v2clntxt_transc_metadata

colab_gpu__issue1.zip

pszemraj commented 1 year ago

Okay, things work on both CPU and GPU now: I realized I had omitted sending the inputs to the GPU (needed input_features = input_features.to(device)). Fixed in 05d3454587b1b2a7da0655e0def94cdd0d7979aa above.
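
For readers who hit the same "Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor)" error, here is a minimal sketch of the device-placement pattern described above. It is not the code from the commit: the small `nn.Conv1d` layer is only a stand-in for the Whisper encoder, and the names `pick_device`, `run_on_device`, and `stand_in_encoder` are made up for illustration. The point is that the model weights and the input tensor must live on the same device before the forward pass.

```python
import torch
from torch import nn


def pick_device() -> torch.device:
    """Use the GPU when one is visible, otherwise fall back to the CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


def run_on_device(
    model: nn.Module, input_features: torch.Tensor, device: torch.device
) -> torch.Tensor:
    """Move both the model and its inputs to `device`, then run a forward pass."""
    model = model.to(device)
    # Without this line the inputs stay on the CPU while the weights may be
    # on the GPU, which is what triggers the type-mismatch error.
    input_features = input_features.to(device)
    with torch.no_grad():
        return model(input_features)


device = pick_device()
# Stand-in for a speech encoder (assumption): 80 input channels, like log-mel features.
stand_in_encoder = nn.Conv1d(in_channels=80, out_channels=8, kernel_size=3, padding=1)
dummy_features = torch.randn(1, 80, 3000)  # (batch, mel bins, frames)

output = run_on_device(stand_in_encoder, dummy_features, device)
print(tuple(output.shape), output.device.type)
```

The same sketch runs unchanged on a CPU-only machine, since `pick_device` falls back to "cpu" and `.to(device)` is then a no-op.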

GPU notebook and tests work: see here. CPU notebook and tests work: linked here.

Give it a few tests locally or via the CLI to get acquainted and stress-test it, and then I think it's good to merge?

JonathanLehner commented 1 year ago

Thanks! The code looks fine and the notebooks work. One thing that might be nice would be adding an audio output to the notebooks. We should also improve the punctuation.

pszemraj commented 1 year ago

OK @JonathanLehner, I made some much-needed changes to reduce verbosity. Give it a look and merge, please. If all the conversations are resolved, merging should be possible.