m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Natural Subtitle Segmentation and Splitting without trashing the readability. #829

Open ankitgurua opened 3 months ago

ankitgurua commented 3 months ago

I previously raised an issue that affects both Whisper and WhisperX: setting length limits ruins the readability of the subtitles. Full stops appear mid-sentence, segments split people's names, and sentences are cut at random points that feel unnatural.

To deal with this, I found a spaCy-based Python script (credits to Glenn Langford) that fixes all of the above while still enforcing length limits. It basically restores the readability of the subtitles no matter what character limit or max-lines value you use. The script shortens your subtitles while keeping their natural flow by splitting at punctuation, conjunctions, and other natural break words. It avoids splitting at nouns, people's names, and city names.
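To illustrate the splitting idea only, here is a minimal sketch in plain Python. It is not Glenn Langford's script and does not use spaCy: the conjunction list, the function name `split_subtitle`, and the 42-character default are assumptions made for this example. Unlike the real script, it knows nothing about nouns or names.

```python
# Hypothetical conjunction list; the real script relies on spaCy's analysis instead.
CONJUNCTIONS = {"and", "but", "or", "so", "because", "although", "while"}
PUNCTUATION = ".,;:!?"


def split_subtitle(text, max_chars=42):
    """Greedily split text into lines of at most max_chars characters,
    preferring a break after punctuation or before a conjunction."""
    words = text.split()
    lines = []
    while words:
        # Count how many leading words fit within max_chars.
        count, length = 0, 0
        for word in words:
            new_length = length + len(word) + (1 if count else 0)
            if new_length > max_chars and count > 0:
                break
            length = new_length
            count += 1
        if count == len(words):
            lines.append(" ".join(words))
            break
        # Look backwards from the length limit for a natural break point,
        # but not further than half the line, to avoid very short lines.
        best = count
        for i in range(count, count // 2, -1):
            ends_with_punctuation = words[i - 1][-1] in PUNCTUATION
            next_is_conjunction = words[i].lower().strip(PUNCTUATION) in CONJUNCTIONS
            if ends_with_punctuation or next_is_conjunction:
                best = i
                break
        lines.append(" ".join(words[:best]))
        words = words[best:]
    return lines


sentence = ("I went to the market yesterday, and I bought some apples "
            "because they were cheap.")
for line in split_subtitle(sentence):
    print(line)
# I went to the market yesterday,
# and I bought some apples
# because they were cheap.
```

If no natural break is found in the second half of a line, the sketch falls back to breaking at the length limit, which is exactly the behaviour that hurts readability and that the spaCy script is meant to avoid.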

But there's a problem: this script only works with Whisper. When I tried running it on WhisperX JSON output, it immediately threw errors. I understand this is because of the structural differences between WhisperX and Whisper output. But I really want to run this script with WhisperX, because the timestamps from the original Whisper give me headaches.

If you want to run this script with the original Whisper, do this.

Install Python, then run:

    pip install -U pip setuptools wheel
    pip install -U 'spacy[cuda11x]'
    python -m spacy download en_core_web_trf

Run this Python script with the JSON file in the same directory (https://gist.githubusercontent.com/glangford/a2b24ffd92c832c60e1b1b49da1a8b27/raw/c588b33d2598f7ef92a26edf3dc314d119a70602/subwisp.py):

    python3 -m subwisp input.json >output.srt

ViiTetrix commented 3 months ago

Did you solve this problem?

ankitgurua commented 3 months ago

> Did you solve this problem?

Yes, I did.

Basically, a few changes to the script made it run with WhisperX as well.

ViiTetrix commented 3 months ago

So can you teach me how to do this? I tried changing it many times, but I failed.

ViiTetrix commented 3 months ago

I have only implemented it on whisper. The JSON file output by whisperx does not contain the relevant token information, and I do not know how to handle this part in the script.

ankitgurua commented 3 months ago

> I have only implemented it on whisper. The JSON file output by whisperx does not contain the relevant token information, and I do not know how to handle this part in the script.

Yes, I created separate spaCy scripts for WhisperX and whisper-timestamped:
https://gist.github.com/ankitgurua/eac069ed0c95e1ce5924a10923883133
https://gist.github.com/ankitgurua/7b0db06baa8e2c7288cbbf396169120d

The problem with WhisperX is that its alignment model cannot align numbers, so its JSON contains the numbers in the text but has no timestamps for them. To deal with WhisperX output, use --suppress_numerals in your command and then use the spaCy script I provided to segment it.

ViiTetrix commented 3 months ago

It works and I understand what you mean. Regarding the lack of timestamp information for numbers, I've been inferring and filling in this data using the timestamps of adjacent words. However, during processing, the original spacy file likely uses the token information from the JSON file generated by Whisper - a feature that WhisperX lacks. I'm uncertain whether this difference impacts Spacy's functionality.
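For anyone curious what "inferring from adjacent words" can look like, here is a rough sketch. It assumes each word entry is a dict with a "word" key and optional "start"/"end" keys in seconds, which is how the WhisperX JSON appears to represent unaligned numbers. The function name `fill_missing_timestamps` is made up for this example, and spreading the gap evenly is just one possible choice.

```python
def fill_missing_timestamps(words):
    """Return a copy of `words` in which entries lacking "start"/"end"
    get times spread evenly across the gap between their timed neighbours."""
    filled = [dict(w) for w in words]  # do not mutate the caller's data
    n = len(filled)

    def is_timed(entry):
        return "start" in entry and "end" in entry

    i = 0
    while i < n:
        if is_timed(filled[i]):
            i += 1
            continue
        # Find the run of untimed words filled[i:j].
        j = i
        while j < n and not is_timed(filled[j]):
            j += 1
        left = filled[i - 1]["end"] if i > 0 else None
        right = filled[j]["start"] if j < n else None
        if left is None and right is None:
            break  # no timed word at all, nothing to infer from
        if left is None:
            left = right   # run at the very beginning: collapse onto next word
        if right is None:
            right = left   # run at the very end: collapse onto previous word
        step = (right - left) / (j - i)
        for k in range(i, j):
            filled[k]["start"] = round(left + step * (k - i), 3)
            filled[k]["end"] = round(left + step * (k - i + 1), 3)
        i = j
    return filled


words = [
    {"word": "In", "start": 0.0, "end": 0.2},
    {"word": "2023"},  # number without timestamps
    {"word": "we", "start": 1.0, "end": 1.1},
]
print(fill_missing_timestamps(words)[1])
# {'word': '2023', 'start': 0.2, 'end': 1.0}
```

The inferred times are only as good as the gap they are squeezed into, so using --suppress_numerals as suggested above is probably still the more accurate route.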

ankitgurua commented 3 months ago

> It works and I understand what you mean. Regarding the lack of timestamp information for numbers, I've been inferring and filling in this data using the timestamps of adjacent words. However, during processing, the original spacy file likely uses the token information from the JSON file generated by Whisper - a feature that WhisperX lacks. I'm uncertain whether this difference impacts Spacy's functionality.

I think I finally found a reliable way to get subtitles with near-perfect timestamp alignment and naturally segmented sentence lengths, using this spaCy script.

What I did was basically divide what WhisperX does into 3 parts:

Transcription, Aligning, Segmenting

I ran each of them separately and then combined the different methods so they work with each other.

It's a bit long, but it's perfect.

hongyuhei7722 commented 2 months ago

> What I did was basically divide what WhisperX does into 3 parts: Transcription, Aligning, Segmenting

Could you tell me how to do this? Thank you very much.

ViiTetrix commented 2 months ago

Transcription → WhisperX
Aligning → WhisperX (wav2vec2)
Segmenting → spaCy
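A rough sketch of how the three stages could be wired together in Python follows. The WhisperX calls mirror the usage shown in the project README, but the model name, batch size, and file names are placeholders. The segmenting stage is only represented by already-grouped words here: in practice the grouping would come from the spaCy script in the gists above. `srt_time` and `to_srt` are hypothetical helpers written for this example.

```python
def transcribe_and_align(audio_path, device="cuda", model_name="large-v2",
                         compute_type="float16"):
    """Stages 1 and 2: transcribe with WhisperX, then align with wav2vec2."""
    import whisperx  # imported lazily so the SRT helpers below work without it

    model = whisperx.load_model(model_name, device, compute_type=compute_type)
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, batch_size=16)

    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    return whisperx.align(result["segments"], align_model, metadata, audio, device)


def srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(groups):
    """Stage 3 output: turn groups of timed word dicts into SRT text.
    Each group becomes one subtitle cue."""
    blocks = []
    for index, group in enumerate(groups, start=1):
        start, end = group[0]["start"], group[-1]["end"]
        text = " ".join(w["word"].strip() for w in group)
        blocks.append(f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(blocks)


# Example with hand-made groups; real groups would come from the segmenter.
groups = [[
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "world.", "start": 0.5, "end": 1.25},
]]
print(to_srt(groups))
```

With real audio, something like `aligned = transcribe_and_align("audio.wav")` would produce the aligned result, and the words in `aligned["segments"]` would then be passed through the segmenter before calling `to_srt`. On a CPU-only machine, `compute_type="int8"` is likely needed instead of the float16 default.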

GenSolUSA commented 4 weeks ago

> Transcription → WhisperX
> Aligning → WhisperX (wav2vec2)
> Segmenting → spaCy
>
>   • You can find the script and environmental requirements above

I've looked at everything above but am having trouble putting it together. Can you please elaborate?