jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

How to Split Overly Long Sentences into Two Without Affecting Other Sentences in segments Processing #277

Closed zxl777 closed 8 months ago

zxl777 commented 10 months ago

Discussed in https://github.com/jianfch/stable-ts/discussions/274

Originally posted by **zxl777** December 30, 2023

My current setting is that each complete sentence is on one line, each ending with a period '.'. However, occasionally there are overly long sentences. How can I split these long sentences into two using certain rules, without affecting the other sentences?

The current Regrouping Methods apply to all sentences, so setting a rule will impact others. This leads to complex debugging. But I really only want to handle a few particularly long sentences.

Example

```
.......
I've not come across anything like it.
It's very useful for meditation because in the Heart Sutra, it says that by realizing all of these mechanisms of mind or these these aggregates by realizing them to be empty of self we become enlightened.
We detach our sense of self from them.
........
```

jianfch commented 10 months ago

You can use lock=True when splitting by period. It prevents future merging or splitting at those same spots.
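
A minimal sketch of what that suggestion might look like in code. The helper name split_sentences_locked is hypothetical, and the punctuation list is an assumption about what "splitting by period" means here; the only library call used is split_by_punctuation() with its lock parameter.

```python
def split_sentences_locked(result):
    """Split segments at sentence-ending periods and lock those boundaries.

    `result` is assumed to be a stable-ts transcription result. The tuple
    ('.', ' ') is assumed to match a period followed by a space, so that
    periods inside a word (e.g. decimals) are not treated as sentence ends.
    """
    # lock=True marks each new boundary so later regrouping steps
    # do not merge or split at those same spots.
    return result.split_by_punctuation([('.', ' ')], lock=True)


# Possible usage (assumed file name and model size, not from this thread):
#   import stable_whisper
#   model = stable_whisper.load_model('base')
#   result = model.transcribe('audio.mp3', regroup=False)
#   result = split_sentences_locked(result)
```

Locking only protects the sentence boundaries themselves; it does not by itself shorten a sentence that is already too long.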

zxl777 commented 10 months ago

I've tried using the 'lock' parameter, but it's challenging to meet my requirements. I simply wish to split the occasional long sentences in the middle into two segments to fulfill my needs.

Are there any other recommended methods, such as applying special treatment only to specific segments?

jianfch commented 10 months ago

You can use split_by_length() and specify a long length, e.g. max_chars=50.
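
To illustrate why a large limit leaves most sentences alone, here is a plain-Python sketch on strings. It is not stable-ts's algorithm: the helper names, the "space nearest the middle" rule, and the limit of 100 are assumptions chosen for the example. With the library itself, the equivalent step would be a call such as result.split_by_length(max_chars=...).

```python
from typing import List


def split_if_long(sentence: str, max_chars: int) -> List[str]:
    """Return the sentence unchanged if it fits, else split it in two.

    The cut is made at the space closest to the middle of the sentence
    (an assumed rule for this sketch, not the library's behaviour).
    """
    if len(sentence) <= max_chars:
        return [sentence]
    spaces = [i for i, ch in enumerate(sentence) if ch == " "]
    if not spaces:
        return [sentence]  # nothing to split on
    mid = len(sentence) // 2
    cut = min(spaces, key=lambda i: abs(i - mid))
    return [sentence[:cut], sentence[cut + 1:]]


def regroup(sentences: List[str], max_chars: int) -> List[str]:
    """Apply split_if_long to every sentence; short ones pass through."""
    out: List[str] = []
    for s in sentences:
        out.extend(split_if_long(s, max_chars))
    return out


sentences = [
    "I've not come across anything like it.",
    "It's very useful for meditation because in the Heart Sutra, it says "
    "that by realizing all of these mechanisms of mind or these these "
    "aggregates by realizing them to be empty of self we become enlightened.",
    "We detach our sense of self from them.",
]

for line in regroup(sentences, max_chars=100):
    print(line)
```

Only the middle sentence exceeds the limit, so it is the only one split; the two short sentences are returned untouched. Note that this sketch splits once, so each half can still be longer than max_chars.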