Bug: split_sentence does not seem to handle newlines well

thiswillbeyourgithub commented 2 months ago

Hi,

I was just playing around with split_sentence and noticed this:

In [16]: split_sentence("This is a test\nAnd here's another one", "en", 25)
Out[16]: ["This is a test And here's", 'another one']

In [17]: split_sentence("This is a test.And here's another one", "en", 25)
Out[17]: ['This is a test.', "And here's another one"]

Given that I use markdown bullet points a lot, I often have lines that end with no punctuation.

What do you think about automatically replacing a newline with a period when it does not already follow a punctuation mark?
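A minimal sketch of that preprocessing step, assuming it runs on the text before it reaches split_sentence. The helper name `newlines_to_periods` and the set of characters treated as punctuation are hypothetical, not part of openedai-speech:

```python
# Hypothetical preprocessing helper, not part of openedai-speech.
# Characters that count as "already punctuated" (an assumption; adjust as needed).
END_PUNCTUATION = ".!?:;,"


def newlines_to_periods(text: str) -> str:
    """Join lines with spaces, adding a period to lines that lack end punctuation."""
    # Drop blank lines and surrounding whitespace.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    fixed = []
    for i, line in enumerate(lines):
        is_last = i == len(lines) - 1
        # Only lines followed by a newline get a period; the last line is left as is.
        if not is_last and line[-1] not in END_PUNCTUATION:
            line += "."
        fixed.append(line)
    return " ".join(fixed)


print(newlines_to_periods("This is a test\nAnd here's another one"))
```

The result could then be passed to split_sentence in place of the raw text, so the first example above would be treated like the second one, apart from the added space after the period.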

Also, there's no environment variable to set the text length for the splitter, right? I think lowering it would also reduce my VRAM usage. Any opinion on this?

matatonic commented 2 months ago

Good problem to know about, thanks. I'll consider this when updating to better support markdown generation.

Re: #56

thiswillbeyourgithub commented 1 month ago

Maybe a simple fix would be to pass the text through pysbd first instead of split_sentence, and only pass sentences that are longer than some limit to split_sentence.
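A rough sketch of that two-stage idea, assuming the segmenter and the length-based splitter are passed in as callables. The function name `segment_then_split` and its parameters are hypothetical, and the commented usage assumes pysbd is installed and split_sentence is already imported:

```python
from typing import Callable, List


def segment_then_split(
    text: str,
    segment: Callable[[str], List[str]],     # sentence segmenter, e.g. pysbd
    split_long: Callable[[str], List[str]],  # length-based splitter, e.g. split_sentence
    max_len: int,
) -> List[str]:
    """Segment into sentences first; only over-long sentences are split further."""
    chunks: List[str] = []
    for sentence in segment(text):
        sentence = sentence.strip()
        if not sentence:
            continue
        if len(sentence) <= max_len:
            chunks.append(sentence)
        else:
            chunks.extend(split_long(sentence))
    return chunks


# Possible wiring (assumption: pysbd installed, split_sentence in scope):
#   import pysbd
#   segmenter = pysbd.Segmenter(language="en", clean=False)
#   chunks = segment_then_split(
#       text,
#       segmenter.segment,
#       lambda s: split_sentence(s, "en", 25),
#       25,
#   )
```

Passing the two functions as arguments keeps the sketch independent of either library, so the segmenter could be swapped out without touching the length check.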

I discovered pysbd through another of your repos, so I'm also curious why you used it in some places but not here.

matatonic commented 1 month ago

I did have a version that used pysbd instead, but found no major difference, except that split_sentence was perhaps better for some languages. So why include the extra dependency? Anyway, I'm probably going to restore it after I look more deeply into this problem.

matatonic / openedai-speech

Bug: split_sentence does not seem to handle newlines well #60