segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

Please explain length of output of wtp.predict_proba(text) #114

Closed: gaggiag closed this issue 9 months ago

gaggiag commented 9 months ago

For an input text of n words, I expected to get n probabilities, but wtp.predict_proba(text) returns more than that:

>>> len(wtp.predict_proba("Hello This is a test."))
21

How can I infer a probability of a boundary per word from this vector?

bminixhofer commented 9 months ago

Hi, the output contains one probability for every character in the text you passed as input, which is why you get 21 values for a 21-character string. You can compute a word-level probability by taking the probability at the last character of each word.
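To make the suggestion above concrete, here is a minimal sketch of reading off one probability per word. The helper name word_boundary_probs is hypothetical (it is not part of wtpsplit), words are assumed to be whitespace-delimited, and the probabilities in the example are made up; in practice you would pass the output of wtp.predict_proba(text).

```python
import re


def word_boundary_probs(text, char_probs):
    """Pair each whitespace-delimited word with the probability at its last character.

    Hypothetical helper, not part of wtpsplit. `char_probs` is assumed to hold
    one probability per character of `text`.
    """
    if len(text) != len(char_probs):
        raise ValueError("expected one probability per character")
    # m.end() is exclusive, so the word's last character sits at m.end() - 1.
    return [
        (m.group(), float(char_probs[m.end() - 1]))
        for m in re.finditer(r"\S+", text)
    ]


text = "Hello This is a test."
# Made-up probabilities for illustration only.
char_probs = [0.0] * len(text)
char_probs[4] = 0.9   # last character of "Hello"
char_probs[20] = 0.8  # last character of "test."

for word, prob in word_boundary_probs(text, char_probs):
    print(word, prob)
```

Punctuation attached to a word (such as the final period) counts as part of that word here, so the value for "test." is read at the period.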

gaggiag commented 9 months ago

Thanks! Just to make sure: these are new-sentence probabilities, right? And since the probability is per character, does that mean sentence boundaries can occur mid-word? Does that ever really happen?

bminixhofer commented 9 months ago

> These are new-sentence probabilities, right?

In practice, yes, but there is some nuance. If you call wtp.predict_proba("Hello This is a test.") with no extra arguments, the values are the probabilities of a newline (\n) occurring after each character, which you can use as a proxy for new-sentence probabilities. If you use an adapted version, e.g. wtp.predict_proba("Hello This is a test.", style="ud", lang_code="en"), the values are actual new-sentence probabilities.

> Does that mean that sentence boundaries can occur mid-word? Does that ever really happen?

For English, probably not. Other languages, such as Chinese and Thai, do not put spaces between words, so a new sentence can start without any whitespace before it.
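As an illustration of why per-character output is useful for such languages, here is a rough sketch that cuts a text after every character whose probability exceeds a threshold, with no reliance on whitespace. The helper name split_at_boundaries, the 0.5 threshold, and the probabilities are all assumptions made for the example; they are not wtpsplit defaults or real model output.

```python
def split_at_boundaries(text, char_probs, threshold=0.5):
    """Split `text` after each character whose probability exceeds `threshold`.

    Hypothetical helper, not part of wtpsplit. The threshold of 0.5 is an
    arbitrary choice for illustration.
    """
    if len(text) != len(char_probs):
        raise ValueError("expected one probability per character")
    segments, start = [], 0
    for i, prob in enumerate(char_probs):
        if prob > threshold:
            # The boundary falls after character i, so include it in the segment.
            segments.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        # Keep any trailing text that was not closed by a boundary.
        segments.append(text[start:])
    return segments


text = "你好。今天天气很好。"
# Made-up probabilities: high after each full stop, zero elsewhere.
char_probs = [0, 0, 0.95, 0, 0, 0, 0, 0, 0, 0.9]
print(split_at_boundaries(text, char_probs))
```

Because nothing is stripped, joining the returned segments reproduces the original text exactly.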