Hi, this is the probability for every character in the text you passed as input. You can compute the probability at the word level by looking at the probability for the last character of each word.
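A minimal sketch of that last-character lookup, for illustration only. It assumes the output of wtp.predict_proba(text) can be treated as a sequence with exactly one float per character of text, and that words are whitespace-delimited (so trailing punctuation counts as part of the word). The helper name word_boundary_probs and the probability values below are hypothetical, not real model output.

```python
import re
from typing import List, Sequence, Tuple


def word_boundary_probs(text: str, char_probs: Sequence[float]) -> List[Tuple[str, float]]:
    """Pair each whitespace-delimited word with the probability of its last character."""
    if len(text) != len(char_probs):
        raise ValueError("expected exactly one probability per character")
    # m.end() is exclusive, so m.end() - 1 is the index of the word's last character.
    return [
        (m.group(), float(char_probs[m.end() - 1]))
        for m in re.finditer(r"\S+", text)
    ]


text = "Hello This is a test."
# Made-up probabilities standing in for wtp.predict_proba(text).
char_probs = [0.01] * len(text)
char_probs[4] = 0.9    # after the "o" in "Hello"
char_probs[20] = 0.95  # after the final "."

word_probs = word_boundary_probs(text, char_probs)
for word, prob in word_probs:
    print(f"{word!r}: {prob}")
```

For an input of n whitespace-delimited words this yields n (word, probability) pairs. It only makes sense for languages that separate words with spaces.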
Thanks! Just making sure - these are new sentence probabilities, right? The fact that the probability is per character - does that mean that sentence boundaries can occur mid-words? Does that ever really happen?
> These are new sentence probabilities, right?
In practice, yes. To be precise, though, there is some nuance. If you call it via wtp.predict_proba("Hello This is a test."), they are the probabilities of a newline (\n) occurring after each character, which you can use as a proxy for new sentence probabilities. If you call it via e.g. wtp.predict_proba("Hello This is a test.", style="ud", lang_code="en"), i.e. you use an adapted version, then they are actual new sentence probabilities.
> Does that mean that sentence boundaries can occur mid-words? Does that ever really happen?
For English, no, probably not. In other languages (such as Chinese and Thai), which do not put spaces between words, a new sentence can start without any preceding whitespace.
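To illustrate why per-character probabilities are useful here, the sketch below cuts a text after every character whose probability exceeds a cutoff, without looking at whitespace at all. It again assumes one float per character. The function name split_by_threshold, the 0.5 cutoff, and the probability values are hypothetical choices for this example, not the library's own settings or output.

```python
from typing import List, Sequence


def split_by_threshold(
    text: str, char_probs: Sequence[float], threshold: float = 0.5
) -> List[str]:
    """Cut the text after every character whose probability exceeds the threshold."""
    if len(text) != len(char_probs):
        raise ValueError("expected exactly one probability per character")
    segments = []
    start = 0
    for i, prob in enumerate(char_probs):
        if prob > threshold:
            segments.append(text[start : i + 1])
            start = i + 1
    # Keep any trailing text that did not end in a predicted boundary.
    if start < len(text):
        segments.append(text[start:])
    return segments


zh_text = "你好这是一个测试"
# Made-up probabilities: a likely boundary after the second character.
zh_probs = [0.02] * len(zh_text)
zh_probs[1] = 0.9

segments = split_by_threshold(zh_text, zh_probs)
print(segments)
```

Because the cut positions come from character indices, the same function works whether or not the language separates words with spaces.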
Using an input text of n words, I was expecting to get n probabilities. But wtp.predict_proba(text) gives more than that.
How can I infer a probability of a boundary per word from this vector?