Closed moeseth closed 7 years ago
It's not designed to do that. WordSegment transforms "wheninthecourseofhumanevents" into ["when", "in", "the", "course", "of", "human", "events"] and nothing more. In your case, the value "I'm from New York." was first cleaned to "imfromnewyork" and then segmented.
Sorry, it doesn't parse bigrams. Was there something in the docs that was misleading or confusing?
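To make the cleaning step mentioned above concrete, here is a minimal sketch. It assumes cleaning means lowercasing and keeping only ASCII letters and digits; `clean_sketch` is a hypothetical name for illustration, not the library's own function, and the real implementation may differ in details.

```python
import string

# Assumption: only lowercase ASCII letters and digits survive cleaning.
ALLOWED = set(string.ascii_lowercase + string.digits)


def clean_sketch(text):
    """Lowercase the text and drop every character outside ALLOWED."""
    lowered = text.lower()
    return "".join(ch for ch in lowered if ch in ALLOWED)


print(clean_sketch("I'm from New York."))  # imfromnewyork
```

Because spaces, punctuation, and capitalization are gone before segmentation starts, the segmenter has no way to know that "New York" was written as a unit in the input.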
In the features section of README.md, it says "Includes unigram and bigram data".
I'm confused about why the bigram data is there if nothing uses it.
The algorithm does use the bigram data. See the score function for details: https://github.com/grantjenks/wordsegment/blob/master/wordsegment.py#L63
The comment there reads:
# Conditional probability of the word given the previous
# word. The technical name is *stupid backoff* and it's
# not a probability distribution but it works well in
# practice.
And you can learn more about stupid backoff in http://www.aclweb.org/anthology/D07-1090.pdf
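As a rough illustration of the idea in that comment, here is a small self-contained sketch of a backoff-style score. The counts, the `TOTAL` value, and the unknown-word penalty are toy assumptions made up for this example; this is not the library's data and may not match its code line for line.

```python
# Toy counts, invented for illustration; not the real wordsegment data.
UNIGRAMS = {"in": 1000.0, "the": 2000.0, "a": 1500.0}
BIGRAMS = {"in the": 400.0, "in a": 100.0}
TOTAL = 10000.0  # hypothetical corpus size


def score(word, previous=None):
    """Score `word`, optionally conditioned on the `previous` word."""
    if previous is None:
        if word in UNIGRAMS:
            # Plain unigram relative frequency.
            return UNIGRAMS[word] / TOTAL
        # Unknown word: assumed penalty that shrinks quickly with length.
        return 10.0 / (TOTAL * 10 ** len(word))

    bigram = previous + " " + word
    if bigram in BIGRAMS and previous in UNIGRAMS:
        # Bigram relative frequency divided by the previous word's score.
        return BIGRAMS[bigram] / TOTAL / score(previous)

    # Unseen bigram: back off to the unigram score of the word alone.
    return score(word)


print(score("the", previous="in") > score("a", previous="in"))  # True
```

With these toy numbers, "the" after "in" scores higher than "a" after "in", and a pair that is missing from `BIGRAMS` simply falls back to the unigram score. The bigram data therefore influences where word boundaries are placed, but the result is still a list of single words.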
The bigram data is also useful for exploration, like comparing "in the" and "in a":
>>> import wordsegment
>>> wordsegment.load()
>>> wordsegment.BIGRAMS['in the']
1628795324.0
>>> wordsegment.BIGRAMS['in a']
364730082.0
I have the following string.
I used the wordsegment Python package as follows.
However, I got the following response where New and York aren't together.
I can see that "New York" is in the wordsegment bigram corpus, but I'm not sure why the output doesn't keep New York together.
Thanks.