grantjenks / python-wordsegment

English word segmentation, written in pure-Python, and based on a trillion-word corpus.
http://www.grantjenks.com/docs/wordsegment/

Bigram doesn't work. #9

Closed moeseth closed 7 years ago

moeseth commented 7 years ago

I have the following string.

I'm from New York.

I ran it through the wordsegment Python package as follows.

import wordsegment
wordsegment.load()  # newer releases need the corpus data loaded explicitly
wordsegment.segment("I'm from New York.")

However, I got the following response where New and York aren't together.

['im', 'from', 'new', 'york']

I can see that "new york" is in the wordsegment bigram corpus, but I'm not sure why the result doesn't keep New York together.

Thanks.

grantjenks commented 7 years ago

It's not designed to do that. WordSegment is for transforming "wheninthecourseofhumanevents" into ["when", "in", "the", "course", "of", "human", "events"]. It doesn't do anything more than that. In your case, the value "I'm from New York." was cleaned as "imfromnewyork" and then segmented.
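
The cleaning step mentioned above can be sketched roughly as follows. This is a stand-alone approximation rather than the library's own code: the name `clean_text` is hypothetical, and it assumes that cleaning means lowercasing the input and dropping everything except ASCII letters and digits.

```python
import string

# Assumption: only lowercase ASCII letters and digits survive cleaning.
ALLOWED = set(string.ascii_lowercase + string.digits)

def clean_text(text):
    """Lowercase the text and drop every character outside ALLOWED."""
    return ''.join(ch for ch in text.lower() if ch in ALLOWED)

print(clean_text("I'm from New York."))  # imfromnewyork
```

Once the spaces and punctuation are gone, the segmenter has no record of which words were originally adjacent; it only looks for the most likely split of the remaining characters.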

Sorry, it doesn't parse bigrams. Was there something in the docs that was misleading or confusing?

moeseth commented 7 years ago

The Features section of README.md says "Includes unigram and bigram data".

I'm confused as to why the bigram data is there if nothing uses it.

grantjenks commented 7 years ago

The algorithm does use the bigram data. See the score function for details: https://github.com/grantjenks/wordsegment/blob/master/wordsegment.py#L63 The comment there reads:

            # Conditional probability of the word given the previous
            # word. The technical name is *stupid backoff* and it's
            # not a probability distribution but it works well in
            # practice.

And you can learn more about stupid backoff in http://www.aclweb.org/anthology/D07-1090.pdf
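
To make the idea concrete, here is a minimal sketch of a backoff-style score in the spirit of that comment. It is not the code at the link above: the counts are invented toy values, the module-level names (`UNIGRAMS`, `BIGRAMS`, `TOTAL`) are hypothetical, and the penalty for unknown words is an assumption chosen for illustration. The paper also scales a backed-off score by a constant factor, which this sketch leaves out.

```python
# Toy counts, invented for illustration; they are not taken from the real corpus.
UNIGRAMS = {'in': 5000.0, 'the': 8000.0, 'new': 1000.0, 'york': 200.0}
BIGRAMS = {'in the': 1200.0, 'new york': 150.0}
TOTAL = 100000.0

def score(word, previous=None):
    """Score `word`, using the bigram count when `previous` is known."""
    if previous is None:
        if word in UNIGRAMS:
            return UNIGRAMS[word] / TOTAL
        # Assumed penalty: unknown words get a tiny score that shrinks with length.
        return 10.0 / (TOTAL * 10 ** len(word))
    bigram = previous + ' ' + word
    if bigram in BIGRAMS and previous in UNIGRAMS:
        # Approximate P(word | previous) as count(bigram) / count(previous).
        return BIGRAMS[bigram] / UNIGRAMS[previous]
    # Back off to the unigram score when the bigram was never seen.
    return score(word)

print(score('york', 'new'))  # bigram seen: uses the bigram count
print(score('york', 'the'))  # bigram unseen: backs off to the unigram score
```

With these toy counts, "york" scores far higher after "new" than after "the". That is how bigram data can steer the choice between candidate splits even though the output is always a flat list of single words.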

The bigram data is also useful for exploration, such as comparing the counts for "in the" and "in a":

>>> import wordsegment
>>> wordsegment.load()
>>> wordsegment.BIGRAMS['in the']
1628795324.0
>>> wordsegment.BIGRAMS['in a']
364730082.0