louismullie / treat

Natural language processing framework for Ruby.

Implosion/to_s problem with Enclitics #68

Open n8 opened 10 years ago

n8 commented 10 years ago
    require 'treat'
    include Treat::Core::DSL   # provides the sentence(...) builder

    text = "It's about time."
    text = sentence(text).apply(:tokenize, :parse)
    puts text.to_s

Results in:

It 's about time.

Shouldn't to_s return the string without the extra space between It and 's?

chrisanderton commented 10 years ago

"It's" is a contraction. For tokenisation, contractions are often treated as two words (because they really are). This is the case in Stanford CoreNLP: http://stackoverflow.com/questions/14058399/stanford-corenlp-split-words-ignoring-apostrophe

One option, as suggested in the above link, would be to reattach enclitics in the implode method. In treat this would be in the module Treat::Entities::Entity::Stringable.
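
To make the idea concrete, here is a minimal sketch of enclitic-aware joining on a flat list of token strings. It is not treat's code: the `implode_tokens` helper and the `ENCLITICS` and `PUNCTUATION` lists are made up for illustration, and a real implementation would work on entities rather than plain strings.

```ruby
# Hypothetical helper, not part of treat: joins token strings back into text.
ENCLITICS = %w['s 'll 'm 're 've 'd n't].freeze
PUNCTUATION = %w[. , ; : ! ?].freeze

def implode_tokens(tokens)
  tokens.each_with_object(String.new) do |token, out|
    # Enclitics and closing punctuation attach to the preceding token;
    # every other token is preceded by a single space.
    attach = out.empty? ||
             ENCLITICS.include?(token.downcase) ||
             PUNCTUATION.include?(token)
    out << ' ' unless attach
    out << token
  end
end

puts implode_tokens(%w[It 's about time .])
```

The separator is decided before each token is appended, so no space is ever written that would later need stripping.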

chrisanderton commented 10 years ago

So, it looks like the issue is with the current implode method on Stringable. Although it attempts to handle enclitics, from what I can see in the current implementation 'value' would already be blank at that point, so calling strip! makes no difference. When the imploded parts are merged the space is still there, as it is outside the scope of the strip!.

Here's a fixed version. I modified the recursive call to pass the value string along, so all operations are performed on that one string instead of on multiple copies. Disclaimer: I only started looking at treat about 3 hours ago!

https://github.com/chris-at-thewebfellas/treat/commit/d9b912f24d7673863ca3ea7e59016f022923ac66

for the same code, this now gives:

It's about time.
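
For readers who don't want to open the commit, the general approach described above (threading one string through the recursive call so the trailing space can actually be stripped) can be sketched roughly as follows. This is a toy model, not the committed code: `Node`, `implode`, `ENCLITIC_RE` and `PUNCT_RE` are hypothetical names, and treat's real entity classes look different.

```ruby
# Toy stand-in for an entity tree (hypothetical; treat's real classes differ).
Node = Struct.new(:value, :children) do
  def leaf?
    children.nil? || children.empty?
  end
end

ENCLITIC_RE = /\A(?:'s|'ll|'m|'re|'ve|'d|n't)\z/i
PUNCT_RE = /\A[.,;:!?]\z/

# Appends the text of `node` to the single shared buffer `out`.
def implode(node, out = String.new)
  if node.leaf?
    # The space written after the previous token lives in this same buffer,
    # so stripping here really removes it. Stripping a fresh, empty local
    # string would be a no-op, which is the bug described above.
    out.rstrip! if node.value =~ ENCLITIC_RE || node.value =~ PUNCT_RE
    out << node.value << ' '
  else
    node.children.each { |child| implode(child, out) }
  end
  out
end

tokens = %w[It 's about time .].map { |t| Node.new(t, []) }
sentence = Node.new(nil, tokens)
puts implode(sentence).rstrip
```

The point of the design is that every recursive call mutates the same string object, so whitespace handling in one call can see what an earlier call wrote.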