louismullie / treat

Natural language processing framework for Ruby.

#word_count incorrectly counts contractions as two words #94

Closed: ojak closed this issue 9 years ago

ojak commented 9 years ago

#word_count counts a contraction as two words instead of one.

For example, calling #word_count on the following 6-word sentence currently returns an incorrect count of 7:

sentence("This sentence doesn't contain seven words.").tokenize.word_count
=> 7

This occurs because #tokenize splits the word "doesn't" into two tokens, "does" and "n't", and both are counted as words.

Perhaps #word_count should accept arguments, so that the method's default behavior is an accurate word count while the total tokenized count can still be requested explicitly? Or perhaps a #count method could be exposed on a tokenized segment. Something like:

sentence = sentence("This sentence doesn't contain seven words.")

sentence.tokenize.word_count
=> 6
sentence.tokenize.word_count(double_count_contractions: true)
=> 7

# Or, just expose a `#count` or `#length` method on a tokenized segment
sentence.tokenize.count
=> 7

Any thoughts?

louismullie commented 9 years ago

I agree this would be confusing. What does sentence.print_tree show? The "n't" is supposed to be of class Enclitic (which is a descendant of Token, not Word). So token_count should return 7 and word_count should return 6.
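To make the Word vs. Enclitic distinction concrete, here is a minimal sketch. The classes and the two counting helpers are hypothetical stand-ins, not Treat's actual implementation. Whether punctuation counts toward token_count in Treat is not stated in this thread, so the sketch leaves it out to match the 7 mentioned above.

```ruby
# Hypothetical stand-ins for the entity classes; not Treat's real code.
class Token
  attr_reader :value

  def initialize(value)
    @value = value
  end
end

class Word < Token; end
class Enclitic < Token; end
class Punctuation < Token; end

# Count only entities that are Words (or subclasses of Word).
def word_count(tokens)
  tokens.count { |t| t.is_a?(Word) }
end

# Count every token except punctuation (an assumption made for this sketch).
def token_count(tokens)
  tokens.count { |t| !t.is_a?(Punctuation) }
end

tokens = [
  Word.new("This"), Word.new("sentence"), Word.new("does"),
  Enclitic.new("n't"),
  Word.new("contain"), Word.new("seven"), Word.new("words"),
  Punctuation.new(".")
]

puts word_count(tokens)   # prints 6
puts token_count(tokens)  # prints 7
```

If "n't" were built as a Word instead of an Enclitic, word_count in this sketch would return 7, which is the behavior reported above.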

ojak commented 9 years ago

Here's print_tree without parsing:

> sentence("This sentence doesn't contain seven words.").tokenize.print_tree
+ Sentence (70319169704620)  --- "This sentence does [...] seven words."  ---  {}   --- []
|
+--> Word (70319118319320)  --- "This"  ---  {}   --- []
+--> Word (70319118317200)  --- "sentence"  ---  {}   --- []
+--> Word (70319118315180)  --- "does"  ---  {}   --- []
+--> Word (70319118313160)  --- "n't"  ---  {}   --- []
+--> Word (70319118286420)  --- "contain"  ---  {}   --- []
+--> Word (70319118284200)  --- "seven"  ---  {}   --- []
+--> Word (70319118282040)  --- "words"  ---  {}   --- []
+--> Punctuation (70319118279820)  --- "."  ---  {}   --- []

And with parsing:

> sentence("This sentence doesn't contain seven words.").tokenize.parse.print_tree
+ Sentence (70319171751360)  --- "This sentence does [...] seven words."  ---  {:tag_set=>:penn}   --- []
|
+--+ Phrase (70319163075360)  --- "This sentence"  ---  {:tag=>"NP"}   --- []
   |
   +--> Word (70319162384740)  --- "This"  ---  {:tag=>"DT"}   --- []
   +--> Word (70319161753420)  --- "sentence"  ---  {:tag=>"NN"}   --- []
+--+ Phrase (70319160833780)  --- "does n't contain seven words"  ---  {:tag=>"VP"}   --- []
   |
   +--> Word (70319160302720)  --- "does"  ---  {:tag=>"VBZ"}   --- []
   +--> Word (70319159626700)  --- "n't"  ---  {:tag=>"RB"}   --- []
   +--+ Phrase (70319125324640)  --- "contain seven words"  ---  {:tag=>"VP"}   --- []
      |
      +--> Word (70319124574860)  --- "contain"  ---  {:tag=>"VB"}   --- []
      +--+ Phrase (70319123708220)  --- "seven words"  ---  {:tag=>"NP"}   --- []
         |
         +--> Word (70319134976880)  --- "seven"  ---  {:tag=>"CD"}   --- []
         +--> Word (70319122487540)  --- "words"  ---  {:tag=>"NNS"}   --- []
+--> Punctuation (70319134697140)  --- "."  ---  {:tag=>"."}   --- []

louismullie commented 9 years ago

So the real issue is that "n't" is being tokenized as a Word when it should be an Enclitic, which means there's a bug in the default :ptb tokenizer. Most of the tokenizers use this method to create tokens, and it should classify the enclitic appropriately (line 350). Can you check what is happening there?

ojak commented 9 years ago

Cool. I'll take a look, thanks for locating it.

ojak commented 9 years ago

Yup. Typo in lib/treat/entities/entity/buildable.rb:18:

  Enclitics = %w['ll 'm 're 's 't 've 'nt]

Changed to:

  Enclitics = %w['ll 'm 're 's 't 've n't]

Works!

sentence("This sentence doesn't contain seven words.").tokenize.word_count
=> 6
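For anyone curious why the one-character change above matters, here is a rough sketch. It assumes the token builder decides between Word and Enclitic with a simple membership check against the list; the constant names and the classify method are made up for illustration and are not Treat's API.

```ruby
# Hypothetical classifier; names here are illustrative only, not Treat's code.
BROKEN_ENCLITICS = %w['ll 'm 're 's 't 've 'nt].freeze
FIXED_ENCLITICS  = %w['ll 'm 're 's 't 've n't].freeze

# Return :enclitic when the token is in the list, otherwise :word.
def classify(token, enclitics)
  enclitics.include?(token) ? :enclitic : :word
end

p classify("n't", BROKEN_ENCLITICS)  # prints :word
p classify("n't", FIXED_ENCLITICS)   # prints :enclitic
```

The tokenizer emits "n't", so an entry spelled 'nt can never match it, and the token falls through to Word.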

louismullie commented 9 years ago

Sweet. Can you be so kind as to submit a PR?