Closed ojak closed 9 years ago
I agree this would be confusing. What does sentence.print_tree show? The "n't" is supposed to be of class Enclitic (which is a descendant of Token, not Word). So token_count should return 7 and word_count should return 6.
Here's print_tree without parsing:
> sentence("This sentence doesn't contain seven words.").tokenize.print_tree
+ Sentence (70319169704620) --- "This sentence does [...] seven words." --- {} --- []
|
+--> Word (70319118319320) --- "This" --- {} --- []
+--> Word (70319118317200) --- "sentence" --- {} --- []
+--> Word (70319118315180) --- "does" --- {} --- []
+--> Word (70319118313160) --- "n't" --- {} --- []
+--> Word (70319118286420) --- "contain" --- {} --- []
+--> Word (70319118284200) --- "seven" --- {} --- []
+--> Word (70319118282040) --- "words" --- {} --- []
+--> Punctuation (70319118279820) --- "." --- {} --- []
And with parsing:
> sentence("This sentence doesn't contain seven words.").tokenize.parse.print_tree
+ Sentence (70319171751360) --- "This sentence does [...] seven words." --- {:tag_set=>:penn} --- []
|
+--+ Phrase (70319163075360) --- "This sentence" --- {:tag=>"NP"} --- []
   |
   +--> Word (70319162384740) --- "This" --- {:tag=>"DT"} --- []
   +--> Word (70319161753420) --- "sentence" --- {:tag=>"NN"} --- []
+--+ Phrase (70319160833780) --- "does n't contain seven words" --- {:tag=>"VP"} --- []
   |
   +--> Word (70319160302720) --- "does" --- {:tag=>"VBZ"} --- []
   +--> Word (70319159626700) --- "n't" --- {:tag=>"RB"} --- []
   +--+ Phrase (70319125324640) --- "contain seven words" --- {:tag=>"VP"} --- []
      |
      +--> Word (70319124574860) --- "contain" --- {:tag=>"VB"} --- []
      +--+ Phrase (70319123708220) --- "seven words" --- {:tag=>"NP"} --- []
         |
         +--> Word (70319134976880) --- "seven" --- {:tag=>"CD"} --- []
         +--> Word (70319122487540) --- "words" --- {:tag=>"NNS"} --- []
+--> Punctuation (70319134697140) --- "." --- {:tag=>"."} --- []
So the real issue is that "n't" is being tokenized as a Word when it should be an Enclitic, which means there's a bug in the default :ptb tokenizer. Most of the tokenizers use this method to create tokens, and the enclitic should be handled appropriately there (line 350). Can you check what is happening there?
Cool. I'll take a look, thanks for locating it.
Yup. Typo in lib/treat/entities/entity/buildable.rb:18:
Enclitics = %w['ll 'm 're 's 't 've 'nt]
Changed to:
Enclitics = %w['ll 'm 're 's 't 've n't]
Works!
sentence("This sentence doesn't contain seven words.").tokenize.word_count
=> 6
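For anyone reading along, here is a minimal sketch of why the one-character typo matters. It assumes the token builder decides between Word and Enclitic with a plain list lookup; the actual logic in buildable.rb is not shown in this thread, so the `classify` helper and the constant names below are hypothetical. Only the two word lists are taken from the comment above.

```ruby
# Hypothetical illustration, not Treat's real code.
# The two lists are copied from the thread; everything else is assumed.
BROKEN_ENCLITICS = %w['ll 'm 're 's 't 've 'nt]
FIXED_ENCLITICS  = %w['ll 'm 're 's 't 've n't]

# Returns :enclitic when the token appears in the list, :word otherwise.
def classify(token, enclitics)
  enclitics.include?(token) ? :enclitic : :word
end

puts classify("n't", BROKEN_ENCLITICS) # prints "word"
puts classify("n't", FIXED_ENCLITICS)  # prints "enclitic"
```

With the misspelled entry, the string "n't" never matches, so under this assumption it falls through to the Word branch and inflates the word count.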
Sweet. Can you be so kind as to submit a PR?
Contractions are counted as two words instead of one with #word_count.

For example, calling #word_count on a sentence with 6 words, such as "This sentence doesn't contain seven words.", currently returns an incorrect count of 7.

This occurs because #tokenize splits the word doesn't into two tokenized words: does and n't.

Perhaps #word_count should accept arguments, where the method's default behavior is an accurate word count, but the total tokenized count can also be explicitly requested? Or perhaps expose a #count method on a tokenized segment.

Any thoughts?