sloria / textblob-aptagger

*Deprecated* A fast and accurate part-of-speech tagger for TextBlob.
MIT License

Updated to be compatible with TextBlob 0.9.0. #4

Closed matheuscas closed 10 years ago

matheuscas commented 10 years ago

This pull request is meant to close issue 3, which I opened. I've updated the imports, code, tests, changelog, and minimum requirements (TextBlob 0.9).

sloria commented 10 years ago

This looks good! Thanks for contributing.

One last thing: the Travis build is failing because not all the corpora are being downloaded. You can add the following lines to .travis.yml to get all the necessary corpora:

before_install:
  - "wget https://s3.amazonaws.com/textblob/nltk_data.tar.gz"
  - "tar -xzvf nltk_data.tar.gz -C ~"

matheuscas commented 10 years ago

OK, then. I'll add this to .travis.yml and try again.

matheuscas commented 10 years ago

Yes, sure. But would you mind explaining the reasons to me? Just for learning purposes. :)

matheuscas commented 10 years ago

It's not working. Two tests fail when I use what you suggested. See:

======================================================================
FAIL: test_tag (tests.test_taggers.TestPerceptronTagger)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/matheuscas/Development/python/textblob-aptagger/tests/test_taggers.py", line 44, in test_tag
    'better', 'than', 'complicated', '.'])
AssertionError: Lists differ: [] != [u'Simple', u'is', u'better', ...

Second list contains 12 additional elements.
First extra element 0:
Simple

- []
+ [u'Simple',
+  u'is',
+  u'better',
+  u'than',
+  u'complex',
+  u'.',
+  u'Complex',
+  u'is',
+  u'better',
+  u'than',
+  u'complicated',
+  u'.']

======================================================================
FAIL: test_tag_textblob (tests.test_taggers.TestPerceptronTagger)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/matheuscas/Development/python/textblob-aptagger/tests/test_taggers.py", line 53, in test_tag_textblob
    'better', 'than', 'complicated'])
AssertionError: Lists differ: [] != [u'Simple', u'is', u'better', ...

Second list contains 10 additional elements.
First extra element 0:
Simple

- []
+ [u'Simple',
+  u'is',
+  u'better',
+  u'than',
+  u'complex',
+  u'Complex',
+  u'is',
+  u'better',
+  u'than',
+  u'complicated']

----------------------------------------------------------------------
Ran 5 tests in 5.480s

FAILED (failures=2)

sloria commented 10 years ago

Oh, I see. word_tokenize and sent_tokenize both return a generator rather than a list, which is more memory-efficient than holding every token in memory. However, the list comprehension on line 50 exhausts the generator of words, so the for loop on line 51 never iterates.
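To illustrate the exhaustion behaviour described above, here is a minimal sketch. It assumes a hypothetical `tokenize` generator function standing in for the real tokenizers. It is not TextBlob's actual API or the test code from this pull request, only a demonstration of how a generator behaves when iterated twice.

```python
def tokenize(text):
    """Hypothetical stand-in for a tokenizer that yields words lazily."""
    for word in text.split():
        yield word


words = tokenize("Simple is better than complex")

# First pass: the list comprehension consumes every item from the generator.
first_pass = [w for w in words]

# Second pass: the generator is already exhausted, so the loop body never runs.
second_pass = []
for w in words:
    second_pass.append(w)

# One possible fix: materialize the tokens once, then iterate as often as needed.
tokens = list(tokenize("Simple is better than complex"))
upper = [t.upper() for t in tokens]
lengths = [len(t) for t in tokens]
```

Here `first_pass` holds all five words while `second_pass` stays empty, which matches the shape of the `[] != [...]` assertion failures shown earlier. Wrapping the generator in `list()` trades some memory for the ability to iterate more than once.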

I think what you have is fine. Thanks for the contribution.

matheuscas commented 10 years ago

You're welcome. Keep up the good work.