learntextvis / textkit

Command line tool for manipulating and analyzing text
MIT License

Bug with textkit downloads? Missing file? #44

Closed: arnicas closed this issue 7 years ago

arnicas commented 7 years ago

Just tried a fresh install on a new machine via pip install -U textkit, then ran textkit download.

Got this. Are we missing something in the downloads, or has something changed in nltk? I can't look into it myself today... [Update: after I manually installed punkt, it works fine.]

(text) MAC20085:data cherny$ textkit text2words tweet_sample.txt 
Traceback (most recent call last):
  File "/Users/cherny/miniconda3/envs/text/bin/textkit", line 11, in <module>
    sys.exit(cli())
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/textkit/tokenize/words.py", line 12, in text2words
    tokens = nltk.word_tokenize(content)
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/nltk/tokenize/__init__.py", line 106, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/nltk/data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/nltk/data.py", line 919, in _open
    return find(path_, path + ['']).open()
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/nltk/data.py", line 641, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource 'tokenizers/punkt/PY3/english.pickle' not found.
  Please use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/Users/cherny/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
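
For reference, the manual workaround mentioned in the update above is just NLTK's standard downloader; fetching punkt by hand looks like this:

    # Download the punkt sentence tokenizer models that nltk.word_tokenize needs.
    import nltk
    nltk.download('punkt')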
vlandham commented 7 years ago

Thanks for the report! I think we might just be missing checks for whether the data has actually been downloaded, and need to put up better error messages (instead of just blowing up)?

I'll try a clean install as well, to better figure out where additional error handling is needed.
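
Roughly the kind of check I have in mind, as a sketch only: this isn't what's in textkit today, and the wrapper below is made up, but the idea is to test for the punkt data up front and raise a readable click error instead of the raw NLTK traceback:

    # Hypothetical guard: verify the punkt models exist before tokenizing.
    import click
    import nltk

    def require_punkt():
        try:
            nltk.data.find('tokenizers/punkt')
        except LookupError:
            raise click.ClickException(
                "NLTK 'punkt' data not found. Run 'textkit download' first, or install it "
                "directly with: python -c \"import nltk; nltk.download('punkt')\""
            )

    @click.command('text2words')
    @click.argument('text', type=click.File('r'))
    def text2words(text):
        """Tokenize input into one word per line (illustrative only)."""
        require_punkt()
        for token in nltk.word_tokenize(text.read()):
            click.echo(token)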

vlandham commented 7 years ago

Ah, I misunderstood your error message, sorry! Yes, it looks like we also need to add punkt to the downloads. Thanks!

vlandham commented 7 years ago

OK, I was able to recreate your issue with Python 2 and Python 3.

#45 includes punkt in the downloads list. It also adds slightly more informative (?) error handling to the tokenizer commands.
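
On the download side, the change amounts to fetching punkt along with the rest of the NLTK data. A rough sketch of that step (not the actual textkit code, and any package listed besides punkt is an assumption):

    # Sketch of a download step that now includes punkt.
    import nltk

    # punkt is required by nltk.word_tokenize / sent_tokenize;
    # stopwords is only a guess at another corpus the tool might want.
    NLTK_PACKAGES = ['punkt', 'stopwords']

    def download_data():
        for package in NLTK_PACKAGES:
            print('looking for {}'.format(package))
            nltk.download(package)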

I've updated the version to 0.2.2 and published it on PyPI. I also tested locally to verify this fixes the issue for me on Python 2 and Python 3.

I think this is now resolved, but if you have a second to try it out on your machine, I can leave the issue open until it's confirmed to be working for you.

Thanks again!

arnicas commented 7 years ago

I just tried it in a new conda venv and saw the message "looking for punkt", so I imagine it's good now. Thanks!