learntextvis / textkit

Command line tool for manipulating and analyzing text
MIT License

Bug with textkit downloads? Missing file? #44

Closed: arnicas closed this issue 7 years ago

arnicas commented 7 years ago

Just tried a fresh install on a new machine via pip install -U textkit, then ran textkit download.

Got this. Are we missing something in the downloads, or has something changed in nltk? I can't look into it myself today... [Update: after I manually installed punkt, it works fine.]

(text) MAC20085:data cherny$ textkit text2words tweet_sample.txt 
Traceback (most recent call last):
  File "/Users/cherny/miniconda3/envs/text/bin/textkit", line 11, in <module>
    sys.exit(cli())
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/textkit/tokenize/words.py", line 12, in text2words
    tokens = nltk.word_tokenize(content)
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/nltk/tokenize/__init__.py", line 106, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/nltk/data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/nltk/data.py", line 919, in _open
    return find(path_, path + ['']).open()
  File "/Users/cherny/miniconda3/envs/text/lib/python3.5/site-packages/nltk/data.py", line 641, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource 'tokenizers/punkt/PY3/english.pickle' not found.
  Please use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/Users/cherny/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
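
For reference, the manual workaround mentioned in the update above is just NLTK's standard downloader; fetching punkt by hand looks like this:

    # Download the punkt sentence tokenizer models that nltk.word_tokenize needs.
    import nltk
    nltk.download('punkt')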
vlandham commented 7 years ago

Thanks for the report! I think we might just be missing checks for whether the data has actually been downloaded, and need to put up better error messages (instead of just blowing up)?

I'll try a clean install as well, to better figure out where additional error handling is needed.
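
Roughly the kind of check I have in mind, as a sketch only: this isn't what's in textkit today, and the wrapper below is made up, but the idea is to test for the punkt data up front and raise a readable click error instead of the raw NLTK traceback:

    # Hypothetical guard: verify the punkt models exist before tokenizing.
    import click
    import nltk

    def require_punkt():
        try:
            nltk.data.find('tokenizers/punkt')
        except LookupError:
            raise click.ClickException(
                "NLTK 'punkt' data not found. Run 'textkit download' first, or install it "
                "directly with: python -c \"import nltk; nltk.download('punkt')\""
            )

    @click.command('text2words')
    @click.argument('text', type=click.File('r'))
    def text2words(text):
        """Tokenize input into one word per line (illustrative only)."""
        require_punkt()
        for token in nltk.word_tokenize(text.read()):
            click.echo(token)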

vlandham commented 7 years ago

Ah, I misunderstood your error message, sorry! Yes, it looks like we also need to add punkt to the downloads. Thanks!

vlandham commented 7 years ago

OK, I was able to recreate your issue with Python 2 and Python 3.

#45 includes punkt in the downloads list. It also adds slightly more informative (?) error handling to the tokenizer commands.
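
On the download side, the change amounts to fetching punkt along with the rest of the NLTK data. A rough sketch of that step (not the actual textkit code, and any package listed besides punkt is an assumption):

    # Sketch of a download step that now includes punkt.
    import nltk

    # punkt is required by nltk.word_tokenize / sent_tokenize;
    # stopwords is only a guess at another corpus the tool might want.
    NLTK_PACKAGES = ['punkt', 'stopwords']

    def download_data():
        for package in NLTK_PACKAGES:
            print('looking for {}'.format(package))
            nltk.download(package)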

I've updated the version to 0.2.2 and published it on PyPI. I also tested locally to verify this fixes the issue for me on Python 2 and Python 3.

I think this is now resolved, but if you have a second to try it out on your machine, I can leave the issue open until it's confirmed to be working for you.

Thanks again!

arnicas commented 7 years ago

I just tried it in a new conda venv and saw the message "looking for punkt", so I imagine it's good now. Thanks!