heroku / heroku-buildpack-python

Heroku's buildpack for Python applications.
https://www.heroku.com/python
MIT License

Error downloading NLTK corpora #444

Closed daegloe closed 7 years ago

daegloe commented 7 years ago

Hi, we are no longer able to download and install the NLTK corpora via this build pack.

We receive the deployment error below, no matter the packages listed in the nltk.txt file:

remote: -----> Downloading NLTK corpora...
remote: -----> Downloading NLTK packages: wordnet brown punkt averaged_perceptron_tagger 
remote:        [nltk_data] Error loading wordnet brown punkt
remote:        [nltk_data]     averaged_perceptron_tagger : Package 'wordnet brown
remote: Traceback (most recent call last):
remote:   File "/app/.heroku/python/lib/python2.7/runpy.py", line 174, in _run_module_as_main
remote:     "__main__", fname, loader, pkg_name)
remote:   File "/app/.heroku/python/lib/python2.7/runpy.py", line 72, in _run_code
remote:     exec code in run_globals
remote:        [nltk_data]     punkt averaged_perceptron_tagger ' not found in index
remote:        Error installing package. Retry? [n/y/e]
remote:   File "/tmp/build_5f9f482255a2eae0e0f03f81ed130b83/.heroku/python/lib/python2.7/site-packages/nltk/downloader.py", line 2267, in <module>
remote:     halt_on_error=options.halt_on_error)
remote:   File "/tmp/build_5f9f482255a2eae0e0f03f81ed130b83/.heroku/python/lib/python2.7/site-packages/nltk/downloader.py", line 675, in download
remote:     choice = compat.raw_input().strip()
remote: EOFError: EOF when reading a line
cclauss commented 7 years ago

The second-to-last line of the traceback prompts for keyboard input; since nobody is there to answer during a build, the process dies with an EOFError.

Is there a way to echo "y" into this process or run it with --yes or --force, etc.?

kennethreitz commented 7 years ago

interesting — can you try downgrading your nltk package to an earlier version that doesn't ask?

kennethreitz commented 7 years ago

we may be able to use expect to automate this, if there isn't a flag for automated installs.

daegloe commented 7 years ago

Thanks for following up.

If you look closely at the log snippet, the install fails and then prompts to retry. The question is why does the install fail on Heroku, but succeed locally?

FWIW, we haven't upgraded our NLTK version. This was working with the build pack until recently.

Could an infrastructure change be the culprit?

daegloe commented 7 years ago

It seems to be ignoring the line breaks when reading the packages listed in the nltk.txt file.

kennethreitz commented 7 years ago

@daegloe are you on windows?

daegloe commented 7 years ago

No, Mac OS Sierra.

edmorley commented 7 years ago

The buildpack converts nltk.txt to the space separated package list using: https://github.com/heroku/heroku-buildpack-python/blob/fd360bda149687536b26ac413185a6c2b5953bca/bin/steps/nltk#L24

Is your file using CRLFs or CRs instead? What does tr "\n" " " < nltk.txt show locally?

edmorley commented 7 years ago

Whilst the Error installing package. Retry? [n/y/e] prompt seen in the OP isn't the cause of the problem, it is confusing and also means that retries won't be happening in the case of transient network issues.

It turns out it can be suppressed by using the quiet option, however that means losing out on the standard log output. Ideally if no TTY is attached the prompt would be skipped - so I've filed nltk/nltk#1825.

cclauss commented 7 years ago

@edmorley Would https://github.com/heroku/heroku-buildpack-python/issues/444#issuecomment-320466015 work?

edmorley commented 7 years ago

No since in this case the package names being passed to the downloader are invalid, so retrying wouldn't help.

kennethreitz commented 7 years ago

@edmorley thanks for filing that ticket!

edmorley commented 7 years ago

Ah actually the problem isn't due to the line endings, it's that the buildpack passes the package list quoted. Doing so works when there is only one package, but not when there are several.

Compare:

$ python2 -m nltk.downloader -d nltk_data "wordnet brown punkt averaged_perceptron_tagger"
[nltk_data] Error loading wordnet brown punkt
[nltk_data]     averaged_perceptron_tagger: Package 'wordnet brown
[nltk_data]     punkt averaged_perceptron_tagger' not found in index
Error installing package. Retry? [n/y/e]

to:

$ python2 -m nltk.downloader -d nltk_data wordnet brown punkt averaged_perceptron_tagger
[nltk_data] Downloading package wordnet to nltk_data...
...

To fix, the double quotes added by #438 need to be removed.
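The difference comes down to how the shell splits arguments. Here is a minimal Python sketch of that, using the standard library's shlex.split as a stand-in for the shell's word splitting. The helper name package_args is hypothetical and is not part of the buildpack or NLTK.

```python
import shlex


def package_args(command):
    # Hypothetical helper: return the arguments that follow "-d <dir>".
    argv = shlex.split(command)          # POSIX-style word splitting, like the shell
    return argv[argv.index("-d") + 2:]   # skip "-d" and its directory value


quoted = package_args(
    'python2 -m nltk.downloader -d nltk_data '
    '"wordnet brown punkt averaged_perceptron_tagger"'
)
unquoted = package_args(
    'python2 -m nltk.downloader -d nltk_data '
    'wordnet brown punkt averaged_perceptron_tagger'
)

print(len(quoted), quoted)
print(len(unquoted), unquoted)
```

With the quotes, the downloader receives a single package id containing spaces, which is the name it reports as not found in the index. Without them, it receives four separate ids.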

edmorley commented 7 years ago

I've opened #460 which will fix this. In the meantime you could roll back to release v110 (whose git tag is unfortunately missing, #448) or else v109 using the syntax here: https://devcenter.heroku.com/articles/buildpacks#buildpack-references

(If you choose to do that, I'd recommend switching back to the heroku/python buildpack alias later, so you continue to receive updates)

edmorley commented 7 years ago

@daegloe could you confirm that #460 fixed this? It's present in any tagged version from 114 onwards, and also via the heroku/python stable version alias.

daegloe commented 7 years ago

yes, all good now. thanks @edmorley!

kennethreitz commented 7 years ago

✨🍰✨

rchsmohsin commented 6 years ago

Hey, I'm having a very similar issue to the one originally described here.

-----> Python app detected
-----> Installing requirements with pip
-----> Downloading NLTK corpora…
-----> Downloading NLTK packages: punkt
/app/.heroku/python/lib/python3.6/runpy.py:125: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
Traceback (most recent call last):
  File "/app/.heroku/python/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/app/.heroku/python/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/app/.heroku/python/lib/python3.6/site-packages/nltk/downloader.py", line 2272, in <module>
[nltk_data] Error loading punkt : Package 'punkt\r' not found in index
Error installing package. Retry? [n/y/e]
    halt_on_error=options.halt_on_error)
  File "/app/.heroku/python/lib/python3.6/site-packages/nltk/downloader.py", line 681, in download
    choice = input().strip()
EOFError: EOF when reading a line
-----> Discovering process types
       Procfile declares types -> worker
-----> Compressing...
       Done: 47.2M
-----> Launching...
       Released v42
       https://phoenix-discordbot-test.herokuapp.com/ deployed to Heroku

For background, I'm using nltk for TextBlob on Heroku. I have a nltk.txt file in my app folder that I'm using in Dropbox, and I specified both nltk and textblob in my requirements.txt file. I'm not super knowledgeable about this, but if someone could help me out it would be much appreciated.

The corpora that I want to be installed are: punkt brown wordnet averaged_perceptron_tagger conll2000 movie_reviews

and I have listed these in my nltk.txt file.

edmorley commented 6 years ago

It looks like your nltk.txt has Windows line endings, which aren't currently handled gracefully by this buildpack's nltk step. I'd recommend converting it to Unix line endings (and doing the same for the rest of the repository; Windows line endings are generally not recommended, since they break bash and other tools). See: https://help.github.com/articles/dealing-with-line-endings/
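To illustrate why a CRLF file produces the 'punkt\r' package name seen in the log above, here is a small Python sketch. It assumes the buildpack's conversion behaves like the tr "\n" " " command mentioned earlier in this thread. Both function names are hypothetical, and the sample bytes stand in for a real nltk.txt.

```python
def naive_package_list(raw):
    # Mimics tr "\n" " ": only LF becomes a space, so any CR stays attached
    # to the package name that precedes it.
    return [name for name in raw.replace(b"\n", b" ").split(b" ") if name]


def to_unix_line_endings(raw):
    # Convert CRLF (Windows) and bare CR (old Mac) line endings to LF.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


windows_file = b"punkt\r\nbrown\r\n"   # sample contents with CRLF endings

broken = naive_package_list(windows_file)
fixed = naive_package_list(to_unix_line_endings(windows_file))

print(broken)
print(fixed)
```

To convert a real file, one option is to read it in binary mode, pass the bytes through to_unix_line_endings, and write the result back. A text editor or the Git settings described at the link above do the same job.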

YaxianLin commented 5 years ago

Hi, I am sorry to respond to a closed issue. However, I ran into an issue similar to rchsmohsin's while using only one NLTK corpus.

Downloading NLTK corpora…
remote: -----> Downloading NLTK packages: averaged_perceptron_tagger
remote: /app/.heroku/python/lib/python3.6/runpy.py:125: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour
remote:   warn(RuntimeWarning(msg))
remote: [nltk_data] Downloading package averaged_perceptron_tagger to /tmp/bui
remote: [nltk_data]     ld_dcb72ad4d4a6e9c82f1c784207ff830e/.heroku/python/nlt
remote: [nltk_data]     k_data...
remote: [nltk_data]   Package averaged_perceptron_tagger is already up-to-
remote: [nltk_data]       date!

For background, I am creating a word cloud without verbs, so only 'averaged_perceptron_tagger' from NLTK is needed. The word cloud image doesn't show up in the app, so I am wondering whether this output may be the reason.

RaviGprec commented 4 years ago

Sorry for commenting on a closed issue. I'm still facing the issue with NLTK installation on Heroku.

@daegloe, @YaxianLin what is the exact fix for this problem?

reynoldyehezkiel commented 3 years ago

Sorry for commenting on a closed issue. I'm still facing the issue with NLTK installation on Heroku.

@daegloe, @YaxianLin what is the exact fix for this problem?

Did you find the solution?

nbthinh commented 3 years ago

When I deploy a Python web app to Heroku, I get an error. Could you help me?

-----> Downloading NLTK corpora…
 !     'nltk.txt' not found, not downloading any corpora
 !     Learn more: https://devcenter.heroku.com/articles/python-nltk
-----> Discovering process types
       Procfile declares types -> web
-----> Compressing...
 !     Compiled slug size: 1.4G is too large (max is 500M).
 !     See: http://devcenter.heroku.com/articles/slug-size
 !     Push failed

harsh204016 commented 3 years ago

I generated a stopwords.csv with the code below and use that file in my app:

import nltk
import pandas as pd
nltk.download('stopwords')
from nltk.corpus import stopwords
stopword = pd.DataFrame()
stopword['text'] = stopwords.words('english')
stopword.to_csv("stopwords.csv")

But this is a solution for stopwords only, as I am still facing this issue:

[nltk_data] /tmp/build_3863ea7f/.heroku/python/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
-----> Discovering process types
       Procfile declares types -> web
-----> Compressing...
 !     Compiled slug size: 644.5M is too large (max is 500M).
 !     See: http://devcenter.heroku.com/articles/slug-size
 !     Push failed