sloria / TextBlob

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
https://textblob.readthedocs.io/
MIT License
9.18k stars 1.15k forks

Textblob not finding the downloaded corpora #474

Open cagan-elden opened 2 months ago

cagan-elden commented 2 months ago

python -m textblob.download_corpora

Although I downloaded the corpora as instructed in the error message, it still does not work. I'm not sure whether it's because of the NLTK library, because I've installed that too.
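A common cause of this symptom is that the downloader wrote the corpora into one location (e.g. one user's home directory) while the code searches others. A minimal sketch of the directories NLTK searches, approximated here for diagnosis (this is not NLTK's actual code, and the system paths are typical defaults, not guaranteed):

```python
import os

def default_nltk_search_dirs(environ=None):
    """Approximate the directories NLTK searches for data, in priority
    order: NLTK_DATA (if set), then the current user's home directory,
    then common system-wide locations. A sketch for diagnosis only."""
    environ = os.environ if environ is None else environ
    dirs = []
    if environ.get("NLTK_DATA"):
        dirs.extend(environ["NLTK_DATA"].split(os.pathsep))
    dirs.append(os.path.join(os.path.expanduser("~"), "nltk_data"))
    dirs.extend(["/usr/share/nltk_data", "/usr/local/share/nltk_data"])
    return dirs

print(default_nltk_search_dirs())
```

Comparing this list against where `python -m textblob.download_corpora` actually wrote its files usually explains why the resources cannot be found.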

doctorsketch commented 2 months ago

I found upgrading from NLTK 3.8.1 to 3.9.1 broke my project. I now get errors asking me to:

python -m textblob.download_corpora

Previously you could download textblob corpora on one account and it could be found by another account. This is no longer the case.

Moving back to NLTK 3.8.1 fixed it. I can reproduce the issue by upgrading to 3.9.1 again.
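Since the regression tracks the NLTK version, it helps to confirm which version is actually in use before debugging further. A stdlib-only sketch (`importlib.metadata` is available on Python 3.8+; the version parser here is a deliberately minimal assumption, not a full PEP 440 parser):

```python
from importlib.metadata import version, PackageNotFoundError

def parse_version(v):
    """Extract the leading numeric components of 'X.Y.Z' into a tuple."""
    nums = []
    for part in v.split("."):
        digits = "".join(ch for ch in part if ch.isdigit())
        if not digits:
            break
        nums.append(int(digits))
    return tuple(nums)

try:
    installed = version("nltk")
except PackageNotFoundError:
    installed = None

if installed and parse_version(installed) >= (3, 9):
    print(f"NLTK {installed}: this thread reports corpora lookup issues on 3.9.x")
```

Pinning back (e.g. `pip install "nltk==3.8.1"`) is then the workaround reported above.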

Ajaychaki2004 commented 2 weeks ago

The problem is due to the NLTK version; moving back to NLTK 3.8.1 can help rectify the error.

doctorsketch commented 1 week ago

To follow up on this, I fixed it by specifying the NLTK data path and telling NLTK where to look like this:

def download_nltk_resources(self):
    """
    Downloads required NLTK resources if not already present.
    """
    import nltk
    import os

    # Use the environment variable or fall back to default
    nltk_data_path = os.getenv('NLTK_DATA', '/usr/local/share/nltk_data')

    # Ensure the directory exists
    os.makedirs(nltk_data_path, exist_ok=True)

    # Add our path to NLTK's data path
    nltk.data.path.insert(0, nltk_data_path)

    print(f"Using NLTK data path: {nltk_data_path}")

    required_resources = {
        'averaged_perceptron_tagger': ('taggers', 'averaged_perceptron_tagger'),
        'averaged_perceptron_tagger_eng': ('taggers', 'averaged_perceptron_tagger_eng'),
        'punkt': ('tokenizers', 'punkt'),
        'punkt_tab': ('tokenizers/punkt_tab', 'english'),
        'movie_reviews': ('corpora', 'movie_reviews'),
        'brown': ('corpora', 'brown'),
        'conll2000': ('corpora', 'conll2000'),
        'wordnet': ('corpora', 'wordnet')
    }

    # Download and verify all resources
    for resource, (folder, name) in required_resources.items():
        try:
            nltk.data.find(f'{folder}/{name}')
        except LookupError:
            print(f"Downloading {resource}...")
            nltk.download(resource, download_dir=nltk_data_path, quiet=True)

with NLTK_DATA specified as an environment variable.
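One detail worth noting: if the variable is set from inside Python rather than in the shell, it should be set before `nltk` is imported, since NLTK builds its data search path when the module is first loaded. A small sketch (the shared directory is an example, not a required location):

```python
import os

# Hypothetical shared location; match it to wherever the corpora were
# actually downloaded on your machine.
SHARED_DATA = "/usr/local/share/nltk_data"

# Set the variable before any nltk import; nltk reads NLTK_DATA when
# building its search path at module load time.
os.environ.setdefault("NLTK_DATA", SHARED_DATA)

# import nltk  # safe to import from here on; nltk.data.path should now
#              # include SHARED_DATA in its search list.
```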

Then do something like this:

try:
    # Download resources only once at the start
    if not hasattr(TextParser, '_resources_checked'):
        self.download_nltk_resources()
        TextParser._resources_checked = True
except Exception as exc:
    # Surface setup failures here instead of failing later at parse time
    print(f"NLTK resource setup failed: {exc}")
    raise
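The class-level flag makes the (potentially slow) resource check run once per process, no matter how many parser instances are created. A self-contained sketch of that guard (the `TextParser` name comes from the snippet above; the download body is stubbed out for illustration):

```python
class TextParser:
    """Sketch of the run-once guard above; the download body is a stub."""

    download_calls = 0  # instrumentation for this sketch only

    def download_nltk_resources(self):
        # Stand-in for the real download/verify logic shown earlier.
        TextParser.download_calls += 1

    def __init__(self):
        # The class attribute is checked and set once per process, so
        # later instances skip the download entirely.
        if not hasattr(TextParser, "_resources_checked"):
            self.download_nltk_resources()
            TextParser._resources_checked = True

TextParser()
TextParser()
print(TextParser.download_calls)  # -> 1
```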