jbesomi / texthero

Text preprocessing, representation and visualization from zero to hero.
https://texthero.org
MIT License

Avoid downloading spaCy models when only using the preprocessing module #120


leotok commented 4 years ago

Hi,

I want to use Texthero to preprocess text on my API, but every time a new instance of the API is started, it has to download en-core-web-sm from spaCy even though I'm only using the preprocessing module.

Is there a way to avoid this download?

Thanks!

Ex:

from texthero import preprocessing  # <-- should this import download all models?

Running the API:

Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.3.1-py3-none-any.whl size=12047105 sha256=235a27e79e49d482c125ef1e64f1bf1082ecdf6b2e46bd6da2a771f66fb21410
  Stored in directory: /tmp/pip-ephem-wheel-cache-svqxhr81/wheels/10/6f/a6/ddd8204ceecdedddea923f8514e13afb0c1f0f556d2c9c3da0
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-2.3.1
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')

jbesomi commented 4 years ago

Hi @leotok, thank you for reaching out!

Which preprocessing functions do you need?

I'm not sure it will be possible to find a solution. Very soon we will probably improve the tokenization algorithm: instead of simply using a regex, we will use spaCy, which means that even for the preprocessing module it might be necessary to download en_core_web_sm.

"Every time a new API instance is started", that the same as saying every time the instance install texthero? pip install texthero?

I'm open to new suggestions; how would you solve this?

Regards,

leotok commented 4 years ago

Hi @jbesomi, thanks for the quick response!

Currently, I'm only using these functions, which I believe shouldn't require any external resources:

import texthero as hero
from texthero import preprocessing

custom_pipeline = [
    preprocessing.remove_whitespace,
    preprocessing.remove_punctuation,
    preprocessing.remove_diacritics,
    preprocessing.remove_html_tags,
    preprocessing.lowercase,
]
text = hero.clean(text, custom_pipeline)

I found a way to overcome this issue by adding all the package downloads to my Dockerfile, preventing the API from downloading them "on demand".

RUN python3.6 -m spacy download en_core_web_sm
RUN python3.6 -m nltk.downloader stopwords

"Every time a new API instance is started", that the same as saying every time the instance install texthero? pip install texthero?

The texthero package was already installed in the Docker image, but the spaCy and nltk resources were not. With this new approach, it's no longer a problem.

Maybe there could be a hint about this somewhere in the README.md; what do you think?


Also, now that I've solved this installation issue, I'm facing a new problem. Although I use just a small (but great!) fraction of the lib, my Docker container got a lot bigger because of these "unused" models, and I had to increase the memory reserved for my instances inside Kubernetes.

I understand that there is a dependency on both spaCy and nltk because of the tokenization and stopwords modules, but maybe the download verification could be triggered only when a function that uses these resources is imported. Maybe there could be a function decorator that checks whether a resource is already installed and installs it if needed.

Something like this:

# preprocessing.py

from typing import Optional, Set

import pandas as pd


@needs(lib='spacy', resource='en_core_web_sm')
def tokenize(s: pd.Series) -> pd.Series:
    ...  # does tokenization


@needs(lib='nltk', resource='stopwords')
def replace_stopwords(
    s: pd.Series, symbol: str, stopwords: Optional[Set[str]] = None
) -> pd.Series:
    ...  # replaces stopwords
This needs decorator would run at import time, check whether the resource is present, and download it if not. That way resources wouldn't have to be downloaded eagerly, as is currently done in the stopwords.py file.
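
For concreteness, here is a minimal sketch of what such a needs decorator might look like. This is hypothetical, not texthero code; it assumes spacy.util.is_package / spacy.cli.download and nltk.data.find / nltk.download as the availability checks.

def needs(lib: str, resource: str):
    def decorator(func):
        # The check runs once, when the decorated module is imported,
        # not on every call -- so no wrapper function is needed.
        if lib == 'spacy':
            import spacy
            if not spacy.util.is_package(resource):
                spacy.cli.download(resource)
        elif lib == 'nltk':
            import nltk
            try:
                nltk.data.find('corpora/' + resource)
            except LookupError:
                nltk.download(resource)
        return func
    return decorator

One caveat of this design: since the check runs at import time, importing preprocessing would still trigger the downloads for every decorated function in the module; the per-module layout in the next suggestion avoids that.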

Another suggestion is that preprocessing could become a directory, where functions that require downloaded resources live in their own modules, with the download verification inside each module. This way the download would only be triggered when importing the modules that need it.

## preprocessing/tokenize.py

# tries to download resources here
def tokenize(s: pd.Series) -> pd.Series:
    pass

## preprocessing/stopwords.py

# tries to download resources here
def replace_stopwords(s: pd.Series) -> pd.Series:
    pass

## preprocessing/regex_based.py

# **doesn't** need to download resources here
def remove_whitespace(s: pd.Series) -> pd.Series:
    pass
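
For the "tries to download resources here" part, a minimal sketch of what preprocessing/tokenize.py could contain (again hypothetical, using the same spaCy helpers as above):

# preprocessing/tokenize.py
import pandas as pd
import spacy

_MODEL = 'en_core_web_sm'

# Runs once, on first import of this module only; modules like
# regex_based.py never trigger it. (Some spaCy versions may need a
# fresh interpreter after an in-process download.)
if not spacy.util.is_package(_MODEL):
    spacy.cli.download(_MODEL)

_nlp = spacy.load(_MODEL)

def tokenize(s: pd.Series) -> pd.Series:
    # Tokenize each document with the spaCy pipeline.
    return s.apply(lambda text: [token.text for token in _nlp(text)])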

What do you think about these suggestions?

Thanks for helping out!

jbesomi commented 4 years ago

Hey Leo,

You are making great observations, thank you for sharing!

I agree with you that it's sometimes inefficient (and also annoying) that the lib downloads resources that aren't strictly necessary. Both of your solutions seem interesting to me; if you want to investigate this further, I will be happy to review your work.

Some extra thoughts:

nltk:

We might want to get rid of nltk completely. Some users have had trouble using it (see for instance #32). As Texthero only requires the nltk stopwords list, we could save it in a file/Python list and remove the nltk dependency.
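
A sketch of how that could work (the file name and location are assumptions, not texthero's actual layout): a maintainer regenerates the list once with nltk, and texthero only ever reads the bundled file at runtime.

# One-off generation script, run by a maintainer (the only place nltk is needed):
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

with open('texthero/data/en_stopwords.txt', 'w') as f:
    f.write('\n'.join(sorted(stopwords.words('english'))))

# At runtime, texthero just reads the bundled file -- no nltk dependency:
with open('texthero/data/en_stopwords.txt') as f:
    DEFAULT_STOPWORDS = set(f.read().splitlines())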

spaCy models:

We are realizing more and more that the simplest and most elegant way to preprocess text data is by first tokenizing the input. Tokenization is therefore crucial (and one of the least considered parts of the preprocessing phase, imo). The current tokenization algorithm is based on a simple regular expression, but its accuracy is poor. We are seriously considering switching to a more robust solution with spaCy (see #131).
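
To illustrate the accuracy gap, compare a naive regex tokenizer with a spaCy-based one (the regex below is a simplistic stand-in, not texthero's actual pattern, and the spaCy variant assumes en_core_web_sm is installed):

import pandas as pd
import spacy

def tokenize_regex(s: pd.Series) -> pd.Series:
    # Naive: keep only runs of word characters.
    return s.str.findall(r'\w+')

# Disable pipeline components we don't need for tokenization.
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'ner'])

def tokenize_spacy(s: pd.Series) -> pd.Series:
    return pd.Series(
        [[token.text for token in doc] for doc in nlp.pipe(s.astype(str))],
        index=s.index,
    )

s = pd.Series(["Don't break U.K. or https://texthero.org!"])
print(tokenize_regex(s)[0])  # ['Don', 't', 'break', 'U', 'K', 'or', 'https', 'texthero', 'org']
print(tokenize_spacy(s)[0])  # roughly: ['Do', "n't", 'break', 'U.K.', 'or', 'https://texthero.org', '!']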

We are also considering having all preprocessing functions, with the exception of tokenize, receive as input an already tokenized Series (TokenSeries). This brings some advantages, especially when dealing with non-Western languages (see for instance our comments in #128 and #84).

In that case, downloading the spaCy model would be required in 90% of the cases.