leotok opened this issue 4 years ago
Hi @leotok, thank you for reaching out! Which preprocessing functions do you need?
I'm not sure it will be possible to find a solution. Probably, very soon we will improve the tokenization algorithm and, instead of simply using a regex, we will use spaCy. That means that even for the preprocessing module it might be necessary to download en_core_web_sm.
"Every time a new API instance is started": is that the same as saying every time the instance installs Texthero (pip install texthero)?
I'm open to suggestions; how would you solve this?
Regards,
Hi @jbesomi, thanks for the quick response!
Currently, I'm only using these functions, which I believe shouldn't require any outside resources:
import texthero as hero
from texthero import preprocessing

custom_pipeline = [
    preprocessing.remove_whitespace,
    preprocessing.remove_punctuation,
    preprocessing.remove_diacritics,
    preprocessing.remove_html_tags,
    preprocessing.lowercase,
]
text = hero.clean(text, custom_pipeline)
I found a way to work around this issue by adding all the package downloads to my Dockerfile, preventing the API from downloading them "on demand":
RUN python3.6 -m spacy download en_core_web_sm
RUN python3.6 -m nltk.downloader stopwords
"Every time a new API instance is started": is that the same as saying every time the instance installs Texthero (pip install texthero)?
The texthero package was already installed in the Docker image, but the spaCy and NLTK resources were not. With this new approach, it is no longer a problem.
Maybe there could be a hint about this somewhere in the README.md; what do you think?
Also, now that I've solved the installation issue, I'm facing a new problem. Although I use just a small (but great!) fraction of the lib, my Docker container got a lot bigger because of these "unused" models, and I had to increase the memory reserved for my instances inside Kubernetes.
I understand that there is a dependency on both spaCy and NLTK because of the tokenization and stopwords modules, but maybe the download check could be triggered only when a function that uses these resources is imported. For example, a function decorator could check whether a resource is already installed and install it if needed.
Something like this:
# preprocessing.py
from typing import Optional, Set

import pandas as pd

@needs(lib="spacy", resource="en_core_web_sm")  # `needs` is the proposed decorator
def tokenize(s: pd.Series) -> pd.Series:
    ...  # does tokenization

@needs(lib="nltk", resource="stopwords")
def replace_stopwords(
    s: pd.Series, symbol: str, stopwords: Optional[Set[str]] = None
) -> pd.Series:
    ...  # replaces stopwords
This needs decorator would run at import time, check whether the resource is present, and download it if it isn't. That way it wouldn't be necessary to download resources unconditionally, as is currently done in the stopwords.py file.
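A minimal sketch of what such a decorator could look like; the needs name and the check/download logic below are my assumptions, not existing Texthero code:

```python
import nltk
import spacy


def needs(lib: str, resource: str):
    """Hypothetical decorator: make sure `resource` is available before
    the decorated function can be used (the check runs once, at import time)."""

    def ensure_resource():
        if lib == "spacy":
            try:
                spacy.load(resource)
            except OSError:
                # Model not installed yet: fetch it.
                from spacy.cli import download
                download(resource)
        elif lib == "nltk":
            try:
                nltk.data.find(f"corpora/{resource}")
            except LookupError:
                nltk.download(resource)

    def decorator(func):
        ensure_resource()
        return func

    return decorator
```

A variation would be to run the check lazily on the first call instead of at import time, at the cost of a small delay the first time the function is used.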
Another suggestion is that preprocessing could become a package, and functions that require downloaded resources could live in their own modules, with the download check inside each module. That way the download would only be triggered when importing the modules that need it.
## preprocessing/tokenize.py
# tries to download resources here
def tokenize(s: pd.Series) -> pd.Series:
    pass

## preprocessing/stopwords.py
# tries to download resources here
def replace_stopwords(s: pd.Series) -> pd.Series:
    pass

## preprocessing/regex_based.py
# **doesn't** need to download resources here
def remove_whitespace(s: pd.Series) -> pd.Series:
    pass
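To make the idea concrete, the module-level check in preprocessing/stopwords.py could look roughly like the sketch below; the replacement logic is a simplified stand-in, not Texthero's actual implementation:

```python
# hypothetical preprocessing/stopwords.py
import re

import nltk
import pandas as pd

# Resource check happens once, when this module is first imported.
try:
    nltk.data.find("corpora/stopwords")
except LookupError:
    nltk.download("stopwords")

from nltk.corpus import stopwords as _nltk_stopwords

_DEFAULT_STOPWORDS = set(_nltk_stopwords.words("english"))


def replace_stopwords(s: pd.Series, symbol: str = "") -> pd.Series:
    # Replace every stopword token with `symbol`.
    pattern = r"\b(?:" + "|".join(map(re.escape, _DEFAULT_STOPWORDS)) + r")\b"
    return s.str.replace(pattern, symbol, regex=True)
```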
What do you think about these suggestions?
Thanks for helping out!
Hey Leo,
You are making great observations, thank you for sharing!
I agree with you that it is sometimes inefficient (and also annoying) that the lib downloads resources that are not strictly necessary. Both of your solutions seem interesting to me; if you want to investigate this further, I will be happy to review them.
Some extra thoughts:
nltk:
We might want to get rid of nltk completely. Some users have had trouble using it (see for instance #32). As Texthero only requires the nltk stopword list, we can save it in a file/Python list and remove the nltk dependency.
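One possible way to do that would be a one-off script that dumps the list into a plain Python module shipped with the package; the output path and module name below are only placeholders:

```python
# One-off script to vendor the NLTK English stopword list.
import nltk

nltk.download("stopwords")
from nltk.corpus import stopwords

words = sorted(stopwords.words("english"))
with open("texthero/_stopwords_en.py", "w", encoding="utf-8") as f:
    f.write("# Generated from nltk's English stopword list.\n")
    f.write("STOPWORDS = {\n")
    for word in words:
        f.write(f"    {word!r},\n")
    f.write("}\n")
```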
spaCy models:
We are realizing more and more that the simplest and most elegant way to actually preprocess text data is by first tokenizing the input. Tokenization is therefore crucial (and one of the least considered parts of the preprocessing phase, imo). The current tokenization algorithm is based on a simple regular expression, but its accuracy is poor. We are seriously considering switching to a more robust solution with spaCy (see #131).
We are also considering having all preprocessing functions, with the exception of tokenize, receive as input an already tokenized Series (TokenSeries). This brings some advantages, especially when dealing with non-Western languages (see for instance our comments in #128 and #84).
In that case, downloading the spaCy model would be required in 90% of cases.
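For reference, a spaCy-based tokenize returning a Series of token lists might look roughly like the sketch below; this is only an assumption about the direction of #131, not the actual implementation:

```python
import pandas as pd
import spacy

# Assumes en_core_web_sm has already been downloaded.
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])


def tokenize(s: pd.Series) -> pd.Series:
    # Return a "TokenSeries": each cell is a list of token strings.
    return pd.Series(
        [[token.text for token in doc] for doc in nlp.pipe(s.astype(str))],
        index=s.index,
    )
```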
Hi,
I want to use Texthero to preprocess text in my API, but every time a new instance of the API is started, it has to download en_core_web_sm from spaCy, even though I'm only using the preprocessing module. Is there a way to avoid this download?
Thanks!
Ex: running the API: