jbesomi / texthero

Text preprocessing, representation and visualization from zero to hero.
https://texthero.org
MIT License
2.88k stars 240 forks

tokenize with Spacy #131

Open jbesomi opened 4 years ago

jbesomi commented 4 years ago

The current tokenizer is very fast, as it uses a simple regex pattern, but it is also quite imprecise.
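For reference, a regex-based tokenizer similar in spirit to the current one can be sketched as follows (the pattern and function name are illustrative, not texthero's actual code):

```python
import re

import pandas as pd

# Hypothetical pattern: runs of word characters, or single punctuation marks.
# texthero's real pattern differs, but the approach is the same.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize_regex(s: pd.Series) -> pd.Series:
    # Apply the pattern to every document in the Series.
    return s.apply(TOKEN_PATTERN.findall)
```

This is fast because it is a single vectorized pass with no linguistic model, which is also why it mis-handles contractions, abbreviations, and similar cases.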

A better alternative might be to make use of spaCy.

Naively tokenizing a Pandas Series with spaCy is very simple:

import pandas as pd
import spacy

def tokenize_with_spacy(s: pd.Series) -> pd.Series:
    # Optionally disable components we don't need for tokenization:
    # nlp = spacy.load("en_core_web_sm", disable=["ner", "tagger", "parser"])
    nlp = spacy.load("en_core_web_sm")

    tokenized = []
    for doc in nlp.pipe(s):
        tokenized.append(list(map(str, doc)))

    return pd.Series(tokenized, index=s.index)

This should be reasonably fast, as nlp.pipe processes the documents in batches and can parallelize the work.

The reason we haven't implemented this yet is that we want to make sure this solution is fast enough. We want to provide a simple tool to analyze fairly large amounts of text data; say, 100k Pandas rows should take no longer than 15-30 seconds to tokenize ... ?

For now, the task consists in:

  1. Compare the "spaCy solution" against the current version and benchmark the function on large datasets (150k rows or so)
    1. This should be done in a single, clean notebook that will be shared here
  2. Evaluate whether we can do better by parallelizing the process even further (we can probably parallelize both by row and by sentence?)
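A minimal timing harness for step 1 could look like this (function names and the synthetic corpus are illustrative; real benchmarks should use a realistic dataset):

```python
import time

import pandas as pd

def benchmark(tokenize_fn, s: pd.Series) -> float:
    """Return wall-clock seconds for one tokenization pass over the Series."""
    start = time.perf_counter()
    tokenize_fn(s)
    return time.perf_counter() - start

def make_corpus(n_rows: int) -> pd.Series:
    # Synthetic corpus: repeat a sample sentence to reach the target row count.
    return pd.Series(["Texthero makes text preprocessing easy."] * n_rows)
```

Running `benchmark(tokenize_with_spacy, make_corpus(150_000))` and comparing it against the current regex implementation would directly answer the 15-30 second question above.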
jbesomi commented 4 years ago

Dask vs. spaCy

Is it faster to use pipe from spaCy, or to use Dask (Dask DataFrame) directly?

Dask might be the solution we were looking for ...

mk2510 commented 4 years ago

As described in #162, Dask is not feasible from a UX perspective. Here are our results from experimenting with tokenize. See the attached PDF for a notebook of the results.

Speed Comparison

We now compare:

  1. current implementation without parallelization
  2. current implementation with parallelization (see #162)
  3. tokenize_with_spacy with spaCy's built-in parallelization through n_process
  4. tokenize_with_spacy with our custom parallelization
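The custom parallelization in item 4 can be sketched by chunking the Series and mapping chunks over a worker pool (hypothetical helper; `str.split` stands in for the real spaCy call on each chunk):

```python
from multiprocessing import Pool

import pandas as pd

def _tokenize_chunk(chunk: pd.Series) -> pd.Series:
    # Stand-in tokenizer; the real version would run spaCy over the chunk.
    return chunk.str.split()

def tokenize_parallel(s: pd.Series, n_jobs: int = 1) -> pd.Series:
    if n_jobs == 1:
        return _tokenize_chunk(s)
    # Split the Series into roughly equal contiguous chunks, one per worker.
    chunk_size = -(-len(s) // n_jobs)  # ceiling division
    chunks = [s.iloc[i:i + chunk_size] for i in range(0, len(s), chunk_size)]
    with Pool(n_jobs) as pool:
        return pd.concat(pool.map(_tokenize_chunk, chunks))
```

Chunking by row keeps the original index intact after pd.concat, so the result aligns with the input Series.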

Results below.

We can see that

Thus, our options:

  1. keep everything as proposed in #162 (-> multiprocessing applied to current solution)
  2. same as option 1, but we additionally give users a parameter use_spacy that works like our tokenize_with_spacy_own_parallelization above, and document that it may give better results but takes about 3x as long.
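Option 2 could expose an API along these lines (a sketch; the default path is a stand-in for the current regex tokenizer, and the spaCy path mirrors the function earlier in the thread):

```python
import pandas as pd

def tokenize(s: pd.Series, use_spacy: bool = False) -> pd.Series:
    """Hypothetical API: fast default tokenizer, slower spaCy path on request."""
    if use_spacy:
        # Import lazily so spaCy is only required when explicitly requested.
        import spacy
        nlp = spacy.load("en_core_web_sm")
        return pd.Series([[t.text for t in doc] for doc in nlp.pipe(s)],
                         index=s.index)
    # Stand-in for the current regex-based tokenizer.
    return s.str.split()
```

The lazy import keeps spaCy (and its model download) out of the default path, which matters for the UX concerns raised in #162.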

We don't really have a preference.