stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Stopwords/punctuation removal #346

Closed lorellav closed 4 years ago

lorellav commented 4 years ago

Hi, we would like to use Stanza for the pre-processing stages, including removal of stopwords, punctuation, and special characters. We noticed that this step does not seem to be part of the pipeline. We are currently doing it with NLTK, but we found that running only the Stanza lemmatization processor afterwards is very slow. Is there a workaround for this? With spaCy, for instance, there is a workaround to do just that; please see below.

# load the small Italian spaCy model with tagger/parser/NER disabled
import it_core_news_sm
it_nlp = it_core_news_sm.load(disable=['tagger', 'parser', 'ner'])

# lemmatization function: doc is a list of tokens, each token is lemmatized individually
def lemmatize(doc):
  lemmatized_doc = []
  for w in doc:
    w_lemma = [token.lemma_ for token in it_nlp(w)]
    lemmatized_doc.append(w_lemma[0])
  return lemmatized_doc

# add column with lemmatized tokens
sources['tokens_lemmatized'] = sources['tokens_prep_nostop'].apply(lambda x: lemmatize(x))

Process used with Stanza:

import stanza

nlp = stanza.Pipeline(lang='it', processors='tokenize,mwt,pos,lemma', tokenize_pretokenized=True)

# lemmatization function: doc is a list of tokens, passed as a single pretokenized sentence
def lemmatize(doc):
  nlpd_doc = nlp([doc])
  output = []
  for sentence in nlpd_doc.sentences:
    w_lemma = [word.lemma for word in sentence.words]
    output.append(w_lemma)
  return output
If I try it on one doc, it works fine:
example_doc = sources['doc_prep_nostop'].iloc[0]
test_lemma = lemmatize(example_doc)

On the entire sample data it also works fine, but it needs 5 to 10 minutes to run vs. a few seconds with spaCy:

sources['doc_lemmatized'] = sources['doc_prep_nostop'].apply(lambda x: lemmatize(x))

Many thanks, Lorella.

qipeng commented 4 years ago

One of the reasons is that Stanza's lemmatizer involves an accurate neural model, while spaCy's is a dictionary lookup (last time I checked), so the former is bound to be slower. Also, if you're just filtering out stop words and special symbols, do you actually need the POS tagger and lemmatizer to be run as well?
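
For the filtering step on its own, a tokenize-only pipeline should be enough. A rough sketch of what that could look like (the NLTK Italian stopword list and the clean_tokens helper are just illustrative choices, not part of Stanza):

import string
import stanza
from nltk.corpus import stopwords   # assumes nltk.download('stopwords') has been run

tok_nlp = stanza.Pipeline(lang='it', processors='tokenize')
stop_words = set(stopwords.words('italian'))

def clean_tokens(text):
  # keep only tokens that are neither stopwords nor pure punctuation
  doc = tok_nlp(text)
  return [word.text
          for sentence in doc.sentences
          for word in sentence.words
          if word.text.lower() not in stop_words
          and not all(ch in string.punctuation for ch in word.text)]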

lorellav commented 4 years ago

No, exactly, we don't. We need to tokenise, remove stopwords, punctuation & special characters, and lemmatise. Is there a way to disable/exclude the POS tagger in Stanza? Thanks.

qipeng commented 4 years ago

@lorellav If you need to lemmatize, it's unfortunately not possible to remove the tagger: the lemmatizer depends on POS information.
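
If the tagger has to stay in the pipeline, one thing that may still help with the speed reported above is to run the whole column through a single nlp() call instead of calling it once per row, so the neural models can batch their work. A rough sketch, assuming each entry of sources['doc_prep_nostop'] is a flat list of tokens as in the code above (so the pretokenized pipeline treats each document as one sentence):

import stanza

nlp = stanza.Pipeline(lang='it', processors='tokenize,mwt,pos,lemma', tokenize_pretokenized=True)

all_docs = list(sources['doc_prep_nostop'])   # list of token lists, one per document
processed = nlp(all_docs)                     # one pipeline call over everything
# one sentence per input document, so lemmas can be mapped back row by row
sources['doc_lemmatized'] = [[word.lemma for word in sentence.words]
                             for sentence in processed.sentences]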

lorellav commented 4 years ago

Thank you, we thought so, but just wanted to be extra sure. All the best, Lorella.