Closed lorellav closed 4 years ago
One of the reasons is that Stanza's lemmatizer involves an accurate neural model, while SpaCy's is a dictionary lookup last time I checked. So the former is bound to be slower. Also, if you're just filtering out stop words and special symbols, do you actually need the POS tagger and lemmatizer to be run as well?
No, exactly, we don't. We need to tokenise, remove stopwords, punctuation & special characters, and lemmatise. Is there a way to disable/exclude the POS tagger in Stanza? Thanks.
@lorellav If you need to lemmatize it's actually not possible to remove the tagger unfortunately. The lemmatizer depends on POS information.
Thank you, we thought so, but just wanted to be extra sure. All the best, Lorella.
Hi, We would like to use Stanza to do the pre-processing stages including stopwords/punctuation/special characters removal. We noticed that this step does not seem to be part of the pipeline. We are then performing this step with NLTK, but we found out that loading only the Stanza lemmatization processor afterwards performs very slowly. Is there a workaround to this issue? For instance, with spaCy there is a workaround to do just that. Please see below.
On the entire sample data this works fine, but it needs 5 to 10 minutes to run versus a few seconds in spaCy.
Many thanks, Lorella.