mnemonic-no / act-scio2

Scio v2 is a reimplementation of Scio in Python3
ISC License

Performance of NLP modules #24

Closed plison closed 4 years ago

plison commented 4 years ago

This is more a suggestion than an actual issue, but I see that much of the processing time is spent running NLP modules from NLTK. I would actually recommend dropping NLTK altogether -- it's a rather outdated piece of software by now, and it was never designed for processing large document bases (it was primarily meant for educational purposes, as part of NLP courses).

I am a big fan of spacy (www.spacy.io). It's very fast, reliable, contains all the standard NLP modules you might need (tokeniser, POS tagger, lemmatiser, NER, parser, etc.), and has a developer-friendly interface. The accuracy of these NLP modules is also definitely better than NLTK's (spacy relies on deep learning models). Plus, spacy now supports a dozen languages or so. Give it a try and you will never look back at NLTK :-)
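For illustration, here is a minimal sketch of what a spacy pipeline looks like. The runnable part uses a blank English pipeline, which needs no model download; the model name `en_core_web_sm` in the comments is spacy's standard small English model and must be fetched separately:

```python
import spacy

# A blank English pipeline provides the rule-based tokeniser only;
# it requires no separate model download.
nlp = spacy.blank("en")
doc = nlp("spaCy tokenises text quickly.")
print([t.text for t in doc])  # ['spaCy', 'tokenises', 'text', 'quickly', '.']

# With a trained model (e.g. `python -m spacy download en_core_web_sm`),
# the same call also yields POS tags, lemmas and named entities:
#   nlp = spacy.load("en_core_web_sm")
#   doc = nlp("Apple is buying a U.K. startup.")
#   for tok in doc:
#       print(tok.text, tok.pos_, tok.lemma_)
#   for ent in doc.ents:
#       print(ent.text, ent.label_)
```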

frbor commented 4 years ago

Hi Pierre,

We evaluated both NLTK and spacy early on in this rewrite of scio. I cannot say that we found any differences in accuracy in the testing we did, and speed has not been a major concern for us, nor has support for languages other than English.

So at the moment we do not plan to switch out this component, but it could of course be done in the future, either as an addition or as a replacement, and others are also welcome to look into this.

best regards,

Fredrik