mnemonic-no / act-scio2

Scio v2 is a reimplementation of Scio in Python3
ISC License

Performance of NLP modules #24

Closed plison closed 4 years ago

plison commented 4 years ago

This is more a suggestion than an actual issue, but I see that much of the processing time is spent running NLP modules from NLTK. I would actually recommend dropping NLTK altogether -- it's a rather outdated piece of software by now, and it was never designed for processing large document bases (it was primarily meant for educational purposes, as part of NLP courses).

I am a big fan of spacy (www.spacy.io). It's very fast, reliable, contains all the standard NLP modules you might need (tokeniser, POS tagger, lemmatiser, NER, parser, etc.), and has a developer-friendly interface. The accuracy of these NLP modules is also definitely better than NLTK's (spacy relies on deep learning models). Plus, spacy now supports a dozen languages or so. Give it a try and you will never look back at NLTK :-)
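For illustration, here is a minimal sketch of what a spacy pipeline looks like. The runnable part uses a blank English pipeline, which needs no model download; the model name `en_core_web_sm` in the comments is spacy's standard small English model and must be fetched separately:

```python
import spacy

# A blank English pipeline provides the rule-based tokeniser only;
# it requires no separate model download.
nlp = spacy.blank("en")
doc = nlp("spaCy tokenises text quickly.")
print([t.text for t in doc])  # ['spaCy', 'tokenises', 'text', 'quickly', '.']

# With a trained model (e.g. `python -m spacy download en_core_web_sm`),
# the same call also yields POS tags, lemmas and named entities:
#   nlp = spacy.load("en_core_web_sm")
#   doc = nlp("Apple is buying a U.K. startup.")
#   for tok in doc:
#       print(tok.text, tok.pos_, tok.lemma_)
#   for ent in doc.ents:
#       print(ent.text, ent.label_)
```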

frbor commented 4 years ago

Hi Pierre,

We evaluated both NLTK and spacy early on in this rewrite of scio. I cannot say that we found any differences in accuracy in the testing we did, and speed has not been a major concern for us, nor has support for languages other than English.

So at the moment we do not plan to switch out this component, but it could of course be done in the future, either as an addition or as a replacement, and others are also welcome to look into this.

best regards,

Fredrik