IsaacHaze opened this issue 10 years ago
isaac@u024529 [master] git grep nltk
README.md: * nltk
semanticizer/processors/semanticize.py:from nltk import regexp_tokenize
semanticizer/processors/semanticize.py:from nltk.util import ngrams as nltk_ngrams
semanticizer/processors/semanticize.py: for ngram in nltk_ngrams(token_list, n):
semanticizer/processors/semanticizer.py:from nltk.tokenize.punkt import PunktSentenceTokenizer
isaac@u024529 [master]
Yes, would be nice to remove this dependency, but the last point is indeed not that easy to remove. We do want to support longer documents, so we need some kind of sentence splitting built in. Unless of course @larsmans comes up with his fancy new super fast matching algorithm...
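If keeping some built-in sentence splitting is the blocker, a naive regex-based splitter could serve as the optional fallback. A rough sketch (the function name is illustrative, not part of the semanticizer API, and this is far less robust than PunktSentenceTokenizer on abbreviations and quotes):

```python
import re

# Break after ., ! or ? when followed by whitespace and an uppercase letter.
_SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Naive sentence splitter. Mishandles abbreviations like 'Dr. Smith'
    -- Punkt is trained specifically to handle those cases."""
    return _SENT_BOUNDARY.split(text.strip())
```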
He has a fancy super fast matching algorithm? I thought he did a levenshtein implementation?
He can multitask and said he wanted to do something about the matching today…
+1
Did my big mouth speak for itself again?
Currently we pull in nltk for:

* regexp_tokenize for tokenization
* nltk.util.ngrams for n-gram generation
* PunktSentenceTokenizer for sentence splitting

The first two points are easy; the last one can (should?) be made optional (for when you're dealing with documents and want to split them into sentences).
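The first two uses need nothing beyond the standard library. A minimal sketch of drop-in replacements (the default pattern here is an assumption for illustration, not the one semanticizer actually passes):

```python
import re

def regexp_tokenize(text, pattern=r"\w+|[^\w\s]+"):
    """Stdlib replacement for nltk.regexp_tokenize: return all
    non-overlapping matches of `pattern` in `text`."""
    return re.findall(pattern, text)

def ngrams(tokens, n):
    """Stdlib replacement for nltk.util.ngrams: yield tuples of
    n consecutive tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])
```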