IsaacHaze opened this issue 10 years ago
isaac@u024529 [master] git grep nltk
README.md: * nltk
semanticizer/processors/semanticize.py:from nltk import regexp_tokenize
semanticizer/processors/semanticize.py:from nltk.util import ngrams as nltk_ngrams
semanticizer/processors/semanticize.py: for ngram in nltk_ngrams(token_list, n):
semanticizer/processors/semanticizer.py:from nltk.tokenize.punkt import PunktSentenceTokenizer
isaac@u024529 [master]
Yes, would be nice to remove this dependency, but the last point is indeed not that easy to remove. We do want to support longer documents, so we need some kind of sentence splitting built in. Unless of course @larsmans comes up with his fancy new super fast matching algorithm...
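If keeping some built-in sentence splitting is the blocker, a naive regex-based splitter could serve as the optional fallback. A rough sketch (the function name is illustrative, not part of the semanticizer API, and this is far less robust than PunktSentenceTokenizer on abbreviations and quotes):

```python
import re

# Break after ., ! or ? when followed by whitespace and an uppercase letter.
_SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Naive sentence splitter. Mishandles abbreviations like 'Dr. Smith'
    -- Punkt is trained specifically to handle those cases."""
    return _SENT_BOUNDARY.split(text.strip())
```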
He has a fancy super fast matching algorithm? I thought he did a levenshtein implementation?
He can multitask and said he wanted to do something about the matching today…
+1
Did my big mouth speak for itself again?
Currently we pull in nltk for:

* regexp_tokenize for tokenization
* nltk.util.ngrams for n-gram generation
* PunktSentenceTokenizer for sentence splitting

The first two points are easy; the last one can (should?) be made optional (for when you're dealing with documents and want to split them into sentences).
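The first two uses need nothing beyond the standard library. A minimal sketch of drop-in replacements (the default pattern here is an assumption for illustration, not the one semanticizer actually passes):

```python
import re

def regexp_tokenize(text, pattern=r"\w+|[^\w\s]+"):
    """Stdlib replacement for nltk.regexp_tokenize: return all
    non-overlapping matches of `pattern` in `text`."""
    return re.findall(pattern, text)

def ngrams(tokens, n):
    """Stdlib replacement for nltk.util.ngrams: yield tuples of
    n consecutive tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])
```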