Closed arademaker closed 6 years ago
In fact, most of that is in core: see the exports at https://github.com/vseloved/cl-nlp/blob/master/src/packages.lisp#L115 (implementation here: https://github.com/vseloved/cl-nlp/blob/master/src/core/ngrams.lisp), although the API may have changed slightly since the blog post was written.
There's also a small contrib package for working with the Microsoft ngrams service (although I'm not sure whether it's still operational, as such companies tend to change their APIs in backward-incompatible ways).
As for computing ngrams from files with one sentence per line, there are indeed no utilities like that in cl-nlp. However, it's quite easy to implement; for example, here's one possible code snippet:
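    (defun compute-ngrams (files &key (order 2))
      "Count ngrams of the given ORDER in FILES (one sentence per line)."
      (let ((ngrams (make-hash-table :test 'equal)))
        (dolist (file files)
          (rutils:dolines (line file)
            (loop :for tail :on (nlp:tokenize nlp:<word-tokenizer> line)
                  ;; stop once fewer than ORDER tokens remain in TAIL
                  :while (nth (1- order) tail) :do
                  ;; key: the first ORDER tokens of TAIL, joined by spaces
                  (incf (gethash (format nil "~{~A~^ ~}" (subseq tail 0 order))
                                 ngrams 0)))))
        (make-instance 'nlp:table-ngrams :order order
                                         :table ngrams)))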
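To try the same counting idea without cl-nlp or rutils loaded, here is a minimal standalone sketch in plain Common Lisp. It is not the library's API: splitting on spaces and tabs stands in for nlp:tokenize (so punctuation stays attached to words), a bare EQUAL hash table stands in for nlp:table-ngrams, and the names split-on-whitespace, count-line-ngrams and count-file-ngrams are made up for illustration.

```lisp
;; Assumption: whitespace splitting is a crude stand-in for a real tokenizer.
(defun split-on-whitespace (line)
  "Split LINE into a list of tokens on spaces and tabs (hypothetical helper)."
  (let ((tokens '())
        (start nil))
    (dotimes (i (length line))
      (if (member (char line i) '(#\Space #\Tab))
          ;; a separator ends the current token, if one is open
          (when start
            (push (subseq line start i) tokens)
            (setf start nil))
          ;; a non-separator opens a token unless one is already open
          (unless start
            (setf start i))))
    ;; flush the last token when the line does not end in whitespace
    (when start
      (push (subseq line start) tokens))
    (nreverse tokens)))

(defun count-line-ngrams (line table &key (order 2))
  "Add the counts of all ORDER-grams in LINE to the EQUAL hash table TABLE."
  (loop :for tail :on (split-on-whitespace line)
        ;; continue only while at least ORDER tokens remain
        :while (nthcdr (1- order) tail)
        :do (incf (gethash (format nil "~{~A~^ ~}" (subseq tail 0 order))
                           table 0)))
  table)

(defun count-file-ngrams (files &key (order 2))
  "Return a hash table of ORDER-gram counts over FILES, one sentence per line."
  (let ((table (make-hash-table :test 'equal)))
    (dolist (file files table)
      (with-open-file (in file)
        (loop :for line := (read-line in nil)
              :while line
              :do (count-line-ngrams line table :order order))))))
```

Because each line is processed on its own, ngrams never span a sentence boundary. Calling (count-file-ngrams '("a.txt" "b.txt") :order 3) on two hypothetical files would return a single table of trigram counts over both.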
As far as I understand, the functions for computing ngrams are not in the core.nlp package yet, right? I found them only in the nltk files:
http://lisp-univ-etc.blogspot.com.br/2013/02/nltk-13-computing-with-language-simple.html
Do you already have a function for computing ngrams over a given set of text files (one sentence per line)?