vseloved / cl-nlp

Common Lisp NLP toolset

ngrams #31

Closed arademaker closed 6 years ago

arademaker commented 6 years ago

As far as I understand, the functions for computing ngrams are not in the core.nlp package yet, right? I found them only in the nltk files:

http://lisp-univ-etc.blogspot.com.br/2013/02/nltk-13-computing-with-language-simple.html

Do you already have a function for computing ngrams over a given set of text files (one sentence per line)?

vseloved commented 6 years ago

In fact, most of that is in core: https://github.com/vseloved/cl-nlp/blob/master/src/packages.lisp#L115 (implementation here: https://github.com/vseloved/cl-nlp/blob/master/src/core/ngrams.lisp), although the API may have changed slightly since the blog post was written.

There's also a small contrib package for working with the Microsoft ngrams service (although I'm not sure whether it's still operational, as such companies tend to change their APIs in backward-incompatible ways).

vseloved commented 6 years ago

As for computing ngrams from files with one sentence per line, there are indeed no utilities like that in cl-nlp. However, it's quite easy to implement. For example, here's one possible code snippet:

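(defun compute-ngrams (files &key (order 2))
  ;; Count ngrams of the given ORDER over FILES (one sentence per line).
  (let ((ngrams (make-hash-table :test 'equal)))
    (dolist (file files)
      (rutils:dolines (line file)
        (loop :for tail :on (nlp:tokenize nlp:<word-tokenizer> line)
              ;; stop once fewer than ORDER tokens remain in the line
              :while (nthcdr (1- order) tail) :do
                ;; the key is the first ORDER tokens joined by spaces
                (incf (gethash (format nil "~{~A~^ ~}" (subseq tail 0 order))
                               ngrams 0)))))
    (make-instance 'nlp:table-ngrams :order order
                                     :table ngrams)))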
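
The sliding-window step can also be tried on its own, without cl-nlp or rutils. The following is only a minimal sketch in plain Common Lisp: `count-ngrams` is a hypothetical helper name, not part of cl-nlp, and it assumes the tokens have already been produced as a list of strings. For each position it takes the first `order` tokens of the remaining tail as the ngram key.

```lisp
;; Hypothetical helper (not part of cl-nlp): count ngrams in a token list.
(defun count-ngrams (tokens &key (order 2))
  "Return an EQUAL hash table mapping space-joined ngrams to their counts."
  (let ((counts (make-hash-table :test 'equal)))
    (loop :for tail :on tokens
          ;; Stop once fewer than ORDER tokens remain.
          :while (nthcdr (1- order) tail)
          :do (incf (gethash (format nil "~{~A~^ ~}" (subseq tail 0 order))
                             counts 0)))
    counts))

;; Usage: bigram counts for a five-token sentence.
(let ((counts (count-ngrams '("the" "cat" "saw" "the" "cat"))))
  (list (gethash "the cat" counts)     ; count of "the cat"
        (gethash "cat saw" counts)     ; count of "cat saw"
        (hash-table-count counts)))    ; number of distinct bigrams
;; => (2 1 3)
```

The resulting hash table has the same shape as the one passed as `:table` in the snippet above, so the two pieces should be straightforward to combine.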