vseloved / cl-nlp

Common Lisp NLP toolset

Following the instructions on writing a POS tagger results in error on text-tokens (CCL::UNDEFINED-FUNCTION-CALL). #26

Closed degyves closed 8 years ago

degyves commented 9 years ago

Some of the instructions given in docs/user-guide/examples/eng-pos-tagger.md fail:

The following code:

NLP> (let ((words-dist #h(equal))
       (map-corpus :ptb-tagged (corpus-file "ptb/TAGGED/POS/WSJ")
                   #`(dolist (sent (text-tokens %))
                       (dolist (tok sent)
                         (unless (in# (token-word tok) words-dist)
                           (:= (get# (token-word tok) words-dist) #h()))
                         (:+ (get# (token-tag tok)
                                   (get# (token-word tok) words-dist)
                                   0))))
                   :ext "POS")
       words-dist)
#<HASH-TABLE :TEST EQUAL :COUNT 51457 {10467E6543}>
NLP> (reduce #'+ (mapcan #'ht-vals (ht-vals *)))
1289201

... appears to be two separate forms: the let and the reduce form.

NLP> (let ((words-dist #h(equal)))
       (map-corpus :ptb-tagged (corpus-file "ptb/TAGGED/POS/WSJ")
                   #`(dolist (sent (text-tokens %))
                       (dolist (tok sent)
                         (unless (in# (token-word tok) words-dist)
                           (:= (get# (token-word tok) words-dist) #h()))
                         (:+ (get# (token-tag tok)
                                   (get# (token-word tok) words-dist)
                                   0))))
                   :ext "POS")
       words-dist)

The first error is that there is no file WSJ under corpora/ptb/TAGGED/POS/.

But if we change it to an existing corpus under corpora/, such as "onf-wsj":

NLP> (let ((words-dist #h(equal)))
       (map-corpus :ptb-tagged (corpus-file "onf-wsj")
                   #`(dolist (sent (text-tokens %))
                       (dolist (tok sent)
                         (unless (in# (token-word tok) words-dist)
                           (:= (get# (token-word tok) words-dist) #h()))
                         (:+ (get# (token-tag tok)
                                   (get# (token-word tok) words-dist)
                                   0))))
                   :ext "POS")
       words-dist)

Then CCL::UNDEFINED-FUNCTION-CALL is signaled. There is no such function.

Any clues?

I'm using Clozure Common Lisp 1.10 on Windows 8 64-bit. Under SBCL, just running the first let produced a thread error.

vseloved commented 9 years ago

The problem with the Penn Treebank corpus is that it's proprietary, so, unfortunately, I can't include it with cl-nlp: you need to get it on your own. Here's a link to the original release: https://catalog.ldc.upenn.edu/LDC99T42. That one is really expensive; however, there was recently an updated release that is much more affordable: https://catalog.ldc.upenn.edu/LDC2015T13 (I haven't looked at it yet, so I don't know if there are any changes to the format).

Now, OntoNotes (the source of onf-wsj) doesn't provide data in the same tagged format as the Penn Treebank, so it doesn't make sense to use (map-corpus :ptb-tagged ...) with it. You can see an example of the tagged Penn Treebank representation here: https://github.com/vseloved/cl-nlp/blob/master/corpora/samples/WSJ_0001.POS
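
For orientation, the tagged Penn Treebank files write each token in word/TAG form. Below is a minimal sketch in plain Common Lisp of splitting one such token, assuming the tag is whatever follows the last slash. The helper name split-tagged-token is hypothetical and is not part of cl-nlp.

```lisp
;; Hypothetical helper, not part of cl-nlp: split a "word/TAG" string
;; at its last slash, so a word that itself contains a slash keeps it.
(defun split-tagged-token (string)
  (let ((pos (position #\/ string :from-end t)))
    (if pos
        ;; Two values: the word, then the tag.
        (values (subseq string 0 pos) (subseq string (1+ pos)))
        ;; No slash at all: return the string unchanged and no tag.
        (values string nil))))

(split-tagged-token "Vinken/NNP")
```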

Finally, regarding the undefined function error, the docs should be updated to reflect the changes made during the recent refactoring: basically, the function text-tokens is now called text-tokenized, and the internal structure has changed to a list of lists of lists (a 3-level paragraph-sentence-token structure).
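
The 3-level structure described above can be walked with three nested loops instead of two. This is a minimal sketch in plain Common Lisp that deliberately avoids the cl-nlp and rutils APIs: sample-token and count-word-tags are hypothetical names, and the real token structure in cl-nlp may differ.

```lisp
;; Stand-in for a token; the real cl-nlp token structure may differ.
(defstruct sample-token word tag)

(defun count-word-tags (paragraphs)
  "Return an EQUAL hash table mapping each word to a table of tag counts.
PARAGRAPHS is a list of paragraphs, each a list of sentences,
each a list of SAMPLE-TOKEN structures."
  (let ((words-dist (make-hash-table :test 'equal)))
    (dolist (par paragraphs words-dist)
      (dolist (sent par)
        (dolist (token sent)
          ;; Fetch the per-word tag table, creating it on first sight.
          (let ((tags (or (gethash (sample-token-word token) words-dist)
                          (setf (gethash (sample-token-word token) words-dist)
                                (make-hash-table :test 'equal)))))
            (incf (gethash (sample-token-tag token) tags 0))))))))

;; Usage: one paragraph with two sentences of made-up tokens.
(defparameter *sample-dist*
  (count-word-tags
   (list (list (list (make-sample-token :word "the" :tag "DT")
                     (make-sample-token :word "run" :tag "NN"))
               (list (make-sample-token :word "run" :tag "VB")
                     (make-sample-token :word "run" :tag "NN"))))))

;; How often "run" was tagged NN in the sample.
(gethash "NN" (gethash "run" *sample-dist*))
```

With the real library, the paragraphs would presumably come from calling text-tokenized on each text passed to map-corpus, but that wiring is not shown here because it depends on the refactored API.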

degyves commented 9 years ago

Also, is the form unbalanced in the docs?

vseloved commented 9 years ago

Yes, that is correct, thanks! Fixed.