turbopape / postagga

A Library to parse natural language in pure Clojure and ClojureScript
MIT License

Improvement of Viterbi algorithm used in tagger.cljc #28

Closed. attil-io closed this issue 6 years ago

attil-io commented 6 years ago

The Viterbi implementation in tagger.cljc keys the probability table T1 by the value of each observation, instead of its index, when storing the newly calculated probabilities:


(assoc T1 [cur-state cur-observation]
; -------------------^ the observation itself is used as the key
       (* (if-let [p (get emissions [cur-state cur-observation])]
            p
            0)
          (reduce max (vals A*T))))))

For example, if the sentence is "Je mange une pomme", then cur-observation takes the values je, mange, une, pomme in turn. This is fine as long as each observation appears only once in the sentence.

However, consider the following example: "Je te montre ma montre". In this sentence, "montre" appears both as a verb (the first occurrence) and as a noun (the second occurrence). Because both occurrences share the same key in T1, the current implementation tags the word either as a verb or as a noun, but it cannot give the two occurrences different tags.
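The key collision can be seen in isolation with a plain map. The snippet below is only an illustration and is not taken from tagger.cljc: the names `sentence`, `by-word`, and `by-index` are hypothetical, and the stored value is simply the position of the word.

```clojure
;; Hypothetical illustration of the key collision; not the library's code.
(def sentence ["je" "te" "montre" "ma" "montre"])

;; Keyed by the word: the second "montre" overwrites the entry of the first.
(def by-word
  (reduce (fn [t [i w]] (assoc t ["V" w] i))
          {}
          (map-indexed vector sentence)))

;; Keyed by the position: every occurrence keeps its own entry.
(def by-index
  (reduce (fn [t [i _]] (assoc t ["V" i] i))
          {}
          (map-indexed vector sentence)))

(count by-word)            ; 4 entries for 5 words
(count by-index)           ; 5 entries
(get by-word ["V" "montre"]) ; 4, the value written for the last occurrence
```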

The following test case reproduces the issue:

(is (= ["P" "P" "V" "P" "N"] (viterbi sample-model ["Je" "Te" "Montre" "Ma" "Montre"])))

(See tagger_test.cljc for sample-model.)

The proposed change is to address T1 by the index of each observation rather than by the observation itself.
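As a rough sketch of what index-based addressing could look like, here is a small self-contained Viterbi decoder whose table is keyed by [state index]. It is not the patch that was applied to postagga: the function name `viterbi-by-index`, the model shape (`:states`, `:init`, `:trans`, `:emit`), and all probabilities in `toy-model` are assumptions made up for this example, as is the lowercasing of the input.

```clojure
(require '[clojure.string :as str])

;; Hypothetical toy model; NOT postagga's model format or its sample-model.
(def toy-model
  {:states ["P" "V" "N"]
   :init   {"P" 0.8 "V" 0.1 "N" 0.1}
   :trans  {["P" "P"] 0.3 ["P" "V"] 0.3 ["P" "N"] 0.4
            ["V" "P"] 0.8 ["V" "V"] 0.1 ["V" "N"] 0.1
            ["N" "P"] 0.2 ["N" "V"] 0.4 ["N" "N"] 0.4}
   :emit   {["P" "je"] 0.4 ["P" "te"] 0.3 ["P" "ma"] 0.3
            ["V" "montre"] 1.0
            ["N" "montre"] 1.0}})

(defn viterbi-by-index
  "Returns the most likely state sequence for observations.
  The table t1 maps [state index] to {:prob p :prev best-predecessor}."
  [{:keys [states init trans emit]} observations]
  (let [obs (mapv str/lower-case observations) ; assumption: lowercase input
        n   (count obs)
        p   (fn [m k] (get m k 0.0))]          ; missing entries count as 0
    (if (zero? n)
      []
      (let [t1 (reduce
                 (fn [t i]
                   (reduce
                     (fn [t s]
                       (let [e (p emit [s (obs i)])]
                         (assoc t [s i] ; keyed by position, not by word
                                (if (zero? i)
                                  {:prob (* (p init s) e) :prev nil}
                                  (let [[prev best]
                                        (apply max-key second
                                               (for [r states]
                                                 [r (* (:prob (t [r (dec i)]))
                                                       (p trans [r s]))]))]
                                    {:prob (* best e) :prev prev})))))
                     t
                     states))
                 {}
                 (range n))
            last-state (apply max-key #(:prob (t1 [% (dec n)])) states)]
        ;; Follow the back-pointers from the last position to the first.
        (loop [i (dec n), s last-state, path ()]
          (if (neg? i)
            (vec path)
            (recur (dec i) (:prev (t1 [s i])) (cons s path))))))))

(viterbi-by-index toy-model ["Je" "Te" "Montre" "Ma" "Montre"])
;; => ["P" "P" "V" "P" "N"]
```

With this toy model the two occurrences of "montre" end up at the separate keys ["V" 2] and ["N" 4], so each can receive its own tag.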

turbopape commented 6 years ago

Thank you very much ! 👍 Are you using the lib or just playing around? I would love to know !

attil-io commented 6 years ago

Just playing around :) In fact, I'm in the process of learning Clojure, and know very little about part-of-speech tagging in general.