simongray / datalinguist

Stanford CoreNLP in idiomatic Clojure.
GNU General Public License v3.0

datafy and recur-datafy throw StackOverflowError #13

Open simongray opened 2 years ago

simongray commented 2 years ago

Seems like it is an infinite loop in the datafy-tsm implementation. Removing the datafy call from (assoc m k (datafy v)) and leaving just v seems to solve it for the regular datafy. This is also how it should be, since datafy shouldn't be recursive.

In the case of recur-datafy, I will need to look further into what's causing it. I guess some sort of memory is needed to avoid this issue.
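
The kind of memory mentioned above could be sketched as follows. This is a minimal sketch and not the library's code: `safe-recur-datafy` and `self-listing` are hypothetical names, and it assumes the overflow comes from an object that hands itself back when it is iterated, which this thread does not confirm. The walk remembers, by identity, which objects are currently being expanded and falls back to `(str x)` when it meets one of them again.

```clojure
(require '[clojure.datafy :refer [datafy]])
(import '(java.util IdentityHashMap))

(defn safe-recur-datafy
  "Recursively datafy `x`, but never re-enter an object that is already
  being expanded further up the call stack (hypothetical helper)."
  ([x] (safe-recur-datafy x (IdentityHashMap.)))
  ([x ^IdentityHashMap in-progress]
   (let [x*   (datafy x)
         step #(safe-recur-datafy % in-progress)]
     (cond
       ;; Not a collection of any kind: nothing to descend into.
       (not (instance? Iterable x*))
       x*

       ;; `x` is already being expanded, so descending again would never
       ;; terminate. Fall back to its string form.
       (.containsKey in-progress x)
       (str x)

       :else
       (do
         (.put in-progress x true)
         (try
           (cond
             (map? x*) (into {} (map (fn [[k v]] [(step k) (step v)])) x*)
             (set? x*) (into #{} (map step) x*)
             :else     (mapv step x*))
           (finally
             ;; Only objects on the current path count as "seen".
             (.remove in-progress x))))))))

;; Hypothetical stand-in for a problematic object: an Iterable whose
;; iterator yields the object itself first, then a label.
(def self-listing
  (reify
    Iterable
    (iterator [this] (.iterator ^Iterable (list this :label)))
    Object
    (toString [_] "(ROOT ...)")))

(safe-recur-datafy self-listing)
```

Tracking only the objects on the current path, instead of everything ever visited, means a value that legitimately appears twice in the data is still expanded both times.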

ag91 commented 1 year ago

In case anybody else is trying the "sentiment" annotator, for instance:

(->> ((->pipeline {:annotators ["sentiment"]}) "Paula gave me 10 dollars. Of those $10 I used only one dollar. That felt bad. But also great.")
     sentences
     (map (comp :sentiment recur-datafy)))

You can redefine recur-datafy like this (I left the debugging in case @simongray wants to try it out):

(in-ns 'dk.simongray.datalinguist)

(defmacro ignore-errors [& body]
  `(try ~@body (catch Exception e#)))

(def my (atom nil))

(defn recur-datafy
  "Return a recursively datafied representation of `x`.
  Call at the end of an annotation chain to get plain Clojure data structures."
  [x]
  (let [x* (datafy x)]
    ;; (prn "WOW---" x*)
    ;; (reset! my x*)
    (cond
      (seq? x*)
      (mapv recur-datafy x*)

      (set? x*)
      (set (map recur-datafy x*))

      (map? x*)
      (ignore-errors (into {} (for [[k v] (dissoc x* :tree/binarized-tree :tree/tree) ;; (select-keys x*
                                    ;;                '(:tree/tree !
                                    ;;                  :token-end
                                    ;;                  :semantic-graph/collapsed-cc-processed-dependencies
                                    ;;                  :token-begin
                                    ;;                  :semantic-graph/basic-dependencies
                                    ;;                  :sentence-index
                                    ;;                  :sentiment
                                    ;;                  :semantic-graph/collapsed-dependencies
                                    ;;                  :character-offset-begin
                                    ;;                  :semantic-graph/enhanced-plus-plus-dependencies
                                    ;; ; :tree/binarized-tree !
                                    ;;                  :semantic-graph/enhanced-dependencies :tokens :character-offset-end :text
                                    ;;                  ))
                                    ]
                                [(recur-datafy k) (recur-datafy v)])))

      ;; Catches nearly all Java collections, including custom CoreNLP ones.
      (instance? Iterable x*)
      (mapv recur-datafy x*)

      :else x*)))

I discarded the :tree/binarized-tree and :tree/tree keys, which seem to cause an infinite recursion. With the prn enabled, I see:

"WOW---" :tree/tree
"WOW---" #object[edu.stanford.nlp.trees.LabeledScoredTreeNode 0x7b20fb2c "(ROOT (S (NP (NNP Paula)) (VP (VBD gave) (NP (PRP me)) (NP (CD 10) (NNS dollars))) (. .)))"]
"WOW---" #object[edu.stanford.nlp.trees.LabeledScoredTreeNode 0x7b20fb2c "(ROOT (S (NP (NNP Paula)) (VP (VBD gave) (NP (PRP me)) (NP (CD 10) (NNS dollars))) (. .)))"]
"WOW---" #object[edu.stanford.nlp.trees.LabeledScoredTreeNode 0x7b20fb2c "(ROOT (S (NP (NNP Paula)) (VP (VBD gave) (NP (PRP me)) (NP (CD 10) (NNS dollars))) (. .)))"]
"WOW---" #object[edu.stanford.nlp.trees.LabeledScoredTreeNode 0x7b20fb2c "(ROOT (S (NP (NNP Paula)) (VP (VBD gave) (NP (PRP me)) (NP (CD 10) (NNS dollars))) (. .)))"]
"WOW---" #object[edu.stanford.nlp.trees.LabeledScoredTreeNode 0x7b20fb2c "(ROOT (S (NP (NNP Paula)) (VP (VBD gave) (NP (PRP me)) (NP (CD 10) (NNS dollars))) (. .)))"]
"WOW---" #object[edu.stanford.nlp.trees.LabeledScoredTreeNode 0x7b20fb2c "(ROOT (S (NP (NNP Paula)) (VP (VBD gave) (NP (PRP me)) (NP (CD 10) (NNS dollars))) (. .)))"]
"WOW---" #object[edu.stanford.nlp.trees.LabeledScoredTreeNode 0x7b20fb2c "(ROOT (S (NP (NNP Paula)) (VP (VBD gave) (NP (PRP me)) (NP (CD 10) (NNS dollars))) (. .)))"]
"WOW---" #object[edu.stanford.nlp.trees.LabeledScoredTreeNode 0x7b20fb2c "(ROOT (S (NP (NNP Paula)) (VP (VBD gave) (NP (PRP me)) (NP (CD 10) (NNS dollars))) (. .)))"]
"WOW---" #object[edu.stanford.nlp.trees.LabeledScoredTreeNode 0x7b20fb2c "(ROOT (S (NP (NNP Paula)) (VP (VBD gave) (NP (PRP me)) 

This means that recurring on the value under the :tree/tree key keeps producing the same object. @simongray, you can reproduce the logging by removing the dissoc and running:


(->> ((->pipeline {:annotators ["sentiment"]}) "Paula gave me 10 dollars. Of those $10 I used only one dollar. That felt bad. But also great.")
     sentences
     first
     recur-datafy)

Maybe it would be enough to return the string representation of the values under :tree/tree and :tree/binarized-tree? If so, adding another instance? case to recur-datafy could do the job.
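
The extra case suggested here could look roughly like the following. It is only a sketch under stated assumptions: `recur-datafy-with-leaves`, `FakeTree` and `tree?` are hypothetical names that are not part of datalinguist, and `FakeTree` is a stand-in so the example runs without CoreNLP. With CoreNLP on the classpath, the predicate would presumably be something like `#(instance? edu.stanford.nlp.trees.Tree %)`.

```clojure
(require '[clojure.datafy :refer [datafy]])

(defn recur-datafy-with-leaves
  "Recursively datafy `x`, except that any value for which `(leaf? v)` is
  truthy is replaced by `(str v)` and not descended into."
  [leaf? x]
  (let [x*   (datafy x)
        step (partial recur-datafy-with-leaves leaf?)]
    (cond
      ;; The additional case: stop here and keep only the string form.
      (leaf? x*) (str x*)

      (map? x*) (into {} (map (fn [[k v]] [(step k) (step v)])) x*)
      (set? x*) (into #{} (map step) x*)

      ;; Other collections, including Java ones.
      (instance? Iterable x*) (mapv step x*)

      :else x*)))

;; Hypothetical stand-in for a tree node: iterating it yields the node
;; itself, so descending into it would recurse forever.
(deftype FakeTree [label]
  Iterable
  (iterator [this] (.iterator ^Iterable (list this)))
  Object
  (toString [_] label))

(def tree? #(instance? FakeTree %))

(recur-datafy-with-leaves
  tree?
  {:text      "Paula gave me 10 dollars."
   :tree/tree (->FakeTree "(ROOT (NP (NNP Paula)))")})
```

The leaf check comes before the Iterable case on purpose: once a self-listing object reaches `mapv`, the recursion has already started.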

simongray commented 1 year ago

Thank you, @ag91. I must admit that I haven't been actively developing this wrapper for a while now, so these longstanding issues persist.

Are you using it for a project? Or just dabbling?

ag91 commented 1 year ago

Oh, I was just dabbling with NLP really and I thought to try CoreNLP with Clojure. I like your library, it is making my exploration super easy: thank you for sharing it!

It is fine to leave it if I am the only user: I just wanted to help other users and you, if you ever want to investigate this further ;) (I can also open a PR if you have time and wish to save yourself some work. I am also fine with my personal fix.)