I have seen something similar before. Is this in a multithreaded environment? What semgrex operation did you run?
On Fri, Aug 12, 2022, 2:23 AM, Miguel Carmona wrote:
CoreNLP version 4.5.0, using pos, lemma, and depparse. I run the pipeline within Spark (Scala). I lazily initialise the CoreNLP pipeline and broadcast it to each executor, using lazy instantiation wrapped in a case object. I also force the text fragment not to be split, as it is already intended to be a single sentence. The objective is to do dependency analysis on the sentence and run some semgraph rules against it. We got a case where it throws an exception like this:
Caused by: edu.stanford.nlp.semgraph.UnknownVertexException: Operation attempted on unknown vertex happens/VBZ'''' in graph -> observed/VBD (root)
-> 24/CD (nsubj)
-> response/NN (nmod:in) -> In/IN (case) -> CoV/NNP (nmod:to) -> to/IN (case) -> SARS/NNP (compound) -> ‐/SYM (dep) -> ‐/SYM (dep) -> peptides/NNS (dep) -> 2/CD (nummod)
-> ,/, (punct)
-> we/PRP (nsubj)
-> unexpectedly/RB (advmod)
-> associated/VBN (ccomp)
-> that/IN (mark) -> sirolimus/NN (nsubj:pass) -> was/VBD (aux:pass) -> significantly/RB (advmod) -> release/NN (obl:with) -> with/IN (case) -> a/DT (det) -> proinflammatory/JJ (amod) -> cytokine/NN (compound) -> levels/NNS (nmod:including) -> including/VBG (case) -> higher/JJR (amod) -> α/NN (nmod:of) -> of/IN (case) -> TNF/NN (compound) -> ‐/SYM (dep) -> IL/NN (conj:and) -> and/CC (cc) -> IL/NN (nmod:of) -> 1β/NN (nmod) -> ‐/SYM (dep)
-> ./. (punct)
at edu.stanford.nlp.semgraph.SemanticGraph.parentPairs(SemanticGraph.java:730)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT$1.advance(GraphRelation.java:325)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.initialize(GraphRelation.java:1103)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.<init>(GraphRelation.java:1084)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT$1.<init>(GraphRelation.java:310)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT.searchNodeIterator(GraphRelation.java:310)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChildIter(NodePattern.java:339)
at edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher.resetChildIter(SemgrexMatcher.java:80)
at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.resetChildIter(CoordinationPattern.java:168)
at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.resetChildIter(CoordinationPattern.java:168)
at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.resetChildIter(CoordinationPattern.java:168)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChild(NodePattern.java:363)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.goToNextNodeMatch(NodePattern.java:457)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.matches(NodePattern.java:574)
at edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher.find(SemgrexMatcher.java:193)
at az.bikg.nlp.etl.common.nlp.Pattern.go$3(Pattern.scala:200)
at az.bikg.nlp.etl.common.nlp.Pattern.$anonfun$findCauseEffectMatches$6(Pattern.scala:268)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at az.bikg.nlp.etl.common.nlp.Pattern.findCauseEffectMatches(Pattern.scala:266)
at az.bikg.nlp.etl.steps.ERs$.findRelations(ERs.scala:107)
at az.bikg.nlp.etl.steps.ERs$.findRelationsSpark(ERs.scala:229)
at az.bikg.nlp.etl.steps.ERs$.$anonfun$extractERs$1(ERs.scala:242)
... 28 more
Am I doing anything wrong that would cause this exception? It didn't happen with version 4.4.0.
Thanks for coming back to me so quickly. I create a lazily constructed pipeline instance per executor in the Spark environment: val pipeline = new StanfordCoreNLP(props). That basically means the object containing the creation instruction is serialised to each executor, and the instance is then created and maintained there; this is done because creating a CoreNLP pipeline is not cheap and it needs the models in memory, as you already know.
Each executor then uses that long-lived instance across the cores of its node. What do I do with that pipeline instance? I process a sentence this way:
val res = Try {
  // the fragment is already a single sentence, so take the first (only) CoreSentence
  val doc = pipeline.processToCoreDocument(sen)
  val sentence = doc.sentences().get(0)
  val semanticGraph = sentence.dependencyParse()
  val pattern = Pattern(semanticGraph)
  ...
My Pattern class then runs some precompiled patterns against the analysed dependency graph (a rough sketch follows below). In my mind, this type of concurrency error might come from a pipeline that is not fully immutable or thread-safe.
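For illustration only (the rule and node names here are made up, not my actual patterns), matching one precompiled semgrex pattern against such a graph looks roughly like this:

import edu.stanford.nlp.semgraph.SemanticGraph
import edu.stanford.nlp.semgraph.semgrex.SemgrexPattern

// compiled once, reused for many graphs
val rule: SemgrexPattern = SemgrexPattern.compile("{pos:/VB.*/}=trigger >nsubj {}=subject")

def matchTriggers(graph: SemanticGraph): List[String] = {
  val matcher = rule.matcher(graph)   // a fresh matcher per graph
  val hits = scala.collection.mutable.ListBuffer[String]()
  while (matcher.find())
    hits += matcher.getNode("trigger").word()
  hits.toList
}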
I absolutely agree, this is some kind of concurrency bug.
Am I understanding correctly that each executor has its own Pipeline, or are they sharing the Pipelines?
How about the Semgrex operations? Are those patterns precompiled per executor or shared between items?
I'm trying to figure out where to look for the error. Based on the stack trace, I expect it's either in the depparse or in the semgrex somewhere.
This is the full case object I use to serialise the pipeline creation to each executor. I assume each pipeline is immutable, as it is created in each executor and shared across the cores of that node.
import java.util.Properties

import edu.stanford.nlp.pipeline.StanfordCoreNLP

case object CoreNLPWrapper {

  def make: StanfordCoreNLP = {
    val defaultAnnotators: List[String] =
      List("tokenize", "ssplit", "pos", "lemma", "depparse", "natlog")

    val props = new Properties()
    props.setProperty("annotators", defaultAnnotators.mkString(","))
    // the input is always a single sentence, so disable sentence splitting
    props.setProperty("ssplit.isOneSentence", "true")

    new StanfordCoreNLP(props)
  }

  // created lazily, once per JVM (i.e. once per executor)
  lazy val parser: StanfordCoreNLP = make
}
So when I wrap this object within a broadcast and ask for parser for the first time, the pipeline is created in each executor. Then that variable lives for the whole life of the executor's job.
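A minimal sketch of how that is consumed (assuming a SparkSession in scope as spark and a Dataset[String] of sentences named sentences; both names are illustrative, and the broadcast wrapper is omitted here): referencing CoreNLPWrapper.parser inside the closure means each executor JVM builds its pipeline lazily on first use and reuses it for every partition it processes.

import spark.implicits._

val parsed = sentences.mapPartitions { iter =>
  val pipeline = CoreNLPWrapper.parser   // created once per executor JVM, then reused
  iter.map { sen =>
    val doc = pipeline.processToCoreDocument(sen)
    doc.sentences().get(0).dependencyParse().toString   // dependency graph of the single sentence
  }
}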
For the semgrex, I reuse the output produced by SemgrexBatchParser across all the searches in each of the Pattern objects I pasted before. Can I do that, or should I generate the batch parser output for each semgraph?
I don't know enough about how broadcast works to know whether that is a new object in each executor or the same one. Honestly, I don't know that system at all. If there's some way to get something I can run which will cause this issue, that would be great.
It's really weird that it didn't happen with 4.4.0 - nothing changed in the dependency parser or in the semgrex which would change that behavior. The only thing I can think of which would affect things downstream would be the tokenizer or lemmatizer changes causing differences in the annotations you're getting back.
One possibility that might help find the error would be to send you a version with more logging to show when & where it crashes, although to be honest right now I don't even know what's causing the problem. Is it possible for you to use a new jar file if we send you one?
Another thing we could do to try to isolate it is turn off natlog for now. If you can recreate the exception without that annotator, that would be one less place to search.
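That would just mean dropping natlog from the annotator list in the CoreNLPWrapper.make above, e.g.:

// same pipeline setup as above, with natlog removed from the annotator list
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse")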
I wonder if this has something to do with topologicalSortCache in SemgrexMatcher. I don't remember that feature from way back when I first made these things thread-safe, and it looks like it could easily result in an inconsistent state between graphs.
An easy way to test would be to remove it and send you a jar. Is that something you can try?
(Note: that previous one apparently only applies if you are using aligned graphs. I don't know if any of the graphs used in the system do, though)
The other instance where this happened recently was on 4.4.0, and used the annotators tokenize,cleanxml,ssplit,pos,lemma,parse, so I do not believe the 4.4.0 -> 4.5.0 differences or the natlog annotator are to blame. Although, weirdly, they were not using Semgrex themselves, afaik.
@d0ngw
If I understand the stack trace in this version of the problem, it is here where you are calling our stuff:
at edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher.find(SemgrexMatcher.java:193)
at az.bikg.nlp.etl.common.nlp.Pattern.go$3(Pattern.scala:200)
So this appears to be after CoreNLP has processed the sentence, at the time of using semgrex. Does that sound correct?
https://nlp.stanford.edu/software/stanford-corenlp-4.5.0b.zip
I think I fixed this bug... would you give it a try if possible?
Thank you very much. I will also review my side of the code.
@mkarmona have you had a chance to try out the updated package? If it works, we'll go ahead with a bugfix release sometime in the near future.
I am on holiday and moving slowly. I will give it a go when I have the chance. Sorry.
No worries. I also thought of a test to verify that this was the problem, but it'd be a little annoying to implement, so I was hoping you'd just do it for us :)
@AngledLuffa, I have done some work on my side. Both versions, 4.4 and 4.5, suffer the same concurrency problem. Moving the SemgrexPattern compilation into the lazy object instantiation in each executor did the trick on my side, so I can safely go through tens of millions of documents (still running) with Spark and CoreNLP happily again. It was also partly my fault; I shouldn't have been sharing the compiled patterns that way and assuming they were thread-safe.
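A rough sketch of what I mean (illustrative rule strings, not my real ones): the patterns are now compiled inside a per-executor lazy singleton, just like the pipeline, instead of being compiled once and shared.

import edu.stanford.nlp.semgraph.semgrex.SemgrexPattern

case object SemgrexWrapper {
  // compiled lazily, once per executor JVM, on first access
  lazy val patterns: List[SemgrexPattern] = List(
    "{pos:VBD}=trigger >nsubj {}=cause",   // hypothetical rule
    "{pos:VBN}=trigger >obl {}=effect"     // hypothetical rule
  ).map(r => SemgrexPattern.compile(r))
}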
I am afraid I couldn't test your version. The main reason: if you publish a minor version on Maven, I can commit a minor internal release; it will be scheduled in next month's run.
I can't tell if that evidence helps or hurts my theory that it's the attempted cache of the semgrex graphs causing the problems. It certainly should be thread-safe, and we'll work to make sure it is thread-safe again.
We should be able to get a minor version on Maven in another week or so, if that works with your timeline for "next month's run". There are a couple of small tokenizer problems we're going to fix first as well.
Version 4.5.1 is now on Maven. @mkarmona, would you let us know if the crashes go away?
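For reference, assuming an sbt build, the coordinates would be (pulling in the models artifact via the models classifier, if you don't already get it some other way):

libraryDependencies ++= Seq(
  "edu.stanford.nlp" % "stanford-corenlp" % "4.5.1",
  "edu.stanford.nlp" % "stanford-corenlp" % "4.5.1" classifier "models"
)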
Sure, as soon as I am back to business.
Any luck with the crashes? Hoping they went away with the new version
It works in a run that was failing before. Whether everything is fully addressed, I cannot say for certain.
Awesome, glad to hear it. I will consider the matter closed unless we hear otherwise