Open legolego opened 4 years ago
Again, thank you for the very clear explanation. It looks like semgrex does not honor the timeout command-line parameter. I'm not entirely sure why; hopefully we can get a reason for this, but if nothing else, you can edit it yourself.
In StanfordCoreNLPServer.java, look for this line:

    int semgrexTimeOut = (lastPipeline.get() == null) ? 75 : 5;

I think that changing those values will change this result.
On Wed, Nov 27, 2019 at 11:41 AM legolego notifications@github.com wrote:
Describe the bug
Timeout when executing Semgrex query happens on long strings. This happens with the stanfordnlp library, but not with the deprecated python-stanford-corenlp library.
To Reproduce
Starting the server with:

    java -Xms2048m -Xmx18024m -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 360000 -inputFormat 'text' -outputFormat 'json' -be_quiet false -serializer 'edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer' -tokenize.options 'ptb3Escaping=false,invertible=true' -tokenize.language 'en' -annotators "tokenize, ssplit, pos, lemma, ner, parse, depparse, coref"
Code to reproduce:
import time
from google.protobuf.pyext._message import SetAllowOversizeProtos
SetAllowOversizeProtos(True)

import corenlp  # use with SemgrexFragmentGood
from stanfordnlp.server import CoreNLPClient  # use with SemgrexFragmentBad

def assert_success(msg='assert OK'):
    print(msg)
    return True

def SemgrexFragmentGood(fragment):
    SemgrexRule = '{pos:/NN.*/}=element >det {word:/a|an/}=art'
    output = ''
    with corenlp.CoreNLPClient(start_server=False,
                               endpoint="http://localhost:9000",
                               annotators="tokenize ssplit parse".split()) as clientCoreNLP:
        clientCoreNLP.ensure_alive()
        output = clientCoreNLP.semgrex(fragment, pattern=SemgrexRule, filter=False)
    return output

def SemgrexFragmentBad(fragment):
    SemgrexRule = '{pos:/NN.*/}=element >det {word:/a|an/}=art'
    output = ''
    # 'tokenize ssplit pos lemma ner depparse'
    with CoreNLPClient(start_server=False,
                       endpoint="http://localhost:9000",
                       timeout=240000,
                       annotators='tokenize ssplit parse'.split()) as clientCoreNLP:
        clientCoreNLP.ensure_alive()
        assert clientCoreNLP.is_active and assert_success("SemgrexFragmentBad success 1")
        assert clientCoreNLP.is_alive() and assert_success("SemgrexFragmentBad success 2")
        ann = clientCoreNLP.annotate(fragment)
        TEXT = "Chris wrote a simple sentence that he parsed with Stanford CoreNLP.\n"
        #fragment = TEXT
        #ann = clientCoreNLP.annotate(fragment)
        #print(ann)
        output = clientCoreNLP.semgrex(fragment, pattern=SemgrexRule, filter=False)
    return output
start = time.time()
sentText = ''' In combination, self-contained scraper apparatus for quick attachment to, and detachment from, the rear of a light truck vehicle or the like, and attachment means suspended from the frame of said vehicle, the entirety of said attachment means being positioned at a level substantially at or below the bumper level of said vehicle characterized by said scraper apparatus comprising a securement frame having a pair of laterally spaced forwardly projecting telescoping frame members, said attachment means including an attachment frame mounted to the chassis of said vehicle at the rear thereof and having a pair of socket-defining recesses at opposite sides of said vehicle for telescopingly receiving corresponding ones of said telescoping frame members, means for securing said telescoping frame members in the respective socket-defining members, said scraper apparatus being thereby attachable to and carryable by said vehicle solely by telescoping frame members being received in said socket-defining members, a scraper blade, a blade-carrying frame, means for pivotally connecting said scraper blade to said blade-carrying frame, means providing pivotal interengagement of said blade-carrying frame to said securement frame to permit rotation of said blade-carrying frame about a transverse pivot axis, an electric winch carried by said securement frame, a winch cable extending from said electric winch to a transverse pivot axis-remote location on said blade-carrying frame, said winch being energizable in response to voltage supplied by said vehicle, whereby said winch effects raising of said blade-carrying frame by rotation about said transverse pivot axis under operator remote control, said means for pivotally connecting said blade to said blade-carrying frame including a sleeve carried by said blade-carrying frame at a location remote from said transverse pivot axis, a shaft secured to said blade and rotatable in said sleeve, and means for locking said shaft 
relative to said sleeve for permitting a preselected angular relationship between said blade and the longitudinal axis of said vehicle, said blade being pivotable by rotation of said shaft between forward facing and rearward facing orientations, said blade-carrying frame when raised by said winch pivoting upon said transverse pivot axis to raise said blade out of contact with the surface upon which said vehicle stands, and to a position proximate the rear of said vehicle, and when lowered by said winch permitting said blade to contact said surface with said blade-carrying frame extending rearwardly from said securement frame and from said vehicle with said blade having either forwardly or rearwardly facing orientation, whereby said scraper apparatus may be either quickly attached to or detached from said attachment by telescoping movement of said telescoping frame means members relative to said socket-defining members. '''
TEXT = "Chris wrote a simple sentence that he parsed with Stanford CoreNLP.\n"
sentText = TEXT
sentText = sentText.replace('\n', '')
print(sentText)

output = SemgrexFragmentBad(sentText)  # for regular element
print(time.time() - start, " seconds")
print(output)
with this error:
Traceback (most recent call last):
  File "C:\gitProjects\patentmoto2\venv\lib\site-packages\stanfordnlp\server\client.py", line 522, in __regex
    r.raise_for_status()
  File "C:\gitProjects\patentmoto2\venv\lib\site-packages\requests\models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http://localhost:9000/semgrex?pattern=%7Bpos%3A%2FNN.%2A%2F%7D%3Delement+%3Edet+%7Bword%3A%2Fa%7Can%2F%7D%3Dart&filter=False&properties=%7B%27inputFormat%27%3A+%27text%27%2C+%27serializer%27%3A+%27edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer%27%2C+%27outputFormat%27%3A+%27json%27%7D
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:/gitProjects/patentmoto2/SemgrexTimeout.py", line 64, in <module>
    output = SemgrexFragmentBad(sentText)  # for regular element
  File "C:/gitProjects/patentmoto2/SemgrexTimeout.py", line 48, in SemgrexFragmentBad
    filter=False)
  File "C:\gitProjects\patentmoto2\venv\lib\site-packages\stanfordnlp\server\client.py", line 466, in semgrex
    matches = self.regex('/semgrex', text, pattern, filter, annotators, properties)
  File "C:\gitProjects\patentmoto2\venv\lib\site-packages\stanfordnlp\server\client.py", line 526, in regex
    raise TimeoutException(r.text)
stanfordnlp.server.client.TimeoutException: Timeout when executing Semgrex query

Process finished with exit code 1
Expected behavior
Commenting and uncommenting lines 5 and 6, and calling SemgrexFragmentBad or SemgrexFragmentGood as appropriate, will work or fail as described. stanfordnlp/python-stanford-corenlp#19 (https://github.com/stanfordnlp/python-stanford-corenlp/issues/19) was a similar issue for the old library, and it looks like a bit of it is left in the client.py mentioned in the error message (line 526). In client.py I tried changing line 520 to timeout=240000 and it still failed. Also, client.py has this at line 497:
    # HACK: For some stupid reason, CoreNLPServer will timeout if we
    # need to annotate something from scratch. So, we need to call
    # this to ensure that the _regex call doesn't timeout.
    self.annotate(text, properties=properties)
Environment:
- OS: Windows 10
- Python version: 3.7
- stanford-corenlp==3.9.2
- stanfordnlp==0.2.0
I checked in a change which should unify the timeout logic between the root handler and the rest of the server. It should be available in the next release.
Alright, thank you! :) Any idea if there will be a release before the end of the year? I don't know enough about Java to get it to compile myself.
Hopefully yes.
Loosely speaking the steps are:
install "ant"
unzip the jar with "sources" in the name
mkdir src
mv edu src
ant
that should work
now copy the updated StanfordCoreNLPServer.java to src/edu/stanford/nlp/pipeline
ant
hopefully that works too
now get rid of the corenlp jar and put "classes" in your classpath instead
If all of that works, great! If not, hopefully the next release will be in the next few weeks.
Hi, is CoreNLP designed for analyzing a large collection of articles? I ran into the same problem when applying it directly to my DB. I bypassed the issue by closing the client-server and opening a new one for each article... Is there a way to keep it running for hours? I am using stanford-corenlp-4.4.0 in English. Thanks in advance.
I bypassed the issue by closing the client-server and opening a new one for each article...
That sounds incredibly expensive.
What is the problem you are running into? It times out on a very long article? Can you tell us how long the article is?
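Rather than tearing the client down and recreating it per article, a cheaper pattern is to keep one client open and treat per-document timeouts as skippable failures. A minimal, generic sketch (the helper name and structure are hypothetical, not part of stanza; in practice you would wire `process` to `client.tregex`):

```python
def run_with_timeouts(items, process, is_timeout=lambda e: "Timeout" in str(e)):
    """Apply `process` to each item, collecting the indices of items whose
    processing raised a timeout-style error instead of aborting the run."""
    results, failed = {}, []
    for i, item in enumerate(items):
        try:
            results[i] = process(item)
        except Exception as e:
            if is_timeout(e):
                failed.append(i)  # skip this document, keep the client alive
            else:
                raise
    return results, failed

# Intended use (assumes a running server; stanza's TimeoutException
# message starts with "Timeout", as in the traceback below):
#   with CoreNLPClient(start_server=False, endpoint="http://localhost:9000",
#                      timeout=30000) as client:
#       results, failed = run_with_timeouts(
#           articles, lambda text: client.tregex(text, pattern))
```

The shared client is created once; only the articles that time out are lost, and `failed` records which ones to retry or inspect later.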
Hi AngledLuffa, I ran into the problem TimeoutException: Timeout when executing Tregex query when applying it to a dataset of articles, each of which contains about 5000-6000 tokens. The problem generally appears after a dozen articles. I am using stanza 1.3.0 and stanford-corenlp-4.4.0, and I call the server with CoreNLPClient(timeout=30000, memory='16G'). Thank you in advance for your help.
File ~/miniconda3/envs/Similarity/lib/python3.8/site-packages/stanza/server/client.py:568, in CoreNLPClient.tregex(self, text, pattern, filter, annotators, properties)
    567 def tregex(self, text, pattern, filter=False, annotators=None, properties=None):
--> 568     return self.__regex('/tregex', text, pattern, filter, annotators, properties)

File ~/miniconda3/envs/Similarity/lib/python3.8/site-packages/stanza/server/client.py:621, in CoreNLPClient.__regex(self, path, text, pattern, filter, annotators, properties)
    619 except requests.HTTPError as e:
    620     if r.text.startswith("Timeout"):
--> 621         raise TimeoutException(r.text)
    622     else:
    623         raise AnnotationException(r.text)
TimeoutException: Timeout when executing Tregex query
Okay, if you mean tregex and not semgrex, that's possibly different. By default the tregex endpoint uses the old chart parser, and the corenlp chart parser can blow up in terms of memory usage and time on long sentences.
[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.5 sec].
I don't know why it still uses that default, but you can change it to the SR parser by adding this to the CoreNLPClient you make:
properties={ 'parse.model': 'edu/stanford/nlp/models/srparser/englishSR.beam.ser.gz' },
You will need to download the english-extra models listed under Quickstart: https://stanfordnlp.github.io/CoreNLP/
Actually, are you even getting anything back? The tregex endpoint apparently doesn't even use the parse annotator by default, so I'm just getting blanks unless I also add this:
annotators="tokenize,ssplit,pos,parse"
BUT
I've been looking for someone interested in using tregex over constituency parses via Python. We have a new constituency parser in Stanza which is more accurate than the CoreNLP parser, after all, but there's no chance we'll reimplement all of tregex in Stanza, so using tregex on those trees would involve writing a Python module which connects to the Java version of tregex. Do you have any interest in beta testing such an interface?
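Putting the two settings from above together, a client configuration for the /tregex endpoint might look like the following sketch (assumes a server launched separately with the english-extra models on its classpath; the constant names are mine):

```python
# Settings discussed above, collected in one place.
TREGEX_ANNOTATORS = "tokenize,ssplit,pos,parse"
TREGEX_PROPERTIES = {
    # use the shift-reduce parser instead of the default chart parser,
    # which can blow up in time and memory on very long sentences
    "parse.model": "edu/stanford/nlp/models/srparser/englishSR.beam.ser.gz",
}

# Typical use (requires stanza and a running CoreNLP server):
#   from stanza.server import CoreNLPClient
#   with CoreNLPClient(start_server=False, endpoint="http://localhost:9000",
#                      timeout=30000, annotators=TREGEX_ANNOTATORS,
#                      properties=TREGEX_PROPERTIES) as client:
#       matches = client.tregex(text, "NP < NN",
#                               annotators=TREGEX_ANNOTATORS)
```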
I added annotators="tokenize,ssplit,pos,parse" so I can get noun phrases. I experimented with the new constituency parser in Stanza as a substitute for CoreNLPClient, and it solved my problem. Thanks again AngledLuffa.
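For the noun-phrase use case specifically, once you have a bracketed tree string (for example str(sentence.constituency) from Stanza's constituency processor), pulling out NP spans is a small tree walk and needs no tregex at all. A sketch with hypothetical helper names:

```python
import re

def parse_tree(s):
    """Parse a bracketed constituency tree like "(ROOT (S (NP ...)))"
    into nested (label, children) tuples; leaf tokens stay strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0
    def walk():
        nonlocal pos
        pos += 1                       # consume "("
        label = tokens[pos]; pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(walk())
            else:
                children.append(tokens[pos]); pos += 1
        pos += 1                       # consume ")"
        return (label, children)
    return walk()

def leaves(node):
    """Flatten a subtree back into its token list."""
    if isinstance(node, str):
        return [node]
    return [tok for child in node[1] for tok in leaves(child)]

def noun_phrases(tree_str):
    """Return the token span of every NP-labelled subtree."""
    spans = []
    def visit(node):
        if isinstance(node, tuple):
            if node[0] == "NP":
                spans.append(" ".join(leaves(node)))
            for child in node[1]:
                visit(child)
    visit(parse_tree(tree_str))
    return spans
```

For example, noun_phrases("(ROOT (S (NP (NNP Chris)) (VP (VBD wrote) (NP (DT a) (NN sentence)))))") yields ['Chris', 'a sentence'], one entry per NP node, including nested ones.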