snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Error: Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLPServer #553

Closed: Tsmith5151 closed this issue 7 years ago

Tsmith5151 commented 7 years ago

@henryre -- I'm encountering the error shown below; the failure to connect to CoreNLP appears to come from running Snorkel on a distributed cluster. For preprocessing/tokenizing/tagging a corpus, would NLTK be a reasonable workaround here? A sketch of the kind of preprocessing I have in mind follows the error output.

WARNING:requests.packages.urllib3.connectionpool:Retrying (Retry(total=None, connect=19, read=0, redirect=None)) after connection broken
Error: Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLPServer 
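
Something along these lines with NLTK (a rough sketch only; the token/POS output would still need to be mapped into the sentence schema Snorkel's CoreNLP parser produces):

    import nltk

    # One-time model downloads for the sentence splitter and POS tagger
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    text = "Weak supervision lets us label training data programmatically."
    for sent in nltk.sent_tokenize(text):    # sentence splitting
        tokens = nltk.word_tokenize(sent)    # tokenization
        tagged = nltk.pos_tag(tokens)        # POS tagging
        print(tagged)
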
henryre commented 7 years ago

Hi @Tsmith5151. Here are a couple of suggestions:

Tsmith5151 commented 7 years ago

@henryre thanks for the feedback. One other question related to this: I have a JSON file that has already been annotated with CoreNLP (tokenize/ssplit/pos/lemma/depparse/ner). Is there a way to import this file directly into the sqlite.db through Snorkel while keeping the same db schema, or will this need to be replicated?

henryre commented 7 years ago

Hi @Tsmith5151. The Snorkel parser loads responses from the CoreNLP server in JSON format here. You can modify the parse method to take the file contents as content rather than requesting it from the server.
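
For example, something along these lines (a rough sketch; "annotated_doc.json" is a placeholder, the field names follow the CoreNLP server's JSON output, and the hand-off into the parse method is left schematic):

    import json

    def load_corenlp_json(path):
        # Read a document that was already annotated with CoreNLP
        # (tokenize/ssplit/pos/lemma/depparse/ner) and saved as JSON.
        with open(path) as f:
            return json.load(f)

    # The dict has the same shape as a live server response, so it can be
    # handed to the parser's JSON-processing code in place of the HTTP result.
    blocks = load_corenlp_json("annotated_doc.json")
    for sentence in blocks["sentences"]:
        words  = [tok["word"]  for tok in sentence["tokens"]]
        pos    = [tok["pos"]   for tok in sentence["tokens"]]
        lemmas = [tok["lemma"] for tok in sentence["tokens"]]
        # ...build the same per-sentence rows the parser normally yields,
        # so the sqlite schema stays unchanged.
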

Tsmith5151 commented 7 years ago

Hi @henryre, thanks -- that worked perfectly! One other quick question: I'm running into the following error when training the generative model with the labeling functions and estimating their accuracies. It occurs when NumbSkull is called:

RuntimeError: cannot cache function 'gibbsthread': no locator available for file '/anaconda3/lib/python3.5/site-packages/numbskull-0.0-py3.5.egg/numbskull/inference.py'

Any suggestions?

henryre commented 7 years ago

Hey @Tsmith5151, this is probably a Python 2/3 compatibility issue related to Numba. If you're able to run your pipeline using Python 2, I'd give that a try.
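
If it helps, a quick interpreter check of this kind (a minimal sketch; the Python 2 requirement only reflects the suggestion above, not a hard Snorkel constraint) makes the mismatch obvious before training starts:

    import sys

    # Fail fast if the pipeline is not running under Python 2, since the
    # Numba caching error above appeared on a Python 3.5 install.
    if sys.version_info[0] != 2:
        raise RuntimeError("Run this pipeline under Python 2.x (seen failing on 3.5).")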