openeventdata / UniversalPetrarch

Language-agnostic political event coding using universal dependencies
MIT License

Error when running preprocess_doc.py #7

Open ahalterman opened 6 years ago

ahalterman commented 6 years ago

I'm getting an error when I try to run the built in English demo. I've downloaded CoreNLP, UDPipe, and the models, but I'm hitting an error in the Python code that runs right after CoreNLP.

Does the demo not work with the built in GigaWord.sample.PETR.xml file?

Here's the error:

ahalterman:preprocessing$ bash run_document.sh GigaWord.sample.PETR.xml
Call Stanford CoreNLP to do sentence splitting...
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator cleanxml
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit

Processing file /Users/ahalterman/MIT/NSF_RIDIR/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml ... writing to /Users/ahalterman/MIT/NSF_RIDIR/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml.out
Annotating file /Users/ahalterman/MIT/NSF_RIDIR/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml ... done [0.3 sec].

Annotation pipeline timing information:
TokenizerAnnotator: 0.2 sec.
CleanXmlAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
TOTAL: 0.3 sec. for 12272 tokens at 38470.2 tokens/sec.
Pipeline setup: 0.1 sec.
Total time for StanfordCoreNLP pipeline: 0.8 sec.
Generate sentence xml file...
Traceback (most recent call last):
  File "preprocess_doc.py", line 161, in <module>
    read_doc_input(inputxml,inputparsed,outputfile)
  File "preprocess_doc.py", line 96, in read_doc_input
    doc = doctexts[0]
IndexError: list index out of range
JingL1014 commented 6 years ago

Hi Andy,

The error is caused by Stanford CoreNLP processing both the "text" and "parse" elements in GigaWord.sample.PETR.xml. I have fixed the error. Please try run_sentence.sh GigaWord.sample.PETR.xml again.
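As a side note, the `IndexError` itself comes from indexing an empty list (`doc = doctexts[0]`). A defensive check would turn the crash into a readable message. This is only a sketch: `first_doc_text` and its regex are hypothetical stand-ins for whatever logic builds `doctexts` in preprocess_doc.py, not the actual source.

```python
import re

def first_doc_text(raw):
    """Return the content of the first <text> element, or None with a warning.

    Hypothetical stand-in for the logic behind `doctexts` in preprocess_doc.py.
    """
    doctexts = re.findall(r"<text>(.*?)</text>", raw, re.DOTALL)
    if not doctexts:
        print("warning: no <text> element found in input document")
        return None
    return doctexts[0]

print(first_doc_text("<doc><text>Hello world.</text></doc>"))  # Hello world.
print(first_doc_text("<doc><parse>(ROOT ...)</parse></doc>"))  # warning, then None
```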

ahalterman commented 6 years ago

Thanks, @JingL1014! That fixed that problem, but now I'm hitting another. The initial Python and the CoreNLP are both working, but I get this error when I try to load the UDPipe model:

Call udpipe to do pos tagging and dependency parsing...
Loading UDPipe model: Cannot load UDPipe model '/Users/ahalterman/MIT/NSF_RIDIR/udpipe-1.0.0-bin/models/english-ud-2.0-170801.udpipe'!

I've double checked the path so I think it's a problem elsewhere.

I also ran into another issue that was fixed by specifying Python 2 in run_sentence.sh:

ahalterman:preprocessing$ bash run_sentence.sh GigaWord.sample.PETR.xml
Prepare file for stanford CoreNLP
Traceback (most recent call last):
  File "preprocess_sent.py", line 67, in <module>
    main()
  File "preprocess_sent.py", line 63, in main
    read_sentence_input(inputxml,outputfile)
  File "preprocess_sent.py", line 56, in read_sentence_input
    ofile.write(line.encode('utf-8')+"\n")
TypeError: can't concat bytes to str

When I change the script to call python2, this error goes away.
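For what it's worth, the failing line (`ofile.write(line.encode('utf-8')+"\n")`) can be made to work under both Python 2 and 3 by opening the file in text mode with an explicit encoding, so no manual `.encode()` is needed. A minimal sketch, assuming `ofile` is opened nearby in preprocess_sent.py (the filename here is just an example):

```python
import io

# io.open(..., 'w', encoding='utf-8') gives a text-mode file object on both
# Python 2 and Python 3, so unicode strings can be written directly.
with io.open("sentences.txt", "w", encoding="utf-8") as ofile:
    line = u"Example sentence"
    ofile.write(line + u"\n")  # no manual .encode('utf-8') needed
```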

JingL1014 commented 6 years ago

Hi Andy,

I think it is because of a mismatch between the UDPipe version and the language model version. I am using udpipe-1.0.0 with a UD 1.2 language model (http://ufal.mff.cuni.cz/udpipe/users-manual#universal_dependencies_12_models), and this works properly. To use a UD 2.0 language model, I think you have to use udpipe-1.2.0.

ahalterman commented 6 years ago

With the correct model, it ran just fine. Thanks!

khaledJabr commented 6 years ago

Hey, I am having a similar issue when I try to run ./run_document.sh Sample_english_doc.xml. The error output is similar to the one Andy showed; here it is:

Call Stanford CoreNLP to do sentence splitting...
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator cleanxml
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit

Processing file /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/Sample_english_doc.xml ... writing to /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/Sample_english_doc.xml.out
Annotating file /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/Sample_english_doc.xml ... done [0.1 sec].

Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
CleanXmlAnnotator: 0.0 sec.
WordsToSentencesAnnotator: 0.0 sec.
TOTAL: 0.1 sec. for 219 tokens at 2517.2 tokens/sec.
Pipeline setup: 0.1 sec.
Total time for StanfordCoreNLP pipeline: 0.3 sec.
Generate sentence xml file...
Traceback (most recent call last):
  File "preprocess_doc.py", line 161, in <module>
    read_doc_input(inputxml,inputparsed,outputfile)
  File "preprocess_doc.py", line 113, in read_doc_input
    doc = doctexts[0]
IndexError: list index out of range
Call udpipe to do pos tagging and dependency parsing...
readline() on closed filehandle DOC at /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/scripts/create_conll_corpus_from_text.pl line 6.
Loading UDPipe model: done.
Ouput parsed xml file...
Traceback (most recent call last):
  File "generateParsedFile.py", line 47, in <module>
    update_xml_input(inputFile,parsedFile,outputFile)
  File "generateParsedFile.py", line 15, in update_xml_input
    xml_file = io.open(inputfile,'rb')
IOError: [Errno 2] No such file or directory: '/Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/Sample_english_doc.xml-sent.xml'
rm: /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/Sample_english_doc.xml-sent.txt: No such file or directory

I have also tried ./run_sentence.sh GigaWord.sample.PETR.xml, and it gives me a different error:

Prepare file for stanford CoreNLP
Call Stanford CoreNLP to do tokenization...
property file path: 

Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: argsToProperties could not read properties file: true
    at edu.stanford.nlp.util.StringUtils.argsToProperties(StringUtils.java:1011)
    at edu.stanford.nlp.util.StringUtils.argsToProperties(StringUtils.java:927)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1416)
Caused by: java.io.IOException: Unable to open "true" as class path, filename or URL
    at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:481)
    at edu.stanford.nlp.io.IOUtils.readerFromString(IOUtils.java:618)
    at edu.stanford.nlp.util.StringUtils.argsToProperties(StringUtils.java:1002)
    ... 2 more
Generate sentence xml file...
Traceback (most recent call last):
  File "preprocess.py", line 140, in <module>
    read_doc_input(inputxml,inputparsed,outputfile)
  File "preprocess.py", line 61, in read_doc_input
    parsed = io.open(inputparsed,'r',encoding='utf-8')
IOError: [Errno 2] No such file or directory: '/Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml.raw.txt.out'
readline() on closed filehandle DOC at /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/scripts/create_conll_corpus_from_text.pl line 6.
rm: /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml.raw.txt.out: No such file or directory
rm: /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml.txt: No such file or directory
Call udpipe to do pos tagging and dependency parsing...
Udpipe model path: 

Loading UDPipe model: Cannot load UDPipe model '/Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml.conll'!
Ouput parsed xml file...
Traceback (most recent call last):
  File "generateParsedFile.py", line 47, in <module>
    update_xml_input(inputFile,parsedFile,outputFile)
  File "generateParsedFile.py", line 9, in update_xml_input
    pfile = io.open(parsedfile,'r',encoding='utf-8')
IOError: [Errno 2] No such file or directory: '/Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml.conll.predpos.pred'
rm: /Users/kld/Documents/workspace/EventData/UniversalPetrarch/UniversalPetrarch/data/text/GigaWord.sample.PETR.xml.conll.predpos.pred: No such file or directory

Any ideas on what could be the problem?

Thanks

ahalterman commented 6 years ago

@JingL1014 This looks like the error I got before you updated the code. Any idea what's going on?

JingL1014 commented 6 years ago

@khaledJabr For the second problem, you missed an argument: you have to specify the language of the input file (EN, ES, or AR). Please run ./run_sentence.sh GigaWord.sample.PETR.xml EN
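A guard at the top of run_sentence.sh would make this failure mode obvious instead of producing the cascade of CoreNLP and UDPipe errors above. This is a hypothetical addition, shown as a function for illustration; the script's actual variable names may differ.

```shell
# Sketch of an argument check for run_sentence.sh / run_document.sh:
check_args() {
  if [ $# -lt 2 ]; then
    echo "Usage: run_sentence.sh <input.xml> <EN|ES|AR>" >&2
    return 1
  fi
  return 0
}

check_args GigaWord.sample.PETR.xml EN && echo "args ok"
```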

JingL1014 commented 6 years ago

@khaledJabr For the first problem, I can run the code on my machine without any error. May I know which version of Stanford CoreNLP you are using?

khaledJabr commented 6 years ago

@JingL1014
I need to clarify one thing: there are two text folders in the repo, one in /UniversalPetrarch/preprocessing and one in /UniversalPetrarch/data/text. When I configured run_sentence.sh and run_document.sh, I set FILEPATH and FILE to /UniversalPetrarch/data/text, which is where Sample_english_doc.xml lives.

I am still getting the same error when I run ./run_document.sh Sample_english_doc.xml.

However, after running ./run_sentence.sh GigaWord.sample.PETR.xml EN, I found out that I didn't have the UDPipe models installed correctly. I fixed that and it works correctly now.

I am using the latest version of CoreNLP, 3.9.0.

JingL1014 commented 6 years ago

@khaledJabr I updated run_document.sh; this script now also requires an input argument to specify the language. Please try ./run_document.sh Sample_english_doc.xml EN again. I was not able to download CoreNLP 3.9.0, but I tested on CoreNLP 3.8.0 and it runs correctly. If it is still not working on CoreNLP 3.9.0, could you comment out lines 77-80 and send me the files generated in the intermediate steps?