openeventdata / UniversalPetrarch

Language-agnostic political event coding using universal dependencies
MIT License
18 stars 9 forks

Issue with writing events out #8

Closed. ahalterman closed this issue 6 years ago.

ahalterman commented 6 years ago

UniversalPetrarch is running fine for me on the built-in example, and I can see it writing out the coding as it goes, plus the summary stats (woo!). I can't get it to write out to a file, though. I uncommented the line that seems to be responsible for writing out (here), but when I do I get the following error:

Traceback (most recent call last):
  File "petrarch_ud.py", line 402, in <module>
    main()
  File "petrarch_ud.py", line 70, in main
    run(paths, out , True)  ## <===
  File "petrarch_ud.py", line 399, in run
    PETRwriter.write_events(updated_events, 'evts.' + out_file)
  File "/Users/ahalterman/MIT/NSF_RIDIR/UniversalPetrarch/UniversalPetrarch/PETRwriter.py", line 66, in write_events
    filtered_events = utilities.story_filter(story_dict, key)
  File "/Users/ahalterman/MIT/NSF_RIDIR/UniversalPetrarch/UniversalPetrarch/utilities.py", line 329, in story_filter
    if 'actortext' in sent_dict['meta'] and event_tuple[1:] in sent_dict['meta'][
KeyError: u'meta'

This dictionary organization/structure has never been great, so we might want to think about making it nicer if the fix indeed involves digging into it.
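
For what it's worth, a minimal sketch of the defensive access I have in mind (the helper name and the exact condition are my guesses, since the real check is deeper inside story_filter):

```python
# Hypothetical guard: treat a missing 'meta' key as an empty dict instead of
# letting story_filter raise KeyError. All names here are illustrative only.
def has_actortext(sent_dict, event_tuple):
    meta = sent_dict.get("meta", {})
    return "actortext" in meta and tuple(event_tuple[1:]) in meta["actortext"]

print(has_actortext({}, ("evt", "USA", "190")))                      # -> False
print(has_actortext({"meta": {"actortext": {("USA", "190"): []}}},
                    ("evt", "USA", "190")))                          # -> True
```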

philip-schrodt commented 6 years ago

The bigger issue is that I think everything is still Python 2, and it would probably make sense at this point to move to Python 3. That shouldn't be too difficult, but it probably should be coordinated, given all of the other work going on with UD-PETR at the moment. This also goes to the larger question I raised last week about continued (or not) development of the English fork of this.

JingL1014 commented 6 years ago

Part of the code for writing out was not compatible with UDPetrarch. I fixed those errors for writing out events; please try again. I also updated the preprocessing code, so please rerun preprocessing as well.

ahalterman commented 6 years ago

The code ran just fine but the example Gigaword text doesn't contain any event-producing sentences to test the writing functionality. I'll see if I can find some that will produce events.

JingL1014 commented 6 years ago

I can generate output, as attached. Is this the output you expected? events.txt

ahalterman commented 6 years ago

Hmm, now that I look at it, I think there was a problem in the parsing step. It looks like the parse output got cut off for all of them. Here's the first entry of GigaWord.sample.PETR_parsed.xml:

<Sentences>
<Sentence date="20080804" id="AFP0808020625_4" sentence="True" source="AFP">
<Text>
The stopover came as the US leader prepared to attend the Beijing Olympics, an 
event which will test his vow to keep politics out of the Games. 
</Text>
<Parse>1    The the DET DT  Definite=Def|PronType=Art   0   root    _   _
</Parse></Sentence>
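
A quick way to spot this kind of truncation (a sketch only; the helper name and the crude half-the-tokens threshold are mine, not part of UD-PETR):

```python
# Hypothetical truncation check: a Parse block whose CoNLL line count is far
# below the sentence's whitespace token count probably got cut off upstream.
import xml.etree.ElementTree as ET

def truncated_parses(xml_string):
    """Return ids of sentences whose parse looks incomplete."""
    bad = []
    for sent in ET.fromstring(xml_string).iter("Sentence"):
        words = len(sent.findtext("Text", "").split())
        parse_lines = [l for l in sent.findtext("Parse", "").splitlines()
                       if l.strip()]
        if words and len(parse_lines) < words // 2:  # crude threshold
            bad.append(sent.get("id"))
    return bad

sample = """<Sentences>
<Sentence date="20080804" id="AFP0808020625_4" sentence="True" source="AFP">
<Text>The stopover came as the US leader prepared to attend the Beijing
Olympics, an event which will test his vow to keep politics out of the
Games.</Text>
<Parse>1 The the DET DT Definite=Def|PronType=Art 0 root _ _
</Parse></Sentence>
</Sentences>"""

print(truncated_parses(sample))  # -> ['AFP0808020625_4']
```

Running something like this over the whole parsed file would show whether every entry is affected or just some.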
JingL1014 commented 6 years ago

I updated the preprocessing code so that it can run under both Python 2 and Python 3, and I have uploaded GigaWord.sample.PETR_parsed.xml as well. I also uploaded UDPipe models for three languages, plus a segmenter for Arabic. Please try again. run_sentence.sh now takes two arguments, the file name and the language:

./run_sentence.sh GigaWord.sample.PETR.xml english
./run_sentence.sh Sample_arabic_sent.xml arabic
./run_sentence.sh Sample_spanish_sent.xml spanish

PTB-OEDA commented 6 years ago

Probably should make the args for the languages follow the ISO two- or three-letter codes, so EN, ES, AR, etc.
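
Something along these lines, say, accepting either form (a sketch; the function name and mapping are illustrative, not existing UD-PETR code):

```python
# Hypothetical normalization from full language names to ISO 639-1 codes,
# so the scripts can accept either "english" or "EN".
LANG_CODES = {"english": "EN", "spanish": "ES", "arabic": "AR"}

def normalize_lang(arg):
    """Accept either a full language name or a two-letter code."""
    if arg.upper() in LANG_CODES.values():
        return arg.upper()
    return LANG_CODES[arg.lower()]

print(normalize_lang("english"))  # -> EN
print(normalize_lang("AR"))       # -> AR
```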


ahalterman commented 6 years ago

I've updated to the most recent code but now I'm getting an error with CoreNLP. I've changed the header of run_document.sh to have all the same locations as the previous version, but now it can't load CoreNLP for some reason.

ahalterman:preprocessing$ ./run_sentence.sh GigaWord.sample.PETR.xml english
Prepare file for stanford CoreNLP
Call Stanford CoreNLP to do tokenization...
property file path:
config/StanfordCoreNLP-english.properties
Error: Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLP
JingL1014 commented 6 years ago

In the most recent version of run_sentence.sh, only one location needs to be changed; the others are resolved inside the folder. Is STANFORD_CORENLP=/users/ljwinnie/toolbox/stanford-corenlp-full-2015-01-29 set properly?
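
i.e. edit that line at the top of run_sentence.sh to your own install (the path below is just an example, not a real location):

```shell
# Hypothetical: point run_sentence.sh at your local CoreNLP install
STANFORD_CORENLP="$HOME/toolbox/stanford-corenlp-full-2015-01-29"
echo "Using CoreNLP at: $STANFORD_CORENLP"
```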

ahalterman commented 6 years ago

I see. I was changing the paths in run_document.sh but didn't see that they also have to be changed in run_sentence.sh. I can get past the CoreNLP part, but now I'm hitting an error when it tries to write out:

WordsToSentencesAnnotator: 0.0 sec.
TOTAL: 0.2 sec. for 1411 tokens at 6950.7 tokens/sec.
Pipeline setup: 0.2 sec.
Total time for StanfordCoreNLP pipeline: 0.6 sec.
Generate sentence xml file...
Traceback (most recent call last):
  File "preprocess.py", line 138, in <module>
    read_doc_input(inputxml,inputparsed,outputfile)
  File "preprocess.py", line 93, in read_doc_input
    doc = doctexts[idx]
IndexError: list index out of range
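
A guard along these lines would at least surface the mismatch instead of an opaque IndexError (function and variable names are assumptions, not the real preprocess.py):

```python
# Hypothetical guard: if CoreNLP returned fewer documents than the input had,
# report the count mismatch up front instead of failing later on indexing.
def pair_docs(doctexts, parsed_docs):
    if len(doctexts) != len(parsed_docs):
        raise ValueError(
            "parsed %d documents but expected %d; the CoreNLP output "
            "format may have changed" % (len(parsed_docs), len(doctexts)))
    return list(zip(doctexts, parsed_docs))

print(pair_docs(["doc one"], ["1\tdoc\t..."]))  # one-to-one case works
```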
JingL1014 commented 6 years ago

I found that the problem is that the latest CoreNLP (version 2017-06-09) has a different output format than the older versions. I wrote and tested the code using version 2015-01-29. I just made a new commit, tested under both versions, and the code works now.

Also, run_document.sh and run_sentence.sh are independent of each other. The difference is that run_document.sh takes articles as input and does sentence splitting, tokenization, and parsing, while run_sentence.sh takes sentences as input and does only tokenization and parsing.

ahalterman commented 6 years ago

Thanks! I can now run CoreNLP. I had to change some of the hardcoded paths in run_sentence.sh to match my udpipe model locations, but the preprocessing worked fine.

I'm now running into an issue with UniversalPetrarch itself:

...
petr_log.PETRgraph: DEBUG    [[u'---COP'], ['---'], '010']
petr_log.PETRgraph: DEBUG    ['---']
Traceback (most recent call last):
  File "petrarch_ud.py", line 409, in <module>
    main()
  File "petrarch_ud.py", line 71, in main
    run(paths, out , True)  ## <===
  File "petrarch_ud.py", line 405, in run
    updated_events = do_coding(events)
  File "petrarch_ud.py", line 318, in do_coding
    coded_events = sentence.get_events()
  File "/Users/ahalterman/MIT/NSF_RIDIR/UniversalPetrarch/UniversalPetrarch/PETRgraph.py", line 1583, in get_events
    if self.events[eventID][2] not in ['---',None,'None'] and self.events[eventID][2] != PETRglobals.VerbDict['verbs'][triplet['triple'][2].head.upper()]['#']['#']['code']:
KeyError: u'ISSUED'
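
For reference, the guarded lookup I would expect here (a sketch only; the nested VerbDict shape is inferred from the traceback, and the helper name is mine):

```python
# Hypothetical defensive lookup: return None when the verb head is not in the
# dictionary, instead of raising KeyError as in the traceback above.
def verb_code(verb_dict, head):
    entry = verb_dict.get("verbs", {}).get(head.upper())
    if entry is None:
        return None
    return entry.get("#", {}).get("#", {}).get("code")

vd = {"verbs": {"ATTACK": {"#": {"#": {"code": "190"}}}}}
print(verb_code(vd, "attack"))  # -> 190
print(verb_code(vd, "issued"))  # -> None
```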
JingL1014 commented 6 years ago

I fixed the error, and the language args of the preprocessing code are now EN, AR, ES, as Dr. Brandt suggested.

ahalterman commented 6 years ago

I've successfully run the pipeline from start to finish (in English) with no errors. Thanks for everything!