Closed: ahalterman closed this issue 6 years ago.
A bigger issue is that I think everything is still Python 2, and it would probably make sense at this point to move to Python 3. That shouldn't be difficult, but it should probably be coordinated given all of the other work going on with UD-PETR at the moment. This also goes to the bigger question I raised last week about continued (or not) development of the English fork of this.
Part of the write-out code was not compatible with UniversalPetrarch. I fixed those errors for writing out events; please try again. I also updated the preprocessing code, so please rerun preprocessing as well.
The code ran just fine but the example Gigaword text doesn't contain any event-producing sentences to test the writing functionality. I'll see if I can find some that will produce events.
I can generate output as attached. Is it the expected output? events.txt
Hmm, now that I look at it, I think there was a problem in the parsing step. It looks like the parse output got cut off on all of them. Here's the first entry of GigaWord.sample.PETR_parsed.xml
<Sentences>
<Sentence date="20080804" id="AFP0808020625_4" sentence="True" source="AFP">
<Text>
The stopover came as the US leader prepared to attend the Beijing Olympics, an
event which will test his vow to keep politics out of the Games.
</Text>
<Parse>1 The the DET DT Definite=Def|PronType=Art 0 root _ _
</Parse></Sentence>
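One way to spot this kind of truncation automatically is to compare the number of CoNLL-U token lines in each `<Parse>` block against the number of words in the sentence's `<Text>`. The helper below is a sketch for illustration, not part of UniversalPetrarch; the function name `find_truncated_parses` is hypothetical.

```python
import xml.etree.ElementTree as ET

def find_truncated_parses(xml_string):
    """Return ids of <Sentence> elements whose parse has fewer tokens than words."""
    root = ET.fromstring(xml_string)
    suspect = []
    for sent in root.iter("Sentence"):
        words = (sent.findtext("Text") or "").split()
        # Each non-empty line of <Parse> should be one CoNLL-U token row.
        parse_lines = [l for l in (sent.findtext("Parse") or "").splitlines() if l.strip()]
        if len(parse_lines) < len(words):
            suspect.append(sent.get("id"))
    return suspect

# Minimal reproduction of the truncated entry above.
sample = """<Sentences>
<Sentence date="20080804" id="AFP0808020625_4" sentence="True" source="AFP">
<Text>The stopover came as the US leader prepared to attend the Beijing Olympics.</Text>
<Parse>1 The the DET DT Definite=Def|PronType=Art 0 root _ _
</Parse></Sentence>
</Sentences>"""

print(find_truncated_parses(sample))  # -> ['AFP0808020625_4']
```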
I updated the preprocessing code so that it runs under both Python 2 and Python 3. I have uploaded GigaWord.sample.PETR_parsed.xml as well. I also uploaded UDPipe models for three languages, and a segmenter for Arabic. Please try again. Right now the run_sentence.sh command takes two arguments, file name and language:
./run_sentence.sh GigaWord.sample.PETR.xml english
./run_sentence.sh Sample_arabic_sent.xml arabic
./run_sentence.sh Sample_spanish_sent.xml spanish
We should probably make the language arguments follow two- or three-letter ISO codes, so EN, ES, AR, etc.
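A small sketch of what that change could look like: mapping the full language names currently accepted by run_sentence.sh to two-letter ISO 639-1 codes. The mapping and the `normalize_lang` helper are illustrative, not taken from the repo.

```python
# Illustrative mapping from the current full-name arguments to ISO codes.
LANG_CODES = {"english": "EN", "spanish": "ES", "arabic": "AR"}

def normalize_lang(arg):
    """Accept either a full name or an ISO code and return the ISO code."""
    arg = arg.strip().lower()
    if arg.upper() in LANG_CODES.values():
        return arg.upper()          # already an ISO code, e.g. "AR"
    try:
        return LANG_CODES[arg]      # full name, e.g. "english"
    except KeyError:
        raise ValueError("unsupported language: %s" % arg)

print(normalize_lang("english"))  # -> EN
print(normalize_lang("AR"))       # -> AR
```

Accepting both forms would keep existing invocations like `./run_sentence.sh GigaWord.sample.PETR.xml english` working while the ISO codes are phased in.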
I've updated to the most recent code, but now I'm getting an error with CoreNLP. I've changed the header of run_document.sh to have all the same locations as the previous version, but now it can't load CoreNLP for some reason.
ahalterman:preprocessing$ ./run_sentence.sh GigaWord.sample.PETR.xml english
Prepare file for stanford CoreNLP
Call Stanford CoreNLP to do tokenization...
property file path:
config/StanfordCoreNLP-english.properties
Error: Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLP
In the most recent version of run_sentence.sh, only one location needs to be changed; the others are in the folder. Is STANFORD_CORENLP=/users/ljwinnie/toolbox/stanford-corenlp-full-2015-01-29 set properly?
I see. I was changing the ones in run_document.sh but didn't see that it also has to be changed in run_sentence.sh. I can get past the CoreNLP part, but now I'm hitting an error when it tries to write out:
WordsToSentencesAnnotator: 0.0 sec.
TOTAL: 0.2 sec. for 1411 tokens at 6950.7 tokens/sec.
Pipeline setup: 0.2 sec.
Total time for StanfordCoreNLP pipeline: 0.6 sec.
Generate sentence xml file...
Traceback (most recent call last):
File "preprocess.py", line 138, in <module>
read_doc_input(inputxml,inputparsed,outputfile)
File "preprocess.py", line 93, in read_doc_input
doc = doctexts[idx]
IndexError: list index out of range
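The failure mode here is that `read_doc_input` indexes into `doctexts` with an index derived from the parse output, and when the two don't line up the index runs off the end of the list. A minimal reproduction with an explicit guard is sketched below; the names `doctexts` and the `pair_docs` helper are illustrative (only `doctexts` and the IndexError come from the traceback), and this is not the fix that was actually committed.

```python
def pair_docs(doctexts, parsed_docs):
    """Pair raw documents with their parses, failing loudly on a count mismatch."""
    if len(parsed_docs) != len(doctexts):
        # With the guard, a format change in the parser surfaces as a clear
        # error instead of an IndexError deep inside read_doc_input.
        raise ValueError(
            "parse/text count mismatch: %d parses vs %d texts "
            "(check the CoreNLP output format)" % (len(parsed_docs), len(doctexts))
        )
    return list(zip(doctexts, parsed_docs))

texts = ["doc one", "doc two"]
parses = ["parse one"]  # a newer parser emitted a different number of entries
try:
    pair_docs(texts, parses)
except ValueError as e:
    print(e)
```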
I found that the problem is that the latest CoreNLP (version 2017-06-09) has a different output format than the older versions. I wrote and tested the code using version 2015-01-29. I just made a new commit, tested under both versions, and the code works now.
Also, run_document.sh and run_sentence.sh are independent of each other. The difference is that the input to run_document.sh is articles, and it does sentence splitting, tokenization, and parsing, while the input to run_sentence.sh is sentences, and it only does tokenization and parsing.
Thanks! I can now run CoreNLP. I had to change some of the hardcoded paths in run_sentence.sh to match my UDPipe model locations, but the preprocessing worked fine.
I'm now running into an issue with UniversalPetrarch itself:
...
petr_log.PETRgraph: DEBUG [[u'---COP'], ['---'], '010']
petr_log.PETRgraph: DEBUG ['---']
Traceback (most recent call last):
File "petrarch_ud.py", line 409, in <module>
main()
File "petrarch_ud.py", line 71, in main
run(paths, out , True) ## <===
File "petrarch_ud.py", line 405, in run
updated_events = do_coding(events)
File "petrarch_ud.py", line 318, in do_coding
coded_events = sentence.get_events()
File "/Users/ahalterman/MIT/NSF_RIDIR/UniversalPetrarch/UniversalPetrarch/PETRgraph.py", line 1583, in get_events
if self.events[eventID][2] not in ['---',None,'None'] and self.events[eventID][2] != PETRglobals.VerbDict['verbs'][triplet['triple'][2].head.upper()]['#']['#']['code']:
KeyError: u'ISSUED'
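The KeyError means the verb head (`ISSUED`) is missing from `PETRglobals.VerbDict`, so the chained lookup `VerbDict['verbs'][head]['#']['#']['code']` blows up. A defensive lookup along these lines would avoid the crash; this is a sketch, not the fix that was committed, and only the nested dictionary shape is taken from the traceback.

```python
def lookup_verb_code(verb_dict, head):
    """Return the root code for a verb head, or None if it is not in the dictionary."""
    entry = verb_dict.get("verbs", {}).get(head.upper())
    if entry is None:
        return None  # verb not in the dictionary; caller decides how to handle it
    # Mirrors the VerbDict['verbs'][head]['#']['#']['code'] access from the traceback.
    return entry.get("#", {}).get("#", {}).get("code")

# Toy dictionary in the same nested shape (contents are made up).
verb_dict = {"verbs": {"ATTACK": {"#": {"#": {"code": "190"}}}}}
print(lookup_verb_code(verb_dict, "attack"))  # -> 190
print(lookup_verb_code(verb_dict, "issued"))  # -> None
```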
I fixed the error and changed the language arguments of the preprocessing code to EN, AR, and ES, as Dr. Brandt suggested.
I've successfully run the pipeline from start to finish (in English) with no errors. Thanks for everything!
UniversalPetrarch is running fine for me on the built-in example, and I can see it writing out the coding as it goes, along with the summary stats (woo!). I can't get it to write out to a file, though. I uncommented the line that seems to be responsible for writing out (here), but when I uncomment it I get the following error:
This dictionary organization/structure has never been great, so we might want to think about making it nicer if the fix indeed involves digging into it.