Extracting additional information

openeventdata / phoenix_pipeline

Turning news into events since 2014.

MIT License


Extracting additional information #101

Closed by betatim 7 years ago

betatim commented 7 years ago

🎉 This looks like a very cool project! I found it via the discussion in JOSS.

(This is more a question than a bug report; if there is a better place for this than here, please point me there so we can move.)

You have a very nice setup for recognising locations and geocoding them. How feasible is it to add extracting, say, people or other "entities" and coding them? If I wanted to start a project doing that, should I build on phoenix_pipeline or do my own thing? What I want to do is process news articles to extract locations, times, persons and "actions" from them, trying to answer the question of "Who is doing what to whom, where and when?"

philip-schrodt commented 7 years ago

The Stanford CoreNLP system, which the PETRARCH programs use as a pre-processor, routinely does some named-entity recognition (NER) and differentiates people, locations and organizations (probably some other things as well). You'd have to do a bit of additional processing to get these out of the standard CoreNLP output (though you will be doing so anyway to get the material into PETRARCH's format) but certainly CoreNLP would have done most of the work for you. I'm pretty sure they do some time markup as well.

The Stanford group (http://nlp.stanford.edu/software/) has a couple of additional NER systems, so you might look at these as well: I haven't used them but it looks like they are more customizable.
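As a rough illustration of the "bit of additional processing" mentioned above, here is a minimal Python sketch that pulls entity mentions out of CoreNLP output. It assumes CoreNLP was run with the `ner` annotator and JSON output, so that each sentence carries a `tokens` list whose items have `word` and `ner` fields. The function name `extract_entities`, the `keep` parameter and the `sample` document are made up for this example and are not part of phoenix_pipeline or PETRARCH, and the exact tag names depend on the CoreNLP version and models in use.

```python
from itertools import groupby


def extract_entities(corenlp_doc, keep=("PERSON", "LOCATION", "ORGANIZATION", "DATE", "TIME")):
    """Group consecutive tokens that share an NER tag into (tag, text) mentions.

    `corenlp_doc` is assumed to be the parsed JSON that CoreNLP emits when the
    `ner` annotator is enabled. Tokens without a `ner` field are treated as "O".
    """
    entities = []
    for sentence in corenlp_doc.get("sentences", []):
        tokens = sentence.get("tokens", [])
        # groupby yields one run per stretch of adjacent tokens with the same tag
        for tag, run in groupby(tokens, key=lambda t: t.get("ner", "O")):
            if tag in keep:
                entities.append((tag, " ".join(t["word"] for t in run)))
    return entities


# Hand-written sample in the assumed shape (not real CoreNLP output).
sample = {
    "sentences": [
        {
            "tokens": [
                {"word": "Angela", "ner": "PERSON"},
                {"word": "Merkel", "ner": "PERSON"},
                {"word": "visited", "ner": "O"},
                {"word": "Paris", "ner": "LOCATION"},
                {"word": "on", "ner": "O"},
                {"word": "Monday", "ner": "DATE"},
                {"word": ".", "ner": "O"},
            ]
        }
    ]
}

print(extract_entities(sample))
```

One limitation of grouping by adjacent tags: two different entities of the same type that sit directly next to each other would be merged into a single mention, so a real project would probably want something more careful.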

ahalterman commented 7 years ago

Just to chime in, you're in luck, because the phoenix_pipeline was designed to do exactly that task of extracting "who did what to whom". The Mordecai/geolocation step was a later add-on. The pipeline assumes that what comes in is the output of Stanford's CoreNLP (we do the NLP task here). From that, we use Petrarch2 to extract and code the actors and events. The website for the data we make is here.

betatim commented 7 years ago

Thanks! So if I wanted to fiddle with the NLP step or roll my own instead of CoreNLP, I'd replace that part and then put my own coding in petrarch2?

(repeating back at you what you said to check I got it :) )